Description
This is a large Named Entity Recognition (NER) model trained on a subset of generic CoNLL, financial and legal CoNLL, OntoNotes, and several in-house corpora. It detects Organizations, Products, and Aliases of Companies.
Predicted Entities
ORG, PRODUCT, ALIAS
How to use
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F

# Assumption: a licensed johnsnowlabs installation, where nlp.start() returns the Spark session
spark = nlp.start()

# Wrap the raw text column into Spark NLP documents
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences, then sentences into tokens
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Embeddings consumed by the NER model
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the pipeline
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)

text = ["""In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)")."""]
res = model.transform(spark.createDataFrame([text]).toDF("text"))

# One output row per detected chunk, with its label and confidence
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.metadata)).alias("cols")) \
    .select(F.expr("cols['0']").alias("chunk"),
            F.expr("cols['1']['entity']").alias("ner_label"),
            F.expr("cols['1']['confidence']").alias("confidence")).show(truncate=False)
Results
+------------------------------+---------+----------+
|chunk                         |ner_label|confidence|
+------------------------------+---------+----------+
|Spell Security Private Limited|ORG      |0.8475    |
|Spell Security                |ALIAS    |0.8871    |
|Policy Compliance             |PRODUCT  |0.7991    |
+------------------------------+---------+----------+
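The chunk labels above (ORG, ALIAS, PRODUCT) have no B- or I- prefix, while the benchmark reports token-level labels such as B-ORG and I-ORG. The converter stage is what merges tagged tokens into chunks. The sketch below illustrates that merging idea in plain Python; it is not the library's implementation, and the function name group_iob, the token list, and the tag sequence are all hypothetical, chosen only to be consistent with the example sentence.

```python
def group_iob(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the chunk collected so far, if any
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts here (an orphan I- tag is treated as a start)
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the current entity
            current_tokens.append(token)
        else:
            # "O" tag: outside any entity
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


# Hypothetical token-level tags for part of the example sentence
tokens = ["acquired", "Spell", "Security", "Private", "Limited",
          "(", "Spell", "Security", ")"]
tags = ["O", "B-ORG", "I-ORG", "I-ORG", "I-ORG",
        "O", "B-ALIAS", "I-ALIAS", "O"]

chunks = group_iob(tokens, tags)
print(chunks)
```

Under these assumed tags, the two chunks correspond to the ORG and ALIAS rows of the results table.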
Model Information
Model Name: | finner_orgs_prods_alias |
Type: | finance |
Compatibility: | Finance NLP 1.0.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 16.7 MB |
References
CoNLL-2003, FinSec CoNLL, a subset of OntoNotes, in-house corpora
Benchmarking
label          tp     fp    fn    prec        rec         f1
I-ORG          12853  2621  2685  0.8306191   0.82719785  0.828905
B-PRODUCT      2306   697   932   0.76789874  0.712168    0.7389841
I-ALIAS        14     6     13    0.7         0.5185185   0.59574467
B-ORG          8967   2078  2311  0.81186056  0.79508775  0.80338657
I-PRODUCT      2336   803   1091  0.74418604  0.68164575  0.7115443
B-ALIAS        76     14    22    0.84444445  0.7755102   0.80851066
Macro-average  26552  6219  7054  0.78316814  0.7183547   0.7493626
Micro-average  26552  6219  7054  0.8102285   0.790097    0.80003613
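The precision, recall, and F1 columns can be recomputed from the tp, fp, and fn counts. The following is a minimal sketch using only the counts from the table; the helper name prf is hypothetical. It assumes the macro-average F1 is the harmonic mean of the macro-averaged precision and recall, which is the reading that reproduces the reported value, rather than the mean of the per-label F1 scores.

```python
# Counts copied from the benchmarking table: label -> (tp, fp, fn)
counts = {
    "I-ORG": (12853, 2621, 2685),
    "B-PRODUCT": (2306, 697, 932),
    "I-ALIAS": (14, 6, 13),
    "B-ORG": (8967, 2078, 2311),
    "I-PRODUCT": (2336, 803, 1091),
    "B-ALIAS": (76, 14, 22),
}


def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


per_label = {label: prf(*c) for label, c in counts.items()}

# Micro-average: pool the counts over all labels, then compute the metrics once
tp_sum = sum(c[0] for c in counts.values())
fp_sum = sum(c[1] for c in counts.values())
fn_sum = sum(c[2] for c in counts.values())
micro = prf(tp_sum, fp_sum, fn_sum)

# Macro-average: unweighted mean of per-label precision and recall;
# F1 is then derived from those two means (assumption, see the note above)
macro_p = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_r = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

print("micro:", micro)
print("macro:", (macro_p, macro_r, macro_f1))
```

The micro-average is dominated by the frequent ORG labels, whereas the macro-average weights the rare I-ALIAS label (27 gold tokens) equally with the others, which is why the macro recall is noticeably lower.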