Description
This is an NER model aimed at detecting Company Sectors. It was trained on Wikipedia texts about companies.
Predicted Entities
SECTOR, O
How to use
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

spark = nlp.start()

# Illustrative input; replace df with your own DataFrame of company texts
df = spark.createDataFrame(
    [["The company manufactures and sells lawn mowers, snow blowers and irrigation systems."]]
).toDF("text")

documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencizer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner = finance.NerModel.pretrained("finner_wiki_sector", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

chunks = finance.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipe = nlp.Pipeline(stages=[documenter, sentencizer, tokenizer, embeddings, ner, chunks])

model = pipe.fit(df)
res = model.transform(df)

# Flatten the ner_chunk annotations and keep only labelled chunks
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols"))\
    .select(F.expr("cols['3']['sentence']").alias("sentence_id"),
            F.expr("cols['0']").alias("chunk"),
            F.expr("cols['2']").alias("end"),
            F.expr("cols['3']['entity']").alias("ner_label"))\
    .filter("ner_label != 'O'")\
    .show(truncate=False)
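For quick tests on individual strings, the fitted pipeline can also be wrapped in a LightPipeline instead of transforming a DataFrame. A minimal sketch, reusing the model variable from above; the input sentence is illustrative:

light_model = nlp.LightPipeline(model)

# fullAnnotate returns one dict of annotations per input string
annotations = light_model.fullAnnotate("The firm operates in the renewable energy and waste management sectors.")[0]

for chunk in annotations["ner_chunk"]:
    print(chunk.result, chunk.metadata["entity"])

fullAnnotate keeps the full Annotation objects, so begin/end offsets and metadata remain available alongside the chunk text.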
Results
+-----------+-----------------+---+---------+
|sentence_id|chunk |end|ner_label|
+-----------+-----------------+---+---------+
|1 |lawn mowers |175|SECTOR |
|1 |snow blowers |192|SECTOR |
|1 |irrigation system|214|SECTOR |
+-----------+-----------------+---+---------+
Model Information
Model Name: finner_wiki_sector
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 1.2 MB
References
Wikipedia
Benchmarking
label          tp  fp  fn  prec       rec         f1
B-SECTOR       70  17  23  0.8045977  0.75268817  0.7777778
I-SECTOR       24  11   9  0.6857143  0.72727275  0.70588243
Macro-average  94  28  32  0.745156   0.73998046  0.7425592
Micro-average  94  28  32  0.7704918  0.74603176  0.7580645
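The averages follow directly from the per-label counts: precision = tp / (tp + fp), recall = tp / (tp + fn), and F1 is their harmonic mean. The micro-average pools the counts across labels, while the macro-average averages the per-label precision and recall. A short sketch of the arithmetic:

def prf(tp, fp, fn):
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return prec, rec, 2 * prec * rec / (prec + rec)

print(prf(70, 17, 23))                # B-SECTOR      -> ~(0.8046, 0.7527, 0.7778)
print(prf(24, 11, 9))                 # I-SECTOR      -> ~(0.6857, 0.7273, 0.7059)
print(prf(70 + 24, 17 + 11, 23 + 9))  # micro-average -> ~(0.7705, 0.7460, 0.7581)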