Detect former names of companies in texts (small)

Description

This NER model detects former names of companies. It was trained on Wikipedia texts about companies.

Predicted Entities

FORMER_NAME, O

How to use

from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F

# Wrap the raw text column into Spark NLP documents
documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences
sentencizer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split sentences into tokens
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Token embeddings consumed by the NER model
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

# Pretrained NER model that tags former company names
ner = finance.NerModel.pretrained("finner_wiki_formername", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
chunks = finance.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

# `df` is assumed to be a Spark DataFrame with a "text" column
pipe = nlp.Pipeline(stages=[documenter, sentencizer, tokenizer, embeddings, ner, chunks])
model = pipe.fit(df)
res = model.transform(df)

# Flatten the chunk annotations into one row per detected entity
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \
       .select(F.expr("cols['3']['sentence']").alias("sentence_id"),
               F.expr("cols['0']").alias("chunk"),
               F.expr("cols['2']").alias("end"),
               F.expr("cols['3']['entity']").alias("ner_label"))\
       .filter("ner_label!='O'")\
       .show(truncate=False)

Results

+-----------+------------------+---+-----------+
|sentence_id|chunk             |end|ner_label  |
+-----------+------------------+---+-----------+
|0          |Toro Motor Company|57 |FORMER_NAME|
+-----------+------------------+---+-----------+
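
The result above is a single FORMER_NAME chunk, whereas the Benchmarking section reports token-level B-FORMER_NAME and I-FORMER_NAME tags. The following is a minimal sketch of how IOB tags can be merged into chunks. It is a simplified illustration in plain Python, not the implementation used by NerConverterInternal. The helper name `iob_to_chunks` and the example sentence are hypothetical: only the chunk "Toro Motor Company" comes from the results, and the surrounding tokens are assumed for illustration.

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the chunk collected so far, if any
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag (or an I- tag with no matching open chunk) starts a new chunk
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continue the chunk that is currently open
            current_tokens.append(token)
        else:
            # "O": close any open chunk
            flush()
            current_tokens, current_label = [], None

    flush()
    return chunks


# Hypothetical token sequence; only "Toro Motor Company" appears in the results above
tokens = ["formerly", "known", "as", "Toro", "Motor", "Company", "."]
tags = ["O", "O", "O", "B-FORMER_NAME", "I-FORMER_NAME", "I-FORMER_NAME", "O"]

print(iob_to_chunks(tokens, tags))
```

Tokens are joined with a single space here for simplicity; the real pipeline instead reports character offsets (the `begin` and `end` fields) into the original text.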

Model Information

Model Name: finner_wiki_formername
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 1.2 MB

References

Wikipedia

Benchmarking

label          tp  fp  fn  prec        rec        f1
I-FORMER_NAME  29  20  13  0.59183675  0.6904762  0.63736266
B-FORMER_NAME  19  5   8   0.7916667   0.7037037  0.7450981
Macro-average  48  25  21  0.6917517   0.6970899  0.69441056
Micro-average  48  25  21  0.65753424  0.6956522  0.6760564
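
The figures in the table can be recomputed from the tp, fp and fn counts. The sketch below uses the standard precision, recall and F1 definitions; the names `Counts` and `prf` are illustrative and not part of any library. One detail worth noting: the reported macro-average F1 matches the harmonic mean of the macro-averaged precision and recall, rather than the mean of the two per-label F1 scores, so the sketch computes it that way.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives


def prf(c):
    """Return (precision, recall, f1) for one set of counts."""
    prec = c.tp / (c.tp + c.fp)
    rec = c.tp / (c.tp + c.fn)
    f1 = 2 * prec * rec / (prec + rec)
    return prec, rec, f1


# Counts copied from the benchmark table
counts = {
    "I-FORMER_NAME": Counts(tp=29, fp=20, fn=13),
    "B-FORMER_NAME": Counts(tp=19, fp=5, fn=8),
}

per_label = {label: prf(c) for label, c in counts.items()}

# Micro-average: pool the counts across labels, then compute the metrics once
total = Counts(
    tp=sum(c.tp for c in counts.values()),
    fp=sum(c.fp for c in counts.values()),
    fn=sum(c.fn for c in counts.values()),
)
micro = prf(total)

# Macro-average: average precision and recall across labels,
# then take the harmonic mean of those two averages for F1
macro_prec = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_rec = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)
macro = (macro_prec, macro_rec, macro_f1)

for name, (p, r, f) in {**per_label, "Macro-average": macro, "Micro-average": micro}.items():
    print(f"{name}: prec={p:.4f} rec={r:.4f} f1={f:.4f}")
```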