Detect Nationality / Company Founding Places in texts

Description

This is an NER model that detects nationalities, more specifically mentions of the country in which a company was founded. It was trained on Wikipedia texts about companies.

Predicted Entities

NATIONALITY, O


How to use

from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Assemble raw text into documents
documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences
sentencizer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split sentences into tokens
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# SEC-BERT embeddings consumed by the NER model
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

# Pretrained nationality NER model
ner = finance.NerModel.pretrained("finner_wiki_nationality", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Group IOB tags into entity chunks
chunks = finance.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipe = nlp.Pipeline(stages=[documenter, sentencizer, tokenizer, embeddings, ner, chunks])
model = pipe.fit(df)
res = model.transform(df)


# Flatten the NER chunks and keep only the labelled entities
res.select(F.explode(F.arrays_zip(res.ner_chunk.result,
                                  res.ner_chunk.begin,
                                  res.ner_chunk.end,
                                  res.ner_chunk.metadata)).alias("cols"))\
   .select(F.expr("cols['3']['sentence']").alias("sentence_id"),
           F.expr("cols['0']").alias("chunk"),
           F.expr("cols['2']").alias("end"),
           F.expr("cols['3']['entity']").alias("ner_label"))\
   .filter("ner_label != 'O'")\
   .show(truncate=False)
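
For quick inference on a single string, the fitted model can also be wrapped in a LightPipeline. This is a minimal sketch, assuming the johnsnowlabs nlp module exposes Spark NLP's standard LightPipeline; the sample sentence is hypothetical.

# Wrap the fitted PipelineModel for lightweight, in-memory annotation
light_model = nlp.LightPipeline(model)
light_result = light_model.fullAnnotate("Acme Ltd. is a British engineering company founded in London.")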

Results

+-----------+--------+---+-----------+
|sentence_id|chunk   |end|ner_label  |
+-----------+--------+---+-----------+
|0          |American|73 |NATIONALITY|
+-----------+--------+---+-----------+

Model Information

Model Name: finner_wiki_nationality
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 1.2 MB

References

Wikipedia

Benchmarking

label           tp   fp   fn   prec       rec          f1
B-NATIONALITY   57   7    1    0.890625   0.98275864   0.93442625
Macro-average   57   7    1    0.890625   0.98275864   0.93442625
Micro-average   57   7    1    0.890625   0.98275864   0.93442625