Description
This model detects Organizations (ORG), People (PER) and Locations (LOC) in financial texts. Was trained using manual annotations, coNll2003 and financial documents obtained from U.S. Security and Exchange Commission (SEC) filings.
Financial documents may be long, going over the limits of most of the standard Deep Learning and Transforming architectures. Please considering aggressive Sentece Splitting mechanisms to split sentences into smaller chunks.
Predicted Entities
ORG
, LOC
, PER
How to use
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# Consider using nlp.SentenceDetector with rules/patterns to get smaller chunks from long sentences
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
jsl_ner = finance.NerModel.pretrained("finner_sec_conll", "en", "finance/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("jsl_ner")
jsl_ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "jsl_ner"]) \
.setOutputCol("ner_chunk")
jsl_ner_pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentence_detector,
tokenizer,
embeddings,
jsl_ner,
jsl_ner_converter])
text = """December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender"."""
df = spark.createDataFrame([[text]]).toDF("text")
model = jsl_ner_pipeline.fit(df)
res = model.transform(df)
Results
+------------------------------------------------+-----+
|ner_chunk |label|
+------------------------------------------------+-----+
|SILICIUM DE PROVENCE S.A.S |ORG |
|France |LOC |
|Usine de Saint Auban |LOC |
|France |LOC |
|Mr.Frank Wouters |PER |
|Borrower |PER |
|EVERGREEN SOLAR INC |ORG |
|Delaware |LOC |
|U.S.A |LOC |
|Bartlett Street |LOC |
|Marlboro |LOC |
|Massachusetts |LOC |
|U.S.A |LOC |
|Richard Chleboski |PER |
|Lender |PER |
+------------------------------------------------+-----+
Model Information
Model Name: | finner_sec_conll |
Type: | finance |
Compatibility: | Finance NLP 1.0.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 16.3 MB |
References
Manual annotations, coNll2003 and financial documents obtained from U.S. Security and Exchange Commission (SEC) filings.
Benchmarking
label tp fp fn prec rec f1
B-LOC 14 6 11 0.7 0.56 0.6222222
I-ORG 59 30 3 0.66292137 0.9516129 0.781457
I-LOC 32 2 22 0.9411765 0.5925926 0.7272727
I-PER 18 4 5 0.8181818 0.7826087 0.8
B-ORG 47 17 5 0.734375 0.90384614 0.8103449
B-PER 211 7 2 0.9678899 0.9906103 0.97911835
Macro-average 381 66 48 0.80409074 0.7968784 0.8004684
Micro-average 381 66 48 0.852349 0.8881119 0.869863