Financial Zero-shot Relation Extraction

Description

This is a zero-shot relation extraction model: it does not require any training data, only a few example sentences for each relation type you are looking for.

Make sure you keep the proper syntax of the relations you want to extract. For example:

re_model.setRelationalCategories({
    "DECREASE": ["{PROFIT_DECLINE} decrease {AMOUNT}", "{PROFIT_DECLINE} decrease {PERCENTAGE}"],
    "INCREASE": ["{PROFIT_INCREASE} increase {AMOUNT}", "{PROFIT_INCREASE} increase {PERCENTAGE}"]
})

The keys of the dictionary are the names of the relations (DECREASE, INCREASE).
The values are lists of example sentences that express the relation.
The names in curly brackets are NER labels, which must be extracted beforehand by an NER component in the pipeline.
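
To make the template syntax concrete, the sketch below shows how a template such as "{PROFIT_DECLINE} decrease {PERCENTAGE}" could be filled with NER chunks to produce a hypothesis string like the ones reported in the hypothesis field of the model output. It is an illustration only, not the library's implementation: fill_template is a hypothetical helper, and the chunk values are taken from the sample results on this page.

```python
import re

def fill_template(template, chunks):
    """Replace each {LABEL} placeholder with the chunk text for that NER label.

    Returns None when a label used in the template has no matching chunk.
    fill_template is a hypothetical helper, not part of Finance NLP.
    """
    labels = re.findall(r"\{(\w+)\}", template)
    if any(label not in chunks for label in labels):
        return None
    return re.sub(r"\{(\w+)\}", lambda m: chunks[m.group(1)], template)

# Chunk texts and labels taken from the sample results on this page.
chunks = {"PROFIT_DECLINE": "License fees revenue", "PERCENTAGE": "40"}

hypothesis = fill_template("{PROFIT_DECLINE} decrease {PERCENTAGE}", chunks)
print(hypothesis)  # License fees revenue decrease 40
```

The nli_prediction field in the results suggests that each such hypothesis is then checked against the sentence by a natural language inference model, which would explain why a label that is misspelled or missing from the NER output yields no relation.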

Predicted Entities

How to use

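from johnsnowlabs import nlp, finance

# Start a Spark session with Spark NLP and Finance NLP loaded
# (assumes a valid John Snow Labs license is already configured).
spark = nlp.start()

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")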

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner_model = finance.NerModel.pretrained("finner_10k", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

re_model = finance.ZeroShotRelationExtractionModel.pretrained("finre_zero_shot", "en", "finance/models")\
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("relations")

# Remember it's 2 curly brackets instead of one if you are using Spark NLP < 4.0
re_model.setRelationalCategories({
    "DECREASE": ["{PROFIT_DECLINE} decrease {AMOUNT}", "{PROFIT_DECLINE} decrease {PERCENTAGE}"],
    "INCREASE": ["{PROFIT_INCREASE} increase {AMOUNT}", "{PROFIT_INCREASE} increase {PERCENTAGE}"]
})

# Use the same nlp namespace as the other stages; sparknlp is never imported here.
pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    re_model
])
               
sample_text = """License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 
compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million 
for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019.
Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 
from $ 8.7 million for the year ended December 31, 2019. The increase was primarily related to increase in internal staff costs of $ 1.1 million as 
we increased delivery staff and work performed on internal projects, partially offset by a decrease in third party consultant costs of $ 0.6 million 
as these were converted to internal staff or terminated. Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. 
As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. 
Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")
model = pipeline.fit(data)
results = model.transform(data)

# ner output
results.selectExpr("explode(ner_chunk) as ner").show(truncate=False)

# relations output
results.selectExpr("explode(relations) as relation").show(truncate=False)
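
The comment in the pipeline code notes that the placeholders need two curly brackets instead of one on Spark NLP < 4.0. A minimal sketch of a helper that converts the single-bracket templates is shown below. double_braces is a hypothetical name, not a library function, and the behaviour on older versions should be checked against the installed release.

```python
def double_braces(categories):
    """Return a copy of the categories with every { and } doubled.

    Hypothetical helper for Spark NLP < 4.0, following the comment in the
    pipeline code above; it does not modify the input dictionary.
    """
    return {
        relation: [t.replace("{", "{{").replace("}", "}}") for t in templates]
        for relation, templates in categories.items()
    }

categories = {
    "DECREASE": ["{PROFIT_DECLINE} decrease {AMOUNT}"],
}
print(double_braces(categories))
# {'DECREASE': ['{{PROFIT_DECLINE}} decrease {{AMOUNT}}']}
```

The converted dictionary would then be passed to setRelationalCategories in place of the single-bracket version.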

Results

+--------------------------------------------------------------------------------------------------------------------------+
|ner                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------------------+
|[chunk, 0, 19, License fees revenue, [entity -> PROFIT_DECLINE, sentence -> 0, chunk -> 0, confidence -> 0.41060004], []] |
|[chunk, 31, 32, 40, [entity -> PERCENTAGE, sentence -> 0, chunk -> 1, confidence -> 0.9995], []]                          |
|[chunk, 40, 40, $, [entity -> CURRENCY, sentence -> 0, chunk -> 2, confidence -> 0.9995], []]                             |
|[chunk, 42, 52, 0.5 million, [entity -> AMOUNT, sentence -> 0, chunk -> 3, confidence -> 0.99995], []]                    |
|[chunk, 57, 57, $, [entity -> CURRENCY, sentence -> 0, chunk -> 4, confidence -> 0.9998], []]                             |
|[chunk, 59, 69, 0.7 million, [entity -> AMOUNT, sentence -> 0, chunk -> 5, confidence -> 0.99985003], []]                 |
|[chunk, 90, 106, December 31, 2020, [entity -> FISCAL_YEAR, sentence -> 0, chunk -> 6, confidence -> 0.977525], []]       |
|[chunk, 121, 121, $, [entity -> CURRENCY, sentence -> 0, chunk -> 7, confidence -> 0.9996], []]                           |
|[chunk, 123, 133, 1.2 million, [entity -> AMOUNT, sentence -> 0, chunk -> 8, confidence -> 0.99975], []]                  |
|[chunk, 154, 170, December 31, 2019, [entity -> FISCAL_YEAR, sentence -> 0, chunk -> 9, confidence -> 0.96227497], []]    |
|[chunk, 173, 188, Services revenue, [entity -> PROFIT_INCREASE, sentence -> 1, chunk -> 10, confidence -> 0.57490003], []]|
|[chunk, 200, 200, 4, [entity -> PERCENTAGE, sentence -> 1, chunk -> 11, confidence -> 0.9997], []]                        |
|[chunk, 208, 208, $, [entity -> CURRENCY, sentence -> 1, chunk -> 12, confidence -> 0.999], []]                           |
|[chunk, 210, 220, 1.1 million, [entity -> AMOUNT, sentence -> 1, chunk -> 13, confidence -> 0.99995], []]                 |
|[chunk, 226, 226, $, [entity -> CURRENCY, sentence -> 1, chunk -> 14, confidence -> 0.9982], []]                          |
|[chunk, 228, 239, 25.6 million, [entity -> AMOUNT, sentence -> 1, chunk -> 15, confidence -> 0.99975], []]                |
|[chunk, 261, 277, December 31, 2020, [entity -> FISCAL_YEAR, sentence -> 1, chunk -> 16, confidence -> 0.97915], []]      |
|[chunk, 284, 284, $, [entity -> CURRENCY, sentence -> 1, chunk -> 17, confidence -> 0.9991], []]                          |
|[chunk, 286, 297, 24.5 million, [entity -> AMOUNT, sentence -> 1, chunk -> 18, confidence -> 0.99965], []]                |
|[chunk, 318, 334, December 31, 2019, [entity -> FISCAL_YEAR, sentence -> 1, chunk -> 19, confidence -> 0.9588], []]       |
+--------------------------------------------------------------------------------------------------------------------------+

+--------+
|relation|
+--------+
|[category, 0, 217, DECREASE, [entity1_begin -> 0, relation -> DECREASE, hypothesis -> License fees revenue decrease 40, confidence -> 0.9931541, nli_prediction -> entail, entity1 -> PROFIT_DECLINE, syntactic_distance -> undefined, chunk2 -> 40, entity2_end -> 32, entity1_end -> 19, entity2_begin -> 31, entity2 -> PERCENTAGE, chunk1 -> License fees revenue, sentence -> 0], []]                  |
|[category, 672, 898, DECREASE, [entity1_begin -> 0, relation -> DECREASE, hypothesis -> License fees revenue decrease 1.2 million, confidence -> 0.7394818, nli_prediction -> entail, entity1 -> PROFIT_DECLINE, syntactic_distance -> undefined, chunk2 -> 1.2 million, entity2_end -> 133, entity1_end -> 19, entity2_begin -> 123, entity2 -> AMOUNT, chunk1 -> License fees revenue, sentence -> 0], []]|
|[category, 445, 671, DECREASE, [entity1_begin -> 0, relation -> DECREASE, hypothesis -> License fees revenue decrease 0.7 million, confidence -> 0.99002415, nli_prediction -> entail, entity1 -> PROFIT_DECLINE, syntactic_distance -> undefined, chunk2 -> 0.7 million, entity2_end -> 69, entity1_end -> 19, entity2_begin -> 59, entity2 -> AMOUNT, chunk1 -> License fees revenue, sentence -> 0], []] |
|[category, 218, 444, DECREASE, [entity1_begin -> 0, relation -> DECREASE, hypothesis -> License fees revenue decrease 0.5 million, confidence -> 0.99084955, nli_prediction -> entail, entity1 -> PROFIT_DECLINE, syntactic_distance -> undefined, chunk2 -> 0.5 million, entity2_end -> 52, entity1_end -> 19, entity2_begin -> 42, entity2 -> AMOUNT, chunk1 -> License fees revenue, sentence -> 0], []] |
+--------+
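
Each relation carries a confidence score and an nli_prediction in its metadata, so low-scoring candidates can be filtered out after the pipeline runs. The sketch below works on plain Python dictionaries copied from the output above (abridged to the fields it uses). keep_confident and the 0.9 threshold are assumptions chosen for illustration, not recommended settings.

```python
# Metadata values copied from the relations output above, abridged to the
# fields used in this sketch.
relations = [
    {"relation": "DECREASE", "chunk1": "License fees revenue", "chunk2": "40",
     "confidence": 0.9931541, "nli_prediction": "entail"},
    {"relation": "DECREASE", "chunk1": "License fees revenue", "chunk2": "1.2 million",
     "confidence": 0.7394818, "nli_prediction": "entail"},
    {"relation": "DECREASE", "chunk1": "License fees revenue", "chunk2": "0.7 million",
     "confidence": 0.99002415, "nli_prediction": "entail"},
    {"relation": "DECREASE", "chunk1": "License fees revenue", "chunk2": "0.5 million",
     "confidence": 0.99084955, "nli_prediction": "entail"},
]

def keep_confident(relations, threshold=0.9):
    """Keep entailed relations whose confidence is at or above the threshold.

    Hypothetical helper; float() is applied because annotation metadata
    values may arrive as strings.
    """
    return [
        r for r in relations
        if r["nli_prediction"] == "entail" and float(r["confidence"]) >= threshold
    ]

for r in keep_confident(relations):
    print(r["chunk1"], r["relation"], r["chunk2"], r["confidence"])
```

With this threshold, the candidate pairing License fees revenue with 1.2 million (confidence 0.74, the lowest of the four) is dropped, while the other three relations are kept.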

Model Information

Model Name:	finre_zero_shot
Type:	finance
Compatibility:	Finance NLP 1.0.0+
License:	Licensed
Edition:	Official
Language:	en
Size:	406.4 MB
Case sensitive:	true

References

BERT Base (cased) trained on the GLUE MNLI dataset.
