Normalize Parent Company Names using Wikidata

Description

This is an Entity Resolution model that normalizes a previously extracted ORG entity to its reference name in Wikidata. The normalized name can then be passed to the finel_wiki_parentorgs Chunk Mapping model to retrieve information about subsidiaries, countries, stock exchanges, etc.

It also returns the TICKER, which is available in the metadata of the resolver output (the example under "How to use" reads it from the all_k_aux_labels field).
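As a rough illustration of what resolving by sentence embedding means, the toy sketch below picks the reference name whose vector lies closest to the query vector in Euclidean distance. It is not the library's implementation: the three-dimensional vectors, the `REFERENCE_NAMES` table, and the `resolve` helper are all hypothetical and exist only for this example.

```python
import math

# Hypothetical 3-dimensional "embeddings". Real sentence embeddings have
# hundreds of dimensions and are produced by the encoder, not written by hand.
REFERENCE_NAMES = {
    "Alphabet Inc.": (0.9, 0.1, 0.0),
    "Example Holdings A": (0.1, 0.9, 0.0),
    "Example Holdings B": (0.0, 0.2, 0.9),
}


def resolve(query_vector, references=REFERENCE_NAMES):
    """Return the reference name with the smallest Euclidean distance to the query."""
    return min(
        references,
        key=lambda name: math.dist(query_vector, references[name]),
    )


# A made-up query vector that sits near the "Alphabet Inc." reference vector.
print(resolve((0.8, 0.2, 0.1)))
```

The actual resolver works on the embeddings produced by the encoder stage and also returns several ranked candidates in its metadata, which this sketch leaves out.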

How to use

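from johnsnowlabs import nlp, finance

# Assumes a Spark session with a valid Finance NLP license is already running.

# The input text is the extracted ORG mention itself, hence the "ner_chunk" column name.
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("ner_chunk")

# Encode the company mention as a sentence embedding.
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols("ner_chunk") \
    .setOutputCol("sentence_embeddings")

# Resolve the embedding to the closest Wikidata reference name.
resolver = finance.SentenceEntityResolverModel.pretrained("finel_wiki_parentorgs", "en", "finance/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("normalized_name") \
    .setDistanceFunction("EUCLIDEAN")

# Every stage is already a transformer, so the stages can be wrapped directly
# in a PipelineModel, which is what LightPipeline expects.
pipelineModel = nlp.PipelineModel(
    stages=[documentAssembler, embeddings, resolver])

lp = nlp.LightPipeline(pipelineModel)
test_pred = lp.fullAnnotate('ALPHABET')

# Normalized company name
print(test_pred[0]['normalized_name'][0].result)
# Ticker of the top candidate; candidates are separated by ':::'
print("Aux data:", test_pred[0]['normalized_name'][0].metadata['all_k_aux_labels'].split(':::')[0])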

Results

Alphabet Inc.
Aux data: GOOGL
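
The ticker in the output above is the first item of the `:::`-separated `all_k_aux_labels` metadata string. A minimal sketch of pulling out every candidate ticker, not only the first, could look like the following. The `candidate_tickers` helper and the sample metadata are hypothetical: only the key name and the separator come from the example above, and the `XXXX` and `YYYY` entries are placeholders rather than real model output.

```python
def candidate_tickers(metadata, separator=":::"):
    """Split the 'all_k_aux_labels' metadata string into a list of tickers.

    Returns an empty list when the key is missing or the string is empty.
    """
    raw = metadata.get("all_k_aux_labels", "")
    return [label.strip() for label in raw.split(separator) if label.strip()]


# Hypothetical metadata dict shaped like the resolver annotation metadata.
# Only "GOOGL" matches the documented result; the rest are placeholders.
sample_metadata = {"all_k_aux_labels": "GOOGL:::XXXX:::YYYY"}

tickers = candidate_tickers(sample_metadata)
print(tickers[0])
```

With a real prediction, the same helper would be called as `candidate_tickers(test_pred[0]['normalized_name'][0].metadata)`, assuming the metadata behaves like a dictionary as it does in the example above.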

Model Information

Model Name:	finel_wiki_parentorgs
Compatibility:	Finance NLP 1.0.0+
License:	Licensed
Edition:	Official
Input Labels:	[sentence_embeddings]
Output Labels:	[original_company_name]
Language:	en
Size:	2.8 MB
Case sensitive:	false

References

Wikidata dump about company holdings, retrieved using SPARQL