Mapping Companies IRS to Edgar Database

Description

This Chunk Mapper model allows you to, given a detected IRS with any NER model, augment it with information available in the SEC Edgar database. Some of the fields included in this Chunk Mapper are:

  • Company Name
  • Sector
  • Former names
  • Address, Phone, State
  • Dates where the company submitted filings
  • etc.

Predicted Entities

Live Demo Copy S3 URI

How to use

document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

tokenizer = nlp.Tokenizer()\
      .setInputCols("document")\
      .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \
        .setInputCols(['document', 'token']) \
        .setOutputCol('embeddings')

ner_model = nlp.NerDLModel.pretrained("onto_100", "en") \
        .setInputCols(["document", "token", "embeddings"]) \
        .setOutputCol("ner")
 
ner_converter = nlp.NerConverter()\
      .setInputCols(["document", "token", "ner"])\
      .setOutputCol("ner_chunk")\
      .setWhiteList(["CARDINAL"])

CM = finance.ChunkMapperModel()\
      .pretrained("finmapper_edgar_irs", "en", "finance/models")\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("mappings")

pipeline = nlp.Pipeline().setStages([document_assembler,
                                 tokenizer, 
                                 embeddings,
                                 ner_model, 
                                 ner_converter, 
                                 CM])

text = ["""873474341 is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services"""]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

Results

[Row(mappings=[Row(annotatorType='labeled_dependency', begin=0, end=8, result='Masterworks 096, LLC', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='RETAIL-RETAIL STORES, NEC [5990]', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'sic', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='5990', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'sic_code', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='873474341', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'irs_number', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='1231', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'fiscal_year_end', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='NY', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'state_location', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='DE', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'state_incorporation', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='225 LIBERTY STREET', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_street', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='NEW YORK', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_city', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='NY', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_state', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='10281', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_zip', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='2035185172', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'business_phone', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'former_name', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'former_name_date', 'all_relations': ''}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='2022-01-10', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'date', 'all_relations': '2022-04-26:::2021-11-17'}, embeddings=[]), Row(annotatorType='labeled_dependency', begin=0, end=8, result='1894064', metadata={'sentence': '0', 'chunk': '0', 'entity': '873474341', 'relation': 'company_id', 'all_relations': ''}, embeddings=[])])]

Model Information

Model Name: finmapper_edgar_irs
Type: finance
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Input Labels: [ner_chunk]
Output Labels: [mappings]
Language: en
Size: 5.7 MB

References

Manually scrapped Edgar Database