Financial Deidentification Pipeline

Description

This is a Pretrained Pipeline aimed to deidentify legal and financial documents to be compliant with data privacy regulations as GDPR and CCPA. Since the models used in this pipeline are statistical, make sure you use this model in a human-in-the-loop process to guarantee a 100% accuracy.

You can carry out both masking and obfuscation with this pipeline, on the following entities: ALIAS, EMAIL, PHONE, PROFESSION, ORG, DATE, PERSON, ADDRESS, STREET, CITY, STATE, ZIP, COUNTRY, TITLE_CLASS, TICKER, STOCK_EXCHANGE, CFN, IRS

Live Demo Copy S3 URI

How to use

from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline("finpipe_deid", "en", "finance/models")

sample = """CARGILL, INCORPORATED

By:     Pirkko Suominen



Name: Pirkko Suominen Title: Director, Bio Technology Development,  Date:   10/19/2011

BIOAMBER, SAS

By:     Jean-François Huc



Name: Jean-François Huc  Title: President Date:   October 15, 2011

email : jeanfran@gmail.com
phone : 1808733909 

"""

result = deid_pipeline.annotate(sample)
print("\nMasked with entity labels")
print("-"*30)
print("\n".join(result['deidentified']))
print("\nMasked with chars")
print("-"*30)
print("\n".join(result['masked_with_chars']))
print("\nMasked with fixed length chars")
print("-"*30)
print("\n".join(result['masked_fixed_length_chars']))
print("\nObfuscated")
print("-"*30)
print("\n".join(result['obfuscated']))

Results

Masked with entity labels
------------------------------
<PARTY>, <PARTY>
By:     <SIGNING_PERSON>
Name: <PARTY>: <SIGNING_TITLE>,  Date:   <EFFDATE>
<PARTY>, <PARTY>
By:     <SIGNING_PERSON>
Name: <PARTY>: <SIGNING_TITLE>Date:   <EFFDATE>

email : <EMAIL>
phone : <PHONE>

Masked with chars
------------------------------
[*****], [**********]
By:     [*************]
Name: [*******************]: [**********************************]  Center,  Date:   [********]
[******], [*]
By:     [***************]
Name: [**********************]: [*******]Date:   [**************]

email : [****************]
phone : [********]

Masked with fixed length chars
------------------------------
****, ****
By:     ****
Name: ****: ****,  Date:   ****
****, ****
By:     ****
Name: ****: ****Date:   ****

email : ****
phone : ****

Obfuscated
------------------------------
MGT Trust Company, LLC., Clarus llc.
By:     Benjamin Dean
Name: John Snow Labs Inc: Sales Manager,  Date:   03/08/2025
Clarus llc., SESA CO.
By:     JAMES TURNER
Name: MGT Trust Company, LLC.: Business ManagerDate:   11/7/2016

email : Tyrus@google.com
phone : 78 834 854

Model Information

Model Name: finpipe_deid
Type: pipeline
Compatibility: Finance NLP 1.0.0+
License: Licensed
Edition: Official
Language: en
Size: 458.6 MB

Included Models

  • DocumentAssembler
  • SentenceDetector
  • TokenizerModel
  • BertEmbeddings
  • FinanceNerModel
  • NerConverterInternalModel
  • FinanceNerModel
  • NerConverterInternalModel
  • FinanceNerModel
  • NerConverterInternalModel
  • ContextualParserModel
  • ContextualParserModel
  • ContextualParserModel
  • ContextualParserModel
  • ContextualParserModel
  • ChunkMergeModel
  • DeIdentificationModel
  • DeIdentificationModel
  • DeIdentificationModel
  • DeIdentificationModel