Description
This pretrained pipeline deidentifies legal and financial documents to help them comply with data privacy regulations such as GDPR and CCPA. Because the models in this pipeline are statistical, they can miss or mislabel entities; use the pipeline in a human-in-the-loop process when you need 100% accuracy.
You can carry out both masking and obfuscation with this pipeline, on the following entities:
PROFESSION, URL, LOCATION-OTHER, CITY, DATE, ZIP, PERSON, STATE, COUNTRY, STREET, ORG, PHONE, EMAIL, AGE, ADDRESS, FISCAL_YEAR, TICKER, TITLE_CLASS, CFN, STOCK_EXCHANGE, IRS, SIGNING_PERSON, PARTY, SIGNING_TITLE, ALIAS
Predicted Entities
PROFESSION, URL, LOCATION-OTHER, CITY, DATE, ZIP, PERSON, STATE, COUNTRY, STREET, ORG, PHONE, EMAIL, AGE, ADDRESS, FISCAL_YEAR, TICKER, TITLE_CLASS, CFN, STOCK_EXCHANGE, IRS, SIGNING_PERSON, PARTY, SIGNING_TITLE, ALIAS
How to use
from sparknlp.pretrained import PretrainedPipeline

# Download (or load from the local cache) the licensed deidentification pipeline
deid_pipeline = PretrainedPipeline("finpipe_deid", "en", "finance/models")
sample = """CARGILL, INCORPORATED
By:     Pirkko Suominen
Name: Pirkko Suominen Title: Director, Bio Technology Development,  Date:   10/19/2011
BIOAMBER, SAS
By:     Jean-François Huc
Name: Jean-François Huc  Title: President Date:   October 15, 2011
email : jeanfran@gmail.com
phone : 1808733909 
"""
# annotate() returns a dict that maps each output column to a list of strings
result = deid_pipeline.annotate(sample)
print("\nMasked with entity labels")
print("-"*30)
print("\n".join(result['deidentified']))
print("\nMasked with chars")
print("-"*30)
print("\n".join(result['masked_with_chars']))
print("\nMasked with fixed length chars")
print("-"*30)
print("\n".join(result['masked_fixed_length_chars']))
print("\nObfuscated")
print("-"*30)
print("\n".join(result['obfuscated']))
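Because the description recommends a human-in-the-loop process, it can help to flag masked lines that still look like they contain an email address or a long digit run before a reviewer reads them. The following is a minimal sketch in plain Python and is not part of the pipeline; the `find_residual_pii` helper and its two regular expressions are assumptions chosen for illustration, and they cover only a small fraction of the entity types listed above.

```python
import re

# Hypothetical patterns for illustration; tune them to your documents.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DIGITS_RE = re.compile(r"\d{7,}")  # phone-like runs of 7 or more digits


def find_residual_pii(lines):
    """Return (line_index, kind, matched_text) for every suspicious match."""
    findings = []
    for i, line in enumerate(lines):
        for kind, pattern in (("EMAIL", EMAIL_RE), ("DIGITS", DIGITS_RE)):
            for match in pattern.finditer(line):
                findings.append((i, kind, match.group()))
    return findings


# Stand-in for result['deidentified']; the second line simulates a missed entity.
masked_lines = [
    "email : <EMAIL>",
    "phone : 1808733909",
    "By:     <SIGNING_PERSON>",
]

for index, kind, text in find_residual_pii(masked_lines):
    print(f"line {index}: possible {kind} left unmasked: {text}")
```

A check like this is only meaningful on the masked outputs. The obfuscated output intentionally contains realistic-looking fake emails and numbers, all of which would be flagged.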
Results
Masked with entity labels
------------------------------
<PARTY>, <PARTY>
By:     <SIGNING_PERSON>
Name: <PARTY>: <SIGNING_TITLE>,  Date:   <EFFDATE>
<PARTY>, <PARTY>
By:     <SIGNING_PERSON>
Name: <PARTY>: <SIGNING_TITLE>Date:   <EFFDATE>
email : <EMAIL>
phone : <PHONE>
Masked with chars
------------------------------
[*****], [**********]
By:     [*************]
Name: [*******************]: [**********************************]  Center,  Date:   [********]
[******], [*]
By:     [***************]
Name: [**********************]: [*******]Date:   [**************]
email : [****************]
phone : [********]
Masked with fixed length chars
------------------------------
****, ****
By:     ****
Name: ****: ****,  Date:   ****
****, ****
By:     ****
Name: ****: ****Date:   ****
email : ****
phone : ****
Obfuscated
------------------------------
MGT Trust Company, LLC., Clarus llc.
By:     Benjamin Dean
Name: John Snow Labs Inc: Sales Manager,  Date:   03/08/2025
Clarus llc., SESA CO.
By:     JAMES TURNER
Name: MGT Trust Company, LLC.: Business ManagerDate:   11/7/2016
email : Tyrus@google.com
phone : 78 834 854
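The four outputs above differ only in what replaces each detected span. The sketch below mimics those four strategies in plain Python on hand-written spans so the differences are easy to compare. It is an illustration, not the `DeIdentificationModel` implementation: the `Span` type, the `replace_spans` helper, and the fake-value table are hypothetical, and the same-length rule (brackets counted toward the original length) is inferred from the sample output rather than taken from the library documentation.

```python
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # entity label, e.g. "EMAIL"


def replace_spans(text: str, spans: List[Span], repl: Callable[[str, str], str]) -> str:
    """Rebuild text, substituting repl(original, label) for each span."""
    out, cursor = [], 0
    for span in sorted(spans):
        out.append(text[cursor:span.start])
        out.append(repl(text[span.start:span.end], span.label))
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


def entity_label(original: str, label: str) -> str:
    return f"<{label}>"


def same_length_chars(original: str, label: str) -> str:
    # Brackets count toward the length; spans shorter than 3 chars are an
    # edge case this sketch does not try to reproduce faithfully.
    return "[" + "*" * max(len(original) - 2, 1) + "]"


def fixed_length_chars(original: str, label: str) -> str:
    return "****"


# Made-up lookup table; the values are copied from the sample output above.
FAKE_VALUES = {"EMAIL": "Tyrus@google.com", "PHONE": "78 834 854"}


def obfuscate(original: str, label: str) -> str:
    return FAKE_VALUES.get(label, f"<{label}>")


text = "email : jeanfran@gmail.com phone : 1808733909"
spans = [Span(8, 26, "EMAIL"), Span(35, 45, "PHONE")]

for strategy in (entity_label, same_length_chars, fixed_length_chars, obfuscate):
    print(f"{strategy.__name__:>18}: {replace_spans(text, spans, strategy)}")
```

Same-length masking keeps the document layout intact but reveals how long each original value was. Fixed-length masking hides that, and obfuscation keeps the text readable for downstream use at the cost of introducing fake values.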
Model Information
| Model Name: | finpipe_deid | 
| Type: | pipeline | 
| Compatibility: | Finance NLP 1.0.0+ | 
| License: | Licensed | 
| Edition: | Official | 
| Language: | en | 
| Size: | 458.6 MB | 
Included Models
- DocumentAssembler
- SentenceDetector
- TokenizerModel
- BertEmbeddings
- FinanceNerModel
- NerConverterInternalModel
- FinanceNerModel
- NerConverterInternalModel
- FinanceNerModel
- NerConverterInternalModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ContextualParserModel
- ChunkMergeModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel
- DeIdentificationModel