Spark NLP for Healthcare Release Notes 4.2.0

 

4.2.0

Highlights

  • Introducing 46 new Oncology-specific pretrained models (12 NER, 12 BERT-based token classification, 14 relation extraction, 8 assertion status models)
  • Brand new NerQuestionGenerator annotator for automated prompt generation for a QA-based Zero-Shot NER model
  • Updated ALAB (Annotation Lab) module, now a full-fledged suite to manage activities on ALAB remotely via its API
  • New pretrained assertion status detection model (assertion_jsl_augmented) to classify the negativity & assertion scope of medical concepts
  • New chunk mapper models and pretrained pipeline to map entities (phrases) to their corresponding ICD-9, ICD-10-CM and RxNorm codes
  • New ICD-9-CM sentence entity resolver model and pretrained pipeline
  • New shifting days feature in DeIdentification by using the new DocumentHashCoder annotator
  • Updated NER model finder pretrained pipeline to help users find the most appropriate NER model for their use case with a one-liner
  • Medicare risk adjustment score calculation module updated to support different version and year combinations
  • Core improvements and bug fixes
  • New and updated notebooks
  • 50+ new clinical models and pipelines added & updated in total

Introducing 46 New Oncology-Specific Pretrained Models (12 NER, 12 BERT-Based Token Classification, 14 Relation Extraction, 8 Assertion Status Models)

These are the first (wip: work in progress) versions of our Oncology models.

See the Oncology Model Notebook for examples.

New Oncological NER and BERT-Based Token Classification Models

We have 12 new oncological NER models and their BERT-based token classification counterparts.

| NER model name (MedicalNerModel) | BERT-based model name (MedicalBertForTokenClassifier) | description | predicted entities |
|----------------------------------|--------------------------------------------------------|-------------|--------------------|
| ner_oncology_therapy_wip | bert_token_classifier_ner_oncology_therapy_wip | This model extracts entities related to cancer therapies, including posology entities and response to treatment, using granular labels. | Response_To_Treatment, Line_Of_Therapy, Cancer_Surgery, Radiotherapy, Immunotherapy, Targeted_Therapy, Hormonal_Therapy, Chemotherapy, Unspecific_Therapy, Route, Duration, Cycle_Count, Dosage, Frequency, Cycle_Number, Cycle_Day, Radiation_Dose |
| ner_oncology_diagnosis_wip | bert_token_classifier_ner_oncology_diagnosis_wip | This model extracts entities related to cancer diagnosis, including the presence of metastasis. | Grade, Staging, Tumor_Size, Adenopathy, Pathology_Result, Histological_Type, Metastasis, Cancer_Score, Cancer_Dx, Invasion, Tumor_Finding, Performance_Status |
| ner_oncology_wip | bert_token_classifier_ner_oncology_wip | This model extracts more than 40 oncology-related entities. | Histological_Type, Direction, Staging, Cancer_Score, Imaging_Test, Cycle_Number, Tumor_Finding, Site_Lymph_Node, Invasion, Response_To_Treatment, Smoking_Status, Tumor_Size, Cycle_Count, Adenopathy, Age, Biomarker_Result, Unspecific_Therapy, Site_Breast, Chemotherapy, Targeted_Therapy, Radiotherapy, Performance_Status, Pathology_Test, Site_Other_Body_Part, Cancer_Surgery, Line_Of_Therapy, Pathology_Result, Hormonal_Therapy, Site_Bone, Biomarker, Immunotherapy, Cycle_Day, Frequency, Route, Duration, Death_Entity, Metastasis, Site_Liver, Cancer_Dx, Grade, Date, Site_Lung, Site_Brain, Relative_Date, Race_Ethnicity, Gender, Oncogene, Dosage, Radiation_Dose |
| ner_oncology_tnm_wip | bert_token_classifier_ner_oncology_tnm_wip | This model extracts mentions related to TNM staging. | Lymph_Node, Staging, Lymph_Node_Modifier, Tumor_Description, Tumor, Metastasis, Cancer_Dx |
| ner_oncology_anatomy_general_wip | bert_token_classifier_ner_oncology_anatomy_general_wip | This model extracts anatomical entities. | Anatomical_Site, Direction |
| ner_oncology_demographics_wip | bert_token_classifier_ner_oncology_demographics_wip | This model extracts demographic information, including smoking status. | Age, Gender, Smoking_Status, Race_Ethnicity |
| ner_oncology_test_wip | bert_token_classifier_ner_oncology_test_wip | This model extracts mentions of oncology-related tests. | Oncogene, Biomarker, Biomarker_Result, Imaging_Test, Pathology_Test |
| ner_oncology_unspecific_posology_wip | bert_token_classifier_ner_oncology_unspecific_posology_wip | This model extracts any mention of cancer therapies and posology information using general labels. | Cancer_Therapy, Posology_Information |
| ner_oncology_anatomy_granular_wip | bert_token_classifier_ner_oncology_anatomy_granular_wip | This model extracts anatomical entities using granular labels. | Direction, Site_Lymph_Node, Site_Breast, Site_Other_Body_Part, Site_Bone, Site_Liver, Site_Lung, Site_Brain |
| ner_oncology_response_to_treatment_wip | bert_token_classifier_ner_oncology_response_to_treatment_wip | This model extracts entities related to the patient’s response to cancer treatment. | Response_To_Treatment, Size_Trend, Line_Of_Therapy |
| ner_oncology_biomarker_wip | bert_token_classifier_ner_oncology_biomarker_wip | This model extracts biomarkers and their results. | Biomarker, Biomarker_Result |
| ner_oncology_posology_wip | bert_token_classifier_ner_oncology_posology_wip | This model extracts oncology-specific posology information and cancer therapies. | Cycle_Number, Cycle_Count, Radiotherapy, Cancer_Surgery, Cycle_Day, Frequency, Route, Cancer_Therapy, Duration, Dosage, Radiation_Dose |

F1 Scores:

| label | f1 | label | f1 | label | f1 | label | f1 | label | f1 |
|-------|----|-------|----|-------|----|-------|----|-------|----|
| Adenopathy | 0.73 | Cycle_Day | 0.83 | Histological_Type | 0.71 | Posology_Information | 0.88 | Site_Lymph_Node | 0.91 |
| Age | 0.97 | Cycle_Number | 0.79 | Hormonal_Therapy | 0.90 | Race_Ethnicity | 0.86 | Smoking_Status | 0.82 |
| Anatomical_Site | 0.83 | Date | 0.97 | Imaging_Test | 0.90 | Radiation_Dose | 0.87 | Staging | 0.85 |
| Biomarker | 0.89 | Death_Entity | 0.82 | Invasion | 0.80 | Radiotherapy | 0.90 | Targeted_Therapy | 0.87 |
| Biomarker_Result | 0.82 | Direction | 0.82 | Line_Of_Therapy | 0.91 | Relative_Date | 0.79 | Tumor | 0.91 |
| Cancer_Dx | 0.92 | Dosage | 0.91 | Lymph_Node | 0.86 | Route | 0.84 | Tumor_Description | 0.81 |
| Cancer_Surgery | 0.85 | Duration | 0.77 | Lymph_Node_Modifier | 0.75 | Site_Bone | 0.80 | Tumor_Finding | 0.92 |
| Cancer_Therapy | 0.90 | Frequency | 0.88 | Metastasis | 0.95 | Site_Brain | 0.78 | Tumor_Size | 0.88 |
| Chemotherapy | 0.90 | Gender | 0.99 | Oncogene | 0.77 | Site_Breast | 0.88 | | |
| Cycle_Count | 0.81 | Grade | 0.81 | Pathology_Test | 0.79 | Site_Lung | 0.79 | | |

NER Model Example:

...
medical_ner = MedicalNerModel.pretrained("ner_oncology_wip", "en", "clinical/models") \
                   .setInputCols(["sentence", "token", "embeddings"]) \
                   .setOutputCol("ner")
...

sample_text = "The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER. Postoperatively, radiotherapy was administered to her breast."
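
For context, here is a minimal sketch of the stages elided by "..." above, assuming the embeddings_clinical word embeddings these clinical NER models are typically trained with:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter

# standard preprocessing stages producing the "sentence", "token" and
# "embeddings" columns consumed by the NER model above
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# groups IOB tags into the chunks shown in the results table below
ner_converter = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer,
                            embeddings, medical_ner, ner_converter])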

BERT-Based Token Classification Model Example:

...
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_oncology_wip", "en", "clinical/models")\
    .setInputCols("token", "document")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)
...

sample_text = "The had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. The tumor was positive for ER. Postoperatively, radiotherapy was administered to her breast."

Results:

+------------------------------+---------------------+
|chunk                         |ner_label            |
+------------------------------+---------------------+
|left                          |Direction            |
|mastectomy                    |Cancer_Surgery       |
|axillary lymph node dissection|Cancer_Surgery       |
|left                          |Direction            |
|breast cancer                 |Cancer_Dx            |
|twenty years ago              |Relative_Date        |
|tumor                         |Tumor_Finding        |
|positive                      |Biomarker_Result     |
|ER                            |Biomarker            |
|radiotherapy                  |Radiotherapy         |
|her                           |Gender               |
|breast                        |Site_Breast          |
+------------------------------+---------------------+

New Oncological Assertion Status Models

We have 8 new oncological assertion status detection models.

| model name | description | predicted entities |
|------------|-------------|--------------------|
| assertion_oncology_wip | This model identifies the assertion status of different oncology-related entities. | Medical_History, Family_History, Possible, Hypothetical_Or_Absent |
| assertion_oncology_problem_wip | This assertion model identifies the status of Cancer_Dx extractions and other problem entities. | Present, Possible, Hypothetical, Absent, Family |
| assertion_oncology_treatment_wip | This model identifies the assertion status of treatments mentioned in text. | Present, Planned, Past, Hypothetical, Absent |
| assertion_oncology_response_to_treatment_wip | This assertion model identifies whether the response to treatment mentioned in the text actually happened, or whether it is mentioned as something absent or hypothetical. | Present_Or_Past, Hypothetical_Or_Absent |
| assertion_oncology_test_binary_wip | This assertion model identifies whether a test mentioned in the text was actually used, or whether it is mentioned as something absent or hypothetical. | Present_Or_Past, Hypothetical_Or_Absent |
| assertion_oncology_smoking_status_wip | This assertion model classifies the smoking status of the patient. | Absent, Past, Present |
| assertion_oncology_family_history_wip | This assertion model identifies whether an entity refers to a family member. | Family_History, Other |
| assertion_oncology_demographic_binary_wip | This assertion model identifies whether demographic entities refer to the patient or to someone else. | Patient, Someone_Else |

Example:

...
assertion = AssertionDLModel.pretrained("assertion_oncology_problem_wip", "en", "clinical/models") \
                .setInputCols(["sentence", 'ner_chunk', "embeddings"]) \
                .setOutputCol("assertion")
...

sample_text = "Considering the findings, the patient may have a breast cancer. There are no signs of metastasis. Family history positive for breast cancer in her maternal grandmother."

Results:

+-------------+----------+---------+
|        chunk| ner_label|assertion|
+-------------+----------+---------+
|breast cancer| Cancer_Dx| Possible|
|   metastasis|Metastasis|   Absent|
|breast cancer| Cancer_Dx|   Family|
+-------------+----------+---------+

New Oncological Relation Extraction Models

We are releasing 7 new RelationExtractionModel and 7 new RelationExtractionDLModel models to extract relations between various oncological concepts.

| model name | description | predicted relations |
|------------|-------------|---------------------|
| re_oncology_size_wip | This model links Tumor_Size extractions to their corresponding Tumor_Finding extractions. | is_size_of, O |
| re_oncology_biomarker_result_wip | This model links Biomarker and Oncogene extractions to their corresponding Biomarker_Result extractions. | is_finding_of, O |
| re_oncology_granular_wip | This model identifies four relation types. | is_size_of, is_finding_of, is_date_of, is_location_of, O |
| re_oncology_location_wip | This model links extractions from anatomical entities (such as Site_Breast or Site_Lung) to other clinical entities (such as Tumor_Finding or Cancer_Surgery). | is_location_of, O |
| re_oncology_temporal_wip | This model links Date and Relative_Date extractions to clinical entities such as Test or Cancer_Dx. | is_date_of, O |
| re_oncology_test_result_wip | This model links test extractions to their corresponding results. | is_finding_of, O |
| re_oncology_wip | This model identifies relations between dates and other clinical entities, between tumor mentions and their size, between anatomical entities and other clinical entities, and between tests and their results. | is_related_to, O |
| redl_oncology_size_biobert_wip | This model links Tumor_Size extractions to their corresponding Tumor_Finding extractions. | is_size_of, O |
| redl_oncology_biomarker_result_biobert_wip | This model links Biomarker and Oncogene extractions to their corresponding Biomarker_Result extractions. | is_finding_of, O |
| redl_oncology_location_biobert_wip | This model links extractions from anatomical entities (such as Site_Breast or Site_Lung) to other clinical entities (such as Tumor_Finding or Cancer_Surgery). | is_location_of, O |
| redl_oncology_temporal_biobert_wip | This model links Date and Relative_Date extractions to clinical entities such as Test or Cancer_Dx. | is_date_of, O |
| redl_oncology_test_result_biobert_wip | This model links test extractions to their corresponding results. | is_finding_of, O |
| redl_oncology_biobert_wip | This model identifies relations between dates and other clinical entities, between tumor mentions and their size, between anatomical entities and other clinical entities, and between tests and their results. | is_related_to |
| redl_oncology_granular_biobert_wip | This model identifies four relation types. | is_date_of, is_finding_of, is_location_of, is_size_of, O |

F1 Scores and Samples:

| label | F1 score | sample text | results |
|-------|----------|-------------|---------|
| is_finding_of | 0.95 | “Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A.” | negative - thyroid transcription factor-1, negative - napsin |
| is_date_of | 0.81 | “A mastectomy was performed two months ago.” | mastectomy - two months ago |
| is_location_of | 0.92 | “In April 2011, she first noticed a lump in her right breast.” | lump - breast |
| is_size_of | 0.86 | “The patient presented a 2 cm mass in her left breast.” | 2 cm - mass |
| is_related_to | 0.87 | “A mastectomy was performed two months ago.” | mastectomy - two months ago |

Example:

...
re_model = RelationExtractionModel.pretrained("re_oncology_size_wip", "en", "clinical/models") \
    .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"]) \
    .setOutputCol("relations") \
    .setRelationPairs(["Tumor_Finding-Tumor_Size", "Tumor_Size-Tumor_Finding"]) \
    .setMaxSyntacticDistance(10)
...

sample_text = "The patient presented a 2 cm mass in her left breast, and the tumor in her other breast was 3 cm long."

Results:

+----------+-------------+------+-------------+------+----------+
|  relation|      entity1|chunk1|      entity2|chunk2|confidence|
+----------+-------------+------+-------------+------+----------+
|is_size_of|   Tumor_Size|  2 cm|Tumor_Finding|  mass| 0.8532705|
|is_size_of|Tumor_Finding| tumor|   Tumor_Size|  3 cm| 0.8156226|
+----------+-------------+------+-------------+------+----------+
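
The relation extraction example above consumes pos_tags and dependencies columns in addition to tokens, embeddings and NER chunks. A minimal sketch of the two stages that produce them, assuming the standard clinical POS tagger and the public dependency parser:

from sparknlp.annotator import PerceptronModel, DependencyParserModel

# part-of-speech tags consumed by RelationExtractionModel
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

# dependency tree used when applying setMaxSyntacticDistance
dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")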

Brand New NerQuestionGenerator Annotator for Automated Prompt Generation for a QA-Based Zero-Shot NER Model

This annotator helps you build questions on the fly using 2 entities from different labels (preferably a subject and a verb). For example, suppose you have an NER model able to detect PATIENT and ADMISSION in the following text:

John Smith was admitted Sep 3rd to Mayo Clinic

  • PATIENT: John Smith
  • ADMISSION: was admitted

You can add the following annotator to construct questions using PATIENT and ADMISSION:

# setEntities1 says which entity from NER goes first in the question
# setEntities2 says which entity from NER goes second in the question
# setQuestionMark to True adds a '?' at the end of the sentence (after entity 2)
# To sum up, the pattern is     [QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]

qagenerator = NerQuestionGenerator()\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("question")\
  .setQuestionMark(True)\
  .setQuestionPronoun("When")\
  .setStrategyType("Paired")\
  .setEntities1(["PATIENT"])\
  .setEntities2(["ADMISSION"])

In the question column you will find: When John Smith was admitted? Likewise, you could use Where or any other question pronoun you may need.

You can use those questions in a QuestionAnsweringModel or a ZeroShotNER model (any model that requires a question as input). Let’s look at the QA case.

qa = BertForQuestionAnswering.pretrained("bert_qa_spanbert_finetuned_squadv1","en") \
  .setInputCols(["question", "document"]) \
  .setOutputCol("answer") \
  .setCaseSensitive(True)

The result will be:

+--------------------------------------------------------+-----------------------------+
|question                                                |answer                       |
+--------------------------------------------------------+-----------------------------+
|[{document, 0, 25, When John Smith was admitted ? ...}] |[{chunk, 0, 8, Sep 3rd ...}] |
+--------------------------------------------------------+-----------------------------+

Strategies:

  • Paired: The first chunk of Entity 1 will be grouped with the first chunk of Entity 2, the second with the second, the third with the third, etc. (one-vs-one)
  • Combined: A more flexible strategy for when the number of chunks in Entity 1 is not aligned with the number of chunks in Entity 2. The first chunk from Entity 1 will be grouped with all chunks in Entity 2, the second chunk in Entity 1 will again be grouped with all the chunks in Entity 2, etc. (one-vs-all). See the sketch after this list.
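
A small illustrative sketch of how the two strategies pair hypothetical chunks:

# hypothetical chunks produced by the upstream NER model
entity1_chunks = ["John Smith", "Mary Jones"]        # PATIENT
entity2_chunks = ["was admitted", "was discharged"]  # ADMISSION

# Paired (one-vs-one): element-wise pairing of the two lists
paired = [f"When {e1} {e2}?" for e1, e2 in zip(entity1_chunks, entity2_chunks)]
# -> ["When John Smith was admitted?", "When Mary Jones was discharged?"]

# Combined (one-vs-all): cross product of the two lists
combined = [f"When {e1} {e2}?" for e1 in entity1_chunks for e2 in entity2_chunks]
# -> four questions, including "When John Smith was discharged?"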

Updated ALAB (Annotation Lab) Module Becoming a Full-Fledged Suite to Manage Activities on ALAB via Its API Remotely

We are releasing a new module for interacting with Annotation Lab with minimal code. Users can now create/edit/delete projects and their tasks. They can also upload preannotations, export annotations, and generate training data for various models. Complete documentation and a tutorial are available at the Spark NLP Workshop. The following is a comprehensive list of supported tasks:

  • Getting details of all projects in the Annotation Lab instance.
  • Creating new projects.
  • Deleting projects.
  • Setting & editing the configuration of projects.
  • Accessing/getting the configuration of any existing project.
  • Uploading tasks to a project.
  • Deleting tasks of a project.
  • Generating preannotations for a project using custom Spark NLP pipelines.
  • Uploading preannotations to a project.
  • Generating datasets for training classification models.
  • Generating datasets for training NER models.
  • Generating datasets for training assertion models.
  • Generating datasets for training relation extraction models.

Using the Annotation Lab Module:

from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()
alab.set_credentials(username=username, password=password, client_secret=client_secret, annotationlab_url=annotationlab_url)

# create a new project
alab.create_project('alab_demo')

# assign ner labels to the project
alab.set_project_config('alab_demo', ner_labels=['Age', 'Gender'])

# upload tasks
alab.upload_tasks('alab_demo', task_list=[txt1, txt2...])

# export tasks
alab.get_annotations('alab_demo')
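
You can also generate training data from the exported annotations. A hedged sketch for the NER task listed above; the method and parameter names below are assumptions, so check the module documentation in the Spark NLP Workshop:

# hedged sketch: get_conll_data and its parameters are assumptions based on
# the task list above; consult the ALAB module documentation for exact names
alab.get_conll_data(
    spark=spark,  # the active SparkSession
    input_json_path="alab_demo_annotations.json",  # hypothetical exported annotations file
    output_name="conll_demo"
)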

New Pretrained Assertion Status Detection Model (assertion_jsl_augmented) to Classify The Negativity & Assertion Scope of Medical Concepts

We are releasing a new assertion_jsl_augmented model to classify the assertion status of clinical entities with the labels Present, Absent, Possible, Planned, Past, Family, Hypothetical and SomeoneElse.

See Models Hub Page for more details.

Example:

...
clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
...

sample_text = """Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted.
She denies pain. Her father is paralyzed and it is a stressor for her. She was bullied by her boss and got antidepressant.
We prescribed sleeping pills for her current insomnia"""

Results:

+--------------+-----+---+-------------------------+-----------+---------+
|ner_chunk     |begin|end|ner_label                |sentence_id|assertion|
+--------------+-----+---+-------------------------+-----------+---------+
|headache      |14   |21 |Symptom                  |0          |Past     |
|anxious       |57   |63 |Symptom                  |0          |Possible |
|alopecia      |89   |96 |Disease_Syndrome_Disorder|1          |Absent   |
|pain          |116  |119|Symptom                  |2          |Absent   |
|paralyzed     |136  |144|Symptom                  |3          |Family   |
|antidepressant|212  |225|Drug_Ingredient          |4          |Past     |
|sleeping pills|242  |255|Drug_Ingredient          |5          |Planned  |
|insomnia      |273  |280|Symptom                  |5          |Present  |
+--------------+-----+---+-------------------------+-----------+---------+

New Chunk Mapper Models and Pretrained Pipeline to Map Entities (Phrases) to Their Corresponding ICD-9, ICD-10-CM and RxNorm Codes

We are releasing 4 new chunk mapper models and a pretrained pipeline that map entities to their corresponding ICD-9, ICD-10-CM and RxNorm codes.

| model name | description |
|------------|-------------|
| rxnorm_normalized_mapper | Maps drug entities (phrases) to their corresponding RxNorm codes and normalized resolutions. |
| icd9_mapper | Maps entities to their corresponding ICD-9-CM codes. |
| icd10_icd9_mapper | Maps ICD-10-CM codes to their corresponding ICD-9-CM codes. |
| icd9_icd10_mapper | Maps ICD-9-CM codes to their corresponding ICD-10-CM codes. |
| icd10_icd9_mapping (pipeline) | This pretrained pipeline maps ICD-10-CM codes to ICD-9-CM codes without using any text data. |

Model Example:

...
chunkerMapper = ChunkMapperModel.pretrained("rxnorm_normalized_mapper", "en", "clinical/models")\
        .setInputCols(["ner_chunk"])\
        .setOutputCol("mappings")\
        .setRels(["rxnorm_code", "normalized_name"])
...

sample_text = "The patient was given Zyrtec 10 MG, Adapin 10 MG Oral Capsule, Septi-Soothe 0.5 Topical Spray"

Results:

+------------------------------+-----------+--------------------------------------------------------------+
|ner_chunk                     |rxnorm_code|normalized_name                                               |
+------------------------------+-----------+--------------------------------------------------------------+
|Zyrtec 10 MG                  |1011483    |cetirizine hydrochloride 10 MG [Zyrtec]                       |
|Adapin 10 MG Oral Capsule     |1000050    |doxepin hydrochloride 10 MG Oral Capsule [Adapin]             |
|Septi-Soothe 0.5 Topical Spray|1000046    |chlorhexidine diacetate 0.5 MG/ML Topical Spray [Septi-Soothe]|
+------------------------------+-----------+--------------------------------------------------------------+
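
The code-to-code mappers (icd10_icd9_mapper and icd9_icd10_mapper) can also be fed raw codes without an NER stage, for example by promoting the whole document to a chunk with Doc2Chunk. A minimal sketch; the relation name icd9_code is an assumption, so check the model card:

from sparknlp.base import DocumentAssembler, Doc2Chunk
from sparknlp_jsl.annotator import ChunkMapperModel

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

# treat the incoming ICD-10-CM code itself as the chunk to be mapped
doc2chunk = Doc2Chunk().setInputCols("document").setOutputCol("ner_chunk")

icd_mapper = ChunkMapperModel.pretrained("icd10_icd9_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["icd9_code"])  # assumed relation name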

Pipeline Example:

from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("icd10_icd9_mapping", "en", "clinical/models")
pipeline.annotate("Z833 A0100 A000")

Results:

| icd10_code          | icd9_code          |
|:--------------------|:-------------------|
| Z833 - A0100 - A000 | V180 - 0020 - 0010 |

New ICD-9-CM Sentence Entity Resolver Model and Pretrained Pipeline

  • sbiobertresolve_icd9 : This model maps extracted medical entities to their corresponding ICD-9-CM codes using the sbiobert_base_cased_mli Sentence BERT embeddings.

Example:

...
icd9_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd9","en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("resolution")\
    .setDistanceFunction("EUCLIDEAN")
...

sample_text = "A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with an acute hepatitis, and obesity with a body mass index (BMI) of 33.5 kg/m2."
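
The "..." above elide the stages that turn NER chunks into sentence embeddings. A minimal sketch, using the sbiobert_base_cased_mli embeddings named in the model description:

from sparknlp.base import Chunk2Doc
from sparknlp.annotator import BertSentenceEmbeddings

# re-wrap each NER chunk as its own document so it can be embedded
chunk2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")

# the same sentence embeddings the resolver was trained with
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)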

Results:

+-------------------------------------+-------+---------+------------------------------------------------+----------------------------------------------------------+
|                            ner_chunk| entity|icd9_code|                                      resolution|                                                 all_codes|
+-------------------------------------+-------+---------+------------------------------------------------+----------------------------------------------------------+
|        gestational diabetes mellitus|PROBLEM|   V12.21|[Personal history of gestational diabetes, Ne...|[V12.21, 775.1, 249, 250, 249.7, 249.71, 249.9, 249.61,...|
|subsequent type two diabetes mellitus|PROBLEM|      249|[Secondary diabetes mellitus, Diabetes mellit...|[249, 250, 249.9, 249.7, 775.1, 249.6, 249.8, V12.21, 2...|
|                   an acute hepatitis|PROBLEM|    571.1|[Acute alcoholic hepatitis, Viral hepatitis, ...|[571.1, 070, 571.42, 902.22, 279.51, 571.4, 091.62, 572...|
|                              obesity|PROBLEM|    278.0|[Overweight and obesity, Morbid obesity, Over...|[278.0, 278.01, 278.02, V77.8, 278, 278.00, 272.2, 783....|
|                    a body mass index|PROBLEM|      V85|[Body mass index [BMI], Human bite, Localized...|[V85, E928.3, 278.1, 993, E008.4, V61.5, 747.63, V85.5,...|
+-------------------------------------+-------+---------+------------------------------------------------+----------------------------------------------------------+

  • icd9_resolver_pipeline : This pretrained pipeline maps entities to their corresponding ICD-9-CM codes. You can simply feed in your text, and it will return the corresponding ICD-9-CM codes.

Example:

from sparknlp.pretrained import PretrainedPipeline
resolver_pipeline = PretrainedPipeline("icd9_resolver_pipeline", "en", "clinical/models")

sample_text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years and anisakiasis. Also, it was reported that fetal and neonatal hemorrhage"""

result = resolver_pipeline.fullAnnotate(sample_text)

Results:

+-----------------------------+---------+---------+
|chunk                        |ner_chunk|icd9_code|
+-----------------------------+---------+---------+
|gestational diabetes mellitus|PROBLEM  |V12.21   |
|anisakiasis                  |PROBLEM  |127.1    |
|fetal and neonatal hemorrhage|PROBLEM  |772      |
+-----------------------------+---------+---------+

New Shifting Days Feature in DeIdentification Using the New DocumentHashCoder Annotator

Now we can shift dates in documents by a fixed number of days per patient rather than obfuscating them randomly. The new DocumentHashCoder() annotator determines the day shift: it hashes the specified column and creates a new document column carrying the day-shift information. The DeIdentification annotator then deidentifies using this new document column. We can use the seed parameter to hash consistently.

Example:

documentHasher = DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setRangeDays(100)\
    .setNewDateShift("shift_days")\
    .setSeed(100)

de_identification = DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document2"]) \
    .setOutputCol("deid_text") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)

Results:

output.select('patientID','text', 'deid_text.result').show(truncate = False)
+---------+----------------------------------------+---------------------------------------------+
|patientID|text                                    |result                                       |
+---------+----------------------------------------+---------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Glorious Mc was discharged on 27/03/2022]   |
|A001     |Mark White was discharged on 10/04/2022 |[Kimberlee Bair was discharged on 25/05/2022]|
|A003     |John was discharged on 15/03/2022       |[Monia Richmond was discharged on 17/05/2022]|
|A003     |John Moore was discharged on 15/12/2022 |[Veleta Pollard was discharged on 16/02/2023]|
+---------+----------------------------------------+---------------------------------------------+

Instead of shifting days according to the patient ID column, we can provide the shift values directly in another column.

Example:

documentHasher = DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")

de_identification = DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document2"]) \
    .setOutputCol("deid_text") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)

Results:

+----------------------------------------+---------+---------------------------------------------+
|text                                    |dateshift|result                                       |
+----------------------------------------+---------+---------------------------------------------+
|Chris Brown was discharged on 10/02/2022|10       |[Levorn Powers was discharged on 20/02/2022] |
|Mark White was discharged on 10/04/2022 |10       |[Hall Jointer was discharged on 20/04/2022]  |
|John was discharged on 15/03/2022       |30       |[Jared Gains was discharged on 14/04/2022]   |
|John Moore was discharged on 15/12/2022 |30       |[Frederic Seitz was discharged on 14/01/2023]|
+----------------------------------------+---------+---------------------------------------------+

You can check the Clinical Deidentification Notebook for more examples.

Updated NER Model Finder Pretrained Pipeline to Help Users Find the Most Appropriate NER Model for Their Use Case with a One-Liner

We have updated the ner_model_finder pretrained pipeline and the sbertresolve_ner_model_finder resolver model with 70 clinical NER models and their labels.

See the Models Hub Page for more details and the Pretrained Clinical Pipelines Notebook for examples.

Support for Different Version and Year Combinations in the Medicare Risk Adjustment Score Calculation Module

Now you can calculate the CMS-HCC risk score with different version and year combinations by importing one of the following functions:

  • profileV2217, profileV2218, profileV2219, profileV2220, profileV2221, profileV2222
  • profileV2318, profileV2319
  • profileV2417, profileV2418, profileV2419, profileV2420, profileV2421, profileV2422

from sparknlp_jsl.functions import profileV2420
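
A minimal usage sketch; the DataFrame and column names below are hypothetical, and the argument list is an assumption modeled on the module's earlier profile functions, so check the notebook for the exact signature:

from sparknlp_jsl.functions import profileV2420

# hypothetical columns: icd10_code (array of ICD-10-CM codes), age, sex,
# eligibility, orec, medicaid -- the argument order is an assumption
df = df.withColumn(
    "hcc_profile",
    profileV2420(df.icd10_code, df.age, df.sex, df.eligibility, df.orec, df.medicaid)
)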

See the notebook for more details.

Core Improvements and Bug Fixes

  • ContextualParserApproach: New parameter completeContextMatch.
    This parameter lets the user define whether to require an exact match of the prefix and suffix context (see the sketch after this list).

  • Deidentification: Enhanced the default regex rules for DATE entity extraction in French deidentification.

  • ZeroShotRelationExtractionModel: Fixed an issue with setting certain parameters together; it is no longer necessary to call setRelationalCategories after downloading the model.
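
A minimal sketch of the new parameter in use, assuming the conventional setter name setCompleteContextMatch and a hypothetical cities.json rule file:

from sparknlp_jsl.annotator import ContextualParserApproach

contextual_parser = (
    ContextualParserApproach()
    .setInputCols(["sentence", "token"])
    .setOutputCol("entity")
    .setJsonPath("cities.json")  # hypothetical rule definition file
    .setCaseSensitive(True)
    # assumed setter for completeContextMatch: require the prefix and suffix
    # context words to match exactly rather than partially
    .setCompleteContextMatch(True)
)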

New and Updated Notebooks

50+ New Clinical Models and Pipelines Added & Updated in Total

  • assertion_jsl_augmented
  • rxnorm_normalized_mapper
  • ner_model_finder
  • sbertresolve_ner_model_finder
  • sbiobertresolve_icd9
  • icd9_resolver_pipeline
  • icd9_mapper
  • icd10_icd9_mapper
  • icd9_icd10_mapper
  • icd10_icd9_mapping
  • bert_qa_spanbert_finetuned_squadv1
  • ner_oncology_therapy_wip
  • ner_oncology_diagnosis_wip
  • ner_oncology_wip
  • ner_oncology_tnm_wip
  • ner_oncology_anatomy_general_wip
  • ner_oncology_demographics_wip
  • ner_oncology_test_wip
  • ner_oncology_unspecific_posology_wip
  • ner_oncology_anatomy_granular_wip
  • ner_oncology_response_to_treatment_wip
  • ner_oncology_biomarker_wip
  • ner_oncology_posology_wip
  • bert_token_classifier_ner_oncology_therapy_wip
  • bert_token_classifier_ner_oncology_diagnosis_wip
  • bert_token_classifier_ner_oncology_wip
  • bert_token_classifier_ner_oncology_tnm_wip
  • bert_token_classifier_ner_oncology_anatomy_general_wip
  • bert_token_classifier_ner_oncology_demographics_wip
  • bert_token_classifier_ner_oncology_test_wip
  • bert_token_classifier_ner_oncology_unspecific_posology_wip
  • bert_token_classifier_ner_oncology_anatomy_granular_wip
  • bert_token_classifier_ner_oncology_response_to_treatment_wip
  • bert_token_classifier_ner_oncology_biomarker_wip
  • bert_token_classifier_ner_oncology_posology_wip
  • assertion_oncology_wip
  • assertion_oncology_problem_wip
  • assertion_oncology_treatment_wip
  • assertion_oncology_response_to_treatment_wip
  • assertion_oncology_test_binary_wip
  • assertion_oncology_smoking_status_wip
  • assertion_oncology_family_history_wip
  • assertion_oncology_demographic_binary_wip
  • re_oncology_size_wip
  • re_oncology_biomarker_result_wip
  • re_oncology_granular_wip
  • re_oncology_location_wip
  • re_oncology_temporal_wip
  • re_oncology_test_result_wip
  • re_oncology_wip
  • redl_oncology_size_biobert_wip
  • redl_oncology_biomarker_result_biobert_wip
  • redl_oncology_location_biobert_wip
  • redl_oncology_temporal_biobert_wip
  • redl_oncology_test_result_biobert_wip
  • redl_oncology_biobert_wip
  • redl_oncology_granular_biobert_wip
