Healthcare NLP v4.0.2 Release Notes

4.0.2

Highlights

16 new text classification models for English and Spanish social media text related to public health topics (stress, domestic violence, vaccine status, drug reviews etc.)
Pretrained medication NER pipeline to augment posology NER models with Drugbank dataset
Pretrained medication resolver pipeline to extract RxNorm, UMLS, NDC, SNOMED CT codes and action/treatments.
New disease NER model for Spanish language
5 new chunk mapper models to convert clinical entities to relevant medical terminology (UMLS)
5 new pretrained resolver pipelines to convert clinical entities to relevant medical terminology (UMLS)
New Relation Extraction model to detect Drug and ADE relations
New module for converting Annotation Lab (ALAB) exports into formats suitable for training new models
Updated De-identification pretrained pipelines
New setBlackList() parameter in ChunkFilterer() annotator
New Doc2ChunkInternal() annotator
Listing clinical pretrained models and pipelines with one-liner
Bug fixes
New and updated notebooks
List of recently updated or added models (40+ new models and pipelines)

We are releasing 11 new MedicalBertForSequenceClassification models to classify text from social media data for English and Spanish languages.

model name	description	predicted entities
bert_sequence_classifier_ade_augmented	this model classifies tweets reporting ADEs (Adverse Drug Events).	`ADE` `noADE`
bert_sequence_classifier_health_mandates_stance_tweet	this model classifies stance in tweets about health mandates.	`FAVOR` `AGAINST` `NONE`
bert_sequence_classifier_health_mandates_premise_tweet	this model classifies premise in tweets about health mandates.	`has_premse` `has_no_premse`
bert_sequence_classifier_treatement_changes_sentiment_tweet	this model classifies treatment changes reviews in tweets as `negative` and `positive`.	`positive` `negative`
bert_sequence_classifier_drug_reviews_webmd	this model classifies drug reviews from WebMD as `negative` and `positive`.	`positive` `negative`
bert_sequence_classifier_self_reported_age_tweet	this model classifies if there is a self-reported age in social media data.	`self_report_age` `no_report`
bert_sequence_classifier_self_reported_symptoms_tweet	this model classifies self-reported COVID-19 symptoms in Spanish language tweets.	`Lit-News_mentions` `Self_reports` `non_personal_reports`
bert_sequence_classifier_self_reported_vaccine_status_tweet	this model classifies self-reported COVID-19 vaccination status in tweets.	`Vaccine_chatter` `Self_reports`
bert_sequence_classifier_self_reported_partner_violence_tweet	this model classifies self-reported intimate partner violence (IPV) in tweets.	`intimate_partner_violence` `non_intimate_partner_violence`
bert_sequence_classifier_exact_age_reddit	this model classifies if there is a self-reported age in social media forum posts (Reddit).	`self_report_age` `no_report`
bert_sequence_classifier_self_reported_stress_tweet	this model classifies stress in social media (Twitter) posts in the self-disclosure category.	`stressed` `not-stressed`

Example :

...
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_exact_age_reddit", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("class")
...
sample_text = ["Is it bad for a 19 year old it's been getting worser.",
               "I was about 10. So not quite as young as you but young."]

Results :

+-------------------------------------------------------+-----------------+
|text                                                   |class            |
+-------------------------------------------------------+-----------------+
|Is it bad for a 19 year old its been getting worser.   |[self_report_age]|
|I was about 10. So not quite as young as you but young.|[no_report]      |
+-------------------------------------------------------+-----------------+

We are releasing 5 new public health classification models.

model name	description	predicted entities
bert_sequence_classifier_health_mentions	This model can classify public health mentions in social media text	`figurative_mention` `other_mention` `health_mention`
classifierdl_health_mentions	This model can classify public health mentions in social media text	`figurative_mention` `other_mention` `health_mention`
bert_sequence_classifier_vaccine_sentiment	This model can extract information from COVID-19 Vaccine-related tweets	`neutral` `positive` `negative`
classifierdl_vaccine_sentiment	This model can extract information from COVID-19 Vaccine-related tweets	`neutral` `positive` `negative`
bert_sequence_classifier_stressor	This model can classify source of emotional stress in text.	`Family_Issues` `Financial_Problem` `Health_Fatigue_or_Physical Pain` `Other` `School` `Work` `Social_Relationships`

Example :

...
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions", "en", "clinical/models")\
     .setInputCols(["document","token"])\
     .setOutputCol("class")
...

sample_text =["Another uncle of mine had a heart attack and passed away. Will be cremated Saturday I think I ve gone numb again RIP Uncle Mike",
              "I don't wanna fall in love. If I ever did that, I think I'd have a heart attack",
              "Aluminum is a light metal that causes dementia and Alzheimer's disease. You should never put aluminum into your body (including deodorants)."]

Results :

+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|text                                                                                                                                        |result              |
+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|Another uncle of mine had a heart attack and passed away. Will be cremated Saturday I think I ve gone numb again RIP Uncle Mike             |[health_mention]    |
|I don't wanna fall in love. If I ever did that, I think I'd have a heart attack                                                             |[figurative_mention]|
|Aluminum is a light metal that causes dementia and Alzheimer's disease. You should never put aluminum into your body (including deodorants).|[other_mention]     |
+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------+

Pretrained Medication NER Pipeline to Augment Posology NER Models with Drugbank Dataset

We are releasing a pretrained medication NER pipeline to extract medications from clinical text. It is a version of the posology NER model augmented with Drugbank datasets, and it can return all the medications with a single line of code, without building a pipeline from individual models.

ner_medication_pipeline: This pretrained pipeline can detect medication entities and label them as DRUG in clinical text.

See Models Hub Page for more details.

Example :

from sparknlp.pretrained import PretrainedPipeline

medication_pipeline = PretrainedPipeline("ner_medication_pipeline", "en", "clinical/models")

text = """The patient was prescribed metformin 1000 MG, and glipizide 2.5 MG. The other patient was given Fragmin 5000 units, Xenaderm to wounds topically b.i.d. and OxyContin 30 mg."""

Results :

|--------------------|-----------|
| chunk              | ner_label |
|--------------------|-----------|
| metformin 1000 MG  | DRUG      |
| glipizide 2.5 MG   | DRUG      |
| Fragmin 5000 units | DRUG      |
| Xenaderm           | DRUG      |
| OxyContin 30 mg    | DRUG      |
|--------------------|-----------|

Pretrained Medication Resolver Pipeline to Extract RxNorm, UMLS, NDC, SNOMED CT Codes and Action/Treatments

We are releasing a medication resolver pipeline to extract medications and resolve their RxNorm, UMLS, NDC, and SNOMED CT codes and action/treatments in clinical text. You can get these codes, where available, with a single line of code, without building a pipeline from individual models.

medication_resolver_pipeline: This pretrained pipeline can detect medication entities and resolve codes if available.

Example :

from sparknlp.pretrained import PretrainedPipeline

medication_pipeline = PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models")

text = """The patient was prescribed Mycobutn 150 MG, Salagen 5 MG oral tablet,
The other patient is given Lescol 40 MG and Lidoderm 0.05 MG/MG, triazolam 0.125 MG Oral Tablet, metformin hydrochloride 1000 MG Oral Tablet"""

Results :

|---------------------------------------------|----------------|---------------------|--------------------------------------------|----------|-------------|---------------|---------------|----------|
| ner_chunk                                   |   RxNorm_Chunk | Action              | Treatment                                  | UMLS     | SNOMED_CT   | NDC_Product   | NDC_Package   | entity   |
|---------------------------------------------|----------------|---------------------|--------------------------------------------|----------|-------------|---------------|---------------|----------|
| Mycobutn 150 MG                             |         103899 | Antimiycobacterials | Infection                                  | C0353536 | -           | 00013-5301    | 00013-5301-17 | DRUG     |
| Salagen 5 MG oral tablet                    |        1000915 | Antiglaucomatous    | Cancer                                     | C0361693 | -           | 59212-0705    | 59212-0705-10 | DRUG     |
| Lescol 40 MG                                |         103919 | Hypocholesterolemic | Heterozygous Familial Hypercholesterolemia | C0353573 | -           | 00078-0234    | 00078-0234-05 | DRUG     |
| Lidoderm 0.05 MG/MG                         |        1011705 | Anesthetic          | Pain                                       | C0875706 | -           | 00247-2129    | 00247-2129-30 | DRUG     |
| triazolam 0.125 MG Oral Tablet              |         198317 | -                   | -                                          | C0690642 | 373981005   | 00054-4858    | 00054-4858-25 | DRUG     |
| metformin hydrochloride 1000 MG Oral Tablet |         861004 | -                   | -                                          | C0978482 | 376701008   | 00093-7214    | 00185-0221-01 | DRUG     |
|---------------------------------------------|----------------|---------------------|--------------------------------------------|----------|-------------|---------------|---------------|----------|
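
In the results above, a `-` cell marks a code or attribute that is not available for that medication. As a minimal sketch of handling those gaps downstream, the plain-Python helper below turns one printed result row into a dictionary and maps `-` to `None`. The names `COLUMNS` and `parse_row` are hypothetical and are not part of the library; the real pipeline output is a Spark NLP annotation structure, not this text table.

```python
# Column names copied from the header of the results table above.
COLUMNS = ["ner_chunk", "RxNorm_Chunk", "Action", "Treatment", "UMLS",
           "SNOMED_CT", "NDC_Product", "NDC_Package", "entity"]

def parse_row(line):
    """Split one '|'-delimited result row into a dict, mapping '-' cells to None."""
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    # Only a cell that is exactly '-' counts as missing; NDC codes keep their hyphens.
    return {col: (None if cell == "-" else cell) for col, cell in zip(COLUMNS, cells)}

row = parse_row(
    "| triazolam 0.125 MG Oral Tablet | 198317 | - | - | C0690642 "
    "| 373981005 | 00054-4858 | 00054-4858-25 | DRUG |"
)
```

Checking `row["Action"] is None` then distinguishes a missing action from an empty string.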

New Disease NER Model for Spanish Language

We are releasing a new MedicalBertForTokenClassifier model to extract disease entities from social media text in Spanish.

bert_token_classifier_disease_mentions_tweet: This model can extract disease entities in Spanish tweets and label them as ENFERMEDAD (disease).

See Models Hub Page for more details.

Example :

...
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_disease_mentions_tweet", "es", "clinical/models")\
  .setInputCols("token", "sentence")\
  .setOutputCol("label")\
  .setCaseSensitive(True)
...
example_text = """El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto"""

Results :

+---------------------+----------+
|chunk                |ner_label |
+---------------------+----------+
|Neumonía en el pulmón|ENFERMEDAD|
|Sinusitis            |ENFERMEDAD|
|Faringitis aguda     |ENFERMEDAD|
|infección de orina   |ENFERMEDAD|
|Gripe                |ENFERMEDAD|
+---------------------+----------+

5 New Chunk Mapper Models to Convert Clinical Entities to Relevant Medical Terminology (UMLS)

We are releasing 5 new ChunkMapperModel models to map clinical entities to their corresponding UMLS CUI codes.

Mapper Name	Source	Target
umls_clinical_drugs_mapper	Drugs	UMLS CUI
umls_clinical_findings_mapper	Clinical Findings	UMLS CUI
umls_disease_syndrome_mapper	Disease and Syndromes	UMLS CUI
umls_major_concepts_mapper	Clinical Major Concepts	UMLS CUI
umls_drug_substance_mapper	Drug Substances	UMLS CUI

Example :

...
ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
     .setInputCols(["sentence", "token", "embeddings"])\
     .setOutputCol("clinical_ner")

ner_model_converter = NerConverterInternal()\
     .setInputCols("sentence", "token", "clinical_ner")\
     .setOutputCol("ner_chunk")

chunkerMapper = ChunkMapperModel.pretrained("umls_drug_substance_mapper", "en", "clinical/models")\
       .setInputCols(["ner_chunk"])\
       .setOutputCol("mappings")\
       .setRels(["umls_code"])\
       .setLowerCase(True)
...

example_text = """The patient was given  metformin, lenvatinib and lavender 700 ml/ml"""

Results :

+------------------+---------+---------+
|         ner_chunk|ner_label|umls_code|
+------------------+---------+---------+
|         metformin|     DRUG| C0025598|
|        lenvatinib|     DRUG| C2986924|
|lavender 700 ml/ml|     DRUG| C0772360|
+------------------+---------+---------+
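
Conceptually, a chunk mapper behaves like a lookup table keyed by the chunk text. The plain-Python sketch below illustrates that idea with the three chunk/code pairs from the results above. `UMLS_TABLE` and `map_chunks` are hypothetical names used only for illustration, and the `lower_case` flag is an assumption about what `setLowerCase(True)` does; the real ChunkMapperModel may normalize text differently.

```python
# Illustrative lookup table built from the result rows above (not the real model data).
UMLS_TABLE = {
    "metformin": "C0025598",
    "lenvatinib": "C2986924",
    "lavender 700 ml/ml": "C0772360",
}

def map_chunks(chunks, table, lower_case=True):
    """Return (chunk, code) pairs; code is None when the chunk has no entry."""
    mapped = []
    for chunk in chunks:
        key = chunk.strip()
        if lower_case:
            # Assumed behavior: lowercase the chunk before the lookup.
            key = key.lower()
        mapped.append((chunk, table.get(key)))
    return mapped

pairs = map_chunks(["Metformin", "lenvatinib", "aspirin"], UMLS_TABLE)
```

A chunk without an entry, such as `aspirin` here, comes back with `None` rather than raising an error.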

5 New Pretrained Resolver Pipelines to Convert Clinical Entities to Relevant Medical Terminology (UMLS)

We now have 5 new resolver pretrained pipelines to convert clinical entities to their UMLS CUI codes. You just need to feed in your text, and the pipeline will return the corresponding UMLS codes.

Pipeline Name	Entity	Target
umls_drug_resolver_pipeline	Drugs	UMLS CUI
umls_clinical_findings_resolver_pipeline	Clinical Findings	UMLS CUI
umls_disease_syndrome_resolver_pipeline	Disease and Syndromes	UMLS CUI
umls_major_concepts_resolver_pipeline	Clinical Major Concepts	UMLS CUI
umls_drug_substance_resolver_pipeline	Drug Substances	UMLS CUI

Example :

from sparknlp.pretrained import PretrainedPipeline

pipeline= PretrainedPipeline("umls_clinical_findings_resolver_pipeline", "en", "clinical/models")

sample_text = "HTG-induced pancreatitis associated with an acute hepatitis, and obesity"

Results :

+-------------------------+---------+---------+
|chunk                    |ner_label|umls_code|
+-------------------------+---------+---------+
|HTG-induced pancreatitis |PROBLEM  |C1963198 |
|an acute hepatitis       |PROBLEM  |C4750596 |
|obesity                  |PROBLEM  |C1963185 |
+-------------------------+---------+---------+

New Relation Extraction Model to Detect Drug and ADE Relations

We are releasing a new re_ade_conversational model that can extract relations between DRUG and ADE entities from conversational texts and tag the relations as is_related or not_related.

See Models Hub Page for more details.

Example :

...
re_model = RelationExtractionModel.pretrained("re_ade_conversational", "en", "clinical/models")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setRelationPairs(["ade-drug", "drug-ade"])
...

sample_text = "E19.32 day 20 rivaroxaban diary. still residual aches and pains; only had 4 paracetamol today."

Results :

|--------------------------|----------|-------------|---------|-------------|
| chunk1                   | entity1  | chunk2      | entity2 | relation    |
|--------------------------|----------|-------------|---------|-------------|
| residual aches and pains | ADE      | rivaroxaban | DRUG    | is_related  |
| residual aches and pains | ADE      | paracetamol | DRUG    | not_related |
|--------------------------|----------|-------------|---------|-------------|
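
Since the model tags every candidate pair, a common next step is to keep only the pairs tagged `is_related`. The plain-Python sketch below does that on the two rows from the results table; `related_pairs` is a hypothetical helper, not a library function, and the tuple layout is an assumption chosen to mirror the table columns.

```python
# Rows copied from the results table above: (chunk1, entity1, chunk2, entity2, relation).
relations = [
    ("residual aches and pains", "ADE", "rivaroxaban", "DRUG", "is_related"),
    ("residual aches and pains", "ADE", "paracetamol", "DRUG", "not_related"),
]

def related_pairs(rows):
    """Keep only the (drug, ade) pairs tagged as is_related."""
    pairs = []
    for chunk1, entity1, chunk2, entity2, relation in rows:
        if relation != "is_related":
            continue
        # Normalize the pair order so the DRUG chunk always comes first.
        if entity1 == "DRUG":
            pairs.append((chunk1, chunk2))
        else:
            pairs.append((chunk2, chunk1))
    return pairs

drug_ade_pairs = related_pairs(relations)
```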

New Module for Converting Annotation Lab (ALAB) Exports Into Suitable Formats for Training New Models

We have a new sparknlp_jsl.alab module with functions for converting ALAB JSON exports into suitable formats for training NER, Assertion and Relation Extraction models.

Example :

from sparknlp_jsl.alab import get_conll_data, get_assertion_data, get_relation_extraction_data

get_conll_data(spark=spark, input_json_path="alab_demo.json", output_name="conll_demo")

assertion_df = get_assertion_data(spark=spark, input_json_path = 'alab_demo.json', assertion_labels = ['ABSENT'], relevant_ner_labels = ['PROBLEM', 'TREATMENT'])

relation_df = get_relation_extraction_data(spark=spark, input_json_path='alab_demo.json')

Each of these functions takes more than 10 arguments, giving you the flexibility to convert your annotations into trainable formats. The arguments control tokenization, ground truth selection, negative annotations, negative annotation weights, task exclusions, and more. To find out how to make the best use of these functions, head over to this repository.

Updated De-identification Pretrained Pipelines

We have updated the de-identification pretrained pipelines for better performance. This includes an update to the clinical_deidentification pretrained pipeline and a new lightweight version, clinical_deidentification_slim.

Example :

from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")
slim_deid_pipeline = PretrainedPipeline("clinical_deidentification_slim", "en", "clinical/models")

sample_text = "Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435"

Results :

Name : <PATIENT>, Record date: <DATE>, <MEDICALRECORD>
Name : [**************], Record date: [********], [****]
Name : ****, Record date: ****, ****
Name : Alexia Mcgill, Record date: 2093-02-19, Y138038
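
The first three result lines appear to follow three masking styles: entity labels, bracketed asterisks that keep the original length, and a fixed run of asterisks. The fourth line appears to replace entities with fake values. As a rough plain-Python sketch of the three masking styles only, assuming the entity spans are already known: the function `mask`, the style names, and the hand-built spans are all hypothetical, the sample is shortened to the name and date, and this is not how the pipeline is implemented.

```python
def mask(text, spans, style):
    """Replace each (start, end, label) span in text according to style.

    style is one of the hypothetical names "label", "same_length", "fixed".
    """
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        if style == "label":
            out.append(f"<{label}>")
        elif style == "same_length":
            # The brackets count toward the length, so the text keeps its width
            # (for spans of at least two characters).
            out.append("[" + "*" * max(end - start - 2, 0) + "]")
        elif style == "fixed":
            out.append("****")
        else:
            raise ValueError(f"unknown style: {style}")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

text = "Name : Hendrickson, Ora, Record date: 2093-01-13"
name, date = "Hendrickson, Ora", "2093-01-13"
# Hand-built spans standing in for what an NER stage would detect.
spans = [
    (text.index(name), text.index(name) + len(name), "PATIENT"),
    (text.index(date), text.index(date) + len(date), "DATE"),
]
```

Calling `mask(text, spans, "label")`, `mask(text, spans, "same_length")`, and `mask(text, spans, "fixed")` gives one string per style for the shortened sample.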

New `setBlackList()` Parameter in `ChunkFilterer()` Annotator

We are releasing a new setBlackList() parameter in the ChunkFilterer() annotator. ChunkFilterer() lets through every chunk except those that match the list of phrases or regex rules in the setBlackList() parameter.

Example :

...
chunk_filterer = ChunkFilterer()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setCriteria("isin")\
    .setBlackList(['severe fever', 'severe cough'])
...

example_text= """Patient with severe fever, severe cough, sore throat, stomach pain, and a headache."""

Results :

+-------------------------------------------------------------------+---------------------------------------+
|ner_chunk                                                          |chunk_filtered                         |
+-------------------------------------------------------------------+---------------------------------------+
|[severe fever, severe cough, sore throat, stomach pain, a headache]|[sore throat, stomach pain, a headache]|
+-------------------------------------------------------------------+---------------------------------------+
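
The filtering rule described above can be sketched in plain Python as follows. This is only an illustration of the blacklist idea, not the annotator's implementation: `filter_chunks` is a hypothetical helper, and the exact matching rules for the `isin` and `regex` criteria are assumptions.

```python
import re

def filter_chunks(chunks, blacklist, criteria="isin"):
    """Drop chunks that match the blacklist and keep everything else.

    criteria="isin" compares whole chunks; criteria="regex" treats each
    blacklist entry as a pattern. The matching rules here are assumptions.
    """
    if criteria == "isin":
        blocked = set(blacklist)
        return [c for c in chunks if c not in blocked]
    if criteria == "regex":
        patterns = [re.compile(p) for p in blacklist]
        return [c for c in chunks if not any(p.search(c) for p in patterns)]
    raise ValueError(f"unsupported criteria: {criteria}")

ner_chunks = ["severe fever", "severe cough", "sore throat", "stomach pain", "a headache"]
kept = filter_chunks(ner_chunks, ["severe fever", "severe cough"])
```

With the regex criteria, a single pattern such as `^severe\b` would remove the same two chunks in this sketch.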

New `Doc2ChunkInternal()` Annotator

We are releasing a Doc2ChunkInternal() annotator. This is a licensed version of the open source Doc2Chunk() annotator. You can now customize the tokenization step within Doc2Chunk(). This will be quite handy when it comes to training custom assertion models.

Example :

...
doc2ChunkInternal = Doc2ChunkInternal()\
    .setInputCols("document","token")\
    .setStartCol("start")\
    .setChunkCol("target")\
    .setOutputCol("doc2chunkInternal")

...

df= spark.createDataFrame([
    ["The mass measures 4 x 3.5cm in size more.",8,"size"],
    ["The mass measures 4 x 3.5cm in size more.",9,"size"]]).toDF("sentence","start", "target")

Results :

+-----------------------------------------+-----+------+--------------------------------------------------------+-----------------------------------------------------------+
|                                 sentence|start|target|                                       doc2chunkInternal|                                                  doc2chunk|
+-----------------------------------------+-----+------+--------------------------------------------------------+-----------------------------------------------------------+
|The mass measures 4 x 3.5cm in size more.|    8|  size|[{chunk, 31, 34, size, {sentence -> 0, chunk -> 0}, []}]|[{chunk, 31, 34, size, {sentence -> 0, chunk -> 0}, []}]   |
|The mass measures 4 x 3.5cm in size more.|    9|  size|                                                      []|[{chunk, 31, 34, size, {sentence -> 0, chunk -> 0}, []}]   |
+-----------------------------------------+-----+------+--------------------------------------------------------+-----------------------------------------------------------+

Listing Pretrained Clinical Models and Pipelines with One-Liner

InternalResourceDownloader now provides returnPrivatePipelines() and returnPrivateModels(), which return the licensed pretrained pipelines and models as lists.

Example :

from sparknlp_jsl.pretrained import InternalResourceDownloader

# pipelines = InternalResourceDownloader.returnPrivatePipelines()
assertion_models = InternalResourceDownloader.returnPrivateModels("AssertionDLModel")

Results :

[['assertion_ml', 'en', '2.0.2'],
 ['assertion_dl', 'en', '2.0.2'],
 ['assertion_dl_healthcare', 'en', '2.7.2'],
 ['assertion_dl_biobert', 'en', '2.7.2'],
 ['assertion_dl', 'en', '2.7.2'],
 ['assertion_dl_radiology', 'en', '2.7.4'],
 ['assertion_jsl_large', 'en', '3.1.2'],
 ['assertion_jsl', 'en', '3.1.2'],
 ['assertion_dl_scope_L10R10', 'en', '3.4.2'],
 ['assertion_dl_biobert_scope_L10R10', 'en', '3.4.2'],
 ['assertion_oncology_treatment_binary_wip', 'en', '3.5.0']]
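
The same model name can appear more than once in the returned list, as `assertion_dl` does above. A minimal sketch for keeping only the highest version per name and language, assuming each entry is a `[name, lang, version]` triple with a dotted numeric version as in the results shown; `latest_versions` is a hypothetical helper, not part of the library.

```python
def latest_versions(rows):
    """Map each (name, lang) to its highest version among [name, lang, version] rows."""
    latest = {}
    for name, lang, version in rows:
        key = (name, lang)
        # Compare versions numerically, so that "2.10.0" would rank above "2.7.2".
        parsed = tuple(int(part) for part in version.split("."))
        if key not in latest or parsed > latest[key][0]:
            latest[key] = (parsed, version)
    return {key: version for key, (_, version) in latest.items()}

# A few rows copied from the results above.
rows = [
    ["assertion_ml", "en", "2.0.2"],
    ["assertion_dl", "en", "2.0.2"],
    ["assertion_dl_healthcare", "en", "2.7.2"],
    ["assertion_dl", "en", "2.7.2"],
    ["assertion_jsl", "en", "3.1.2"],
]
newest = latest_versions(rows)
```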

Bug Fixes

ZeroShotRelationExtractionModel: Fixed the issue that blocked the use of this annotator.
AnnotationToolJsonReader: Fixed the issue with custom pipeline usage in this annotator.
RelationExtractionApproach: Fixed issues related to training logs and inference.

New and Updated Notebooks

Clinical Named Entity Recognition Notebook: Added new getPrivateModel() feature
Clinical Entity Resolvers Notebook: Added an example of resolver pretrained pipelines
Pretrained Clinical Pipelines Notebook: Pipeline list updated and examples of resolver pretrained pipelines were added
Chunk Mapping Notebook: New mapper models added into model list
All certification notebooks updated with v4.0.0.

List of Recently Updated and Added Models and Pretrained Pipelines

bert_token_classifier_ner_anatem
bert_token_classifier_ner_bc2gm_gene
bert_token_classifier_ner_bc4chemd_chemicals
bert_token_classifier_ner_bc5cdr_chemicals
bert_token_classifier_ner_bc5cdr_disease
bert_token_classifier_ner_jnlpba_cellular
bert_token_classifier_ner_linnaeus_species
bert_token_classifier_ner_ncbi_disease
bert_token_classifier_ner_species
bert_sequence_classifier_ade_augmented
bert_sequence_classifier_health_mandates_stance_tweet
bert_sequence_classifier_health_mandates_premise_tweet
bert_sequence_classifier_treatement_changes_sentiment_tweet
bert_sequence_classifier_drug_reviews_webmd
bert_sequence_classifier_self_reported_age_tweet
bert_sequence_classifier_self_reported_symptoms_tweet => es
bert_sequence_classifier_self_reported_vaccine_status_tweet
bert_sequence_classifier_self_reported_partner_violence_tweet
bert_sequence_classifier_exact_age_reddit
bert_sequence_classifier_self_reported_stress_tweet
bert_token_classifier_disease_mentions_tweet => es
bert_token_classifier_ner_ade_tweet_binary
bert_token_classifier_ner_pathogen
clinical_deidentification
clinical_deidentification_slim
umls_clinical_drugs_mapper
umls_clinical_findings_mapper
umls_disease_syndrome_mapper
umls_major_concepts_mapper
umls_drug_substance_mapper
umls_drug_resolver_pipeline
umls_clinical_findings_resolver_pipeline
umls_disease_syndrome_resolver_pipeline
umls_major_concepts_resolver_pipeline
umls_drug_substance_resolver_pipeline
classifierdl_health_mentions
bert_sequence_classifier_health_mentions
ner_medication_pipeline
bert_sequence_classifier_vaccine_sentiment
classifierdl_vaccine_sentiment
bert_sequence_classifier_stressor
re_ade_conversational
medication_resolver_pipeline
