Spark NLP for Healthcare Release Notes 4.0.2

 

4.0.2

Highlights

  • 16 new text classification models for English and Spanish social media text related to public health topics (stress, domestic violence, vaccine status, drug reviews etc.)
  • Pretrained medication NER pipeline to augment posology NER models with Drugbank dataset
  • Pretrained medication resolver pipeline to extract RxNorm, UMLS, NDC, SNOMED CT codes and action/treatments.
  • New disease NER model for Spanish language
  • 5 new chunk mapper models to convert clinical entities to relevant medical terminology (UMLS)
  • 5 new pretrained resolver pipelines to convert clinical entities to relevant medical terminology (UMLS)
  • New Relation Extraction model to detect Drug and ADE relations
  • New module for converting Annotation Lab (ALAB) exports into formats suitable for training new models
  • Updated De-identification pretrained pipelines
  • New setBlackList() parameter in ChunkFilterer() annotator
  • New Doc2ChunkInternal() annotator
  • Listing clinical pretrained models and pipelines with one-liner
  • Bug fixes
  • New and updated notebooks
  • List of recently updated or added models (40+ new models and pipelines)
  • We are releasing 11 new MedicalBertForSequenceClassification models to classify text from social media data for English and Spanish languages.
model name description predicted entities
bert_sequence_classifier_ade_augmented this model classify tweets reporting ADEs (Adverse Drug Events). ADE noADE
bert_sequence_classifier_health_mandates_stance_tweet this model classifies stance in tweets about health mandates. FAVOR AGAINST NONE
bert_sequence_classifier_health_mandates_premise_tweet this model classifies premise in tweets about health mandates. has_premse has_no_premse
bert_sequence_classifier_treatement_changes_sentiment_tweet this model classifies treatment changes reviews in tweets as negative and positive. positive negative
bert_sequence_classifier_drug_reviews_webmd this model classifies drug reviews from WebMD as negative and positive. positive negative
bert_sequence_classifier_self_reported_age_tweet this model classifies if there is a self-reported age in social media data. self_report_age no_report
bert_sequence_classifier_self_reported_symptoms_tweet this model classifies self-reported COVID-19 symptoms in Spanish language tweets. Lit-News_mentions Self_reports non_personal_reports
bert_sequence_classifier_self_reported_vaccine_status_tweet this model classifies self-reported COVID-19 vaccination status in tweets. Vaccine_chatter Self_reports
bert_sequence_classifier_self_reported_partner_violence_tweet this model classifies self-reported Intimate partner violence (IPV) in tweets. intimate_partner_violence non_intimate_partner_violence
bert_sequence_classifier_exact_age_reddit this model classifies if there is a self-reported age in social media forum posts (Reddit). self_report_age no_report
bert_sequence_classifier_self_reported_stress_tweet this model classifies stress in social media (Twitter) posts in the self-disclosure category. stressed not-stressed

Example :

...
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_exact_age_reddit", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("class")\
...
sample_text = ["Is it bad for a 19 year old it's been getting worser.",
               "I was about 10. So not quite as young as you but young."]

Results :

+-------------------------------------------------------+-----------------+
|text                                                   |class            |
+-------------------------------------------------------+-----------------+
|Is it bad for a 19 year old its been getting worser.   |[self_report_age]|
|I was about 10. So not quite as young as you but young.|[no_report]      |
+-------------------------------------------------------+-----------------+
  • We are releasing 5 new public health classification models.
model name description predicted entities
bert_sequence_classifier_health_mentions This model can classify public health mentions in social media text figurative_mention other_mention health_mention
classifierdl_health_mentions This model can classify public health mentions in social media text figurative_mention other_mention health_mention
bert_sequence_classifier_vaccine_sentiment This model can extract information from COVID-19 Vaccine-related tweets neutral positive negative
classifierdl_vaccine_sentiment This model can extract information from COVID-19 Vaccine-related tweets neutral positive negative
bert_sequence_classifier_stressor This model can classify source of emotional stress in text. Family_Issues Financial_Problem Health_Fatigue_or_Physical Pain Other School Work Social_Relationships

Example :

...
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_health_mentions", "en", "clinical/models")\
     .setInputCols(["document","token"])\
     .setOutputCol("class")
...

sample_text =["Another uncle of mine had a heart attack and passed away. Will be cremated Saturday I think I ve gone numb again RIP Uncle Mike",
              "I don't wanna fall in love. If I ever did that, I think I'd have a heart attack",
              "Aluminum is a light metal that causes dementia and Alzheimer's disease. You should never put aluminum into your body (including deodorants)."]

Results :

+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|text                                                                                                                                        |result              |
+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------+
|Another uncle of mine had a heart attack and passed away. Will be cremated Saturday I think I ve gone numb again RIP Uncle Mike             |[health_mention]    |
|I don't wanna fall in love. If I ever did that, I think I'd have a heart attack                                                             |[figurative_mention]|
|Aluminum is a light metal that causes dementia and Alzheimer's disease. You should never put aluminum into your body (including deodorants).|[other_mention]     |
+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------+

Pretrained Medication NER Pipeline to Augmented Posology NER Models with Drugbank Dataset

We are releasing a medication NER pretrained pipeline to extract medications in clinical text. It’s an augmented version of posology NER model with Drugbank datasets and can retun all the medications with a single line of code without building a pipeline with models.

  • ner_medication_pipeline: This pretrained pipeline can detect medication entities and label them as DRUG in clinical text.

See Models Hub Page for more details.

Example :

from sparknlp.pretrained import PretrainedPipeline

medication_pipeline = PretrainedPipeline("ner_medication_pipeline", "en", "clinical/models")

text = """The patient was prescribed metformin 1000 MG, and glipizide 2.5 MG. The other patient was given Fragmin 5000 units, Xenaderm to wounds topically b.i.d. and OxyContin 30 mg."""

Results :

|--------------------|-----------|
| chunk              | ner_label |
|--------------------|-----------|
| metformin 1000 MG  | DRUG      |
| glipizide 2.5 MG   | DRUG      |
| Fragmin 5000 units | DRUG      |
| Xenaderm           | DRUG      |
| OxyContin 30 mg    | DRUG      |
|--------------------|-----------|

Pretrained Medication Resolver Pipeline to Extract RxNorm, UMLS, NDC , SNOMED CT Codes and Action/Treatments

We are releasing a medication resolver pipeline to extract medications and and resolve RxNorm, UMLS, NDC, SNOMED CT codes and action/treatments in clinical text. You can get those codes if available with a single line of code without building a pipeline with models.

  • medication_resolver_pipeline: This pretrained pipeline can detect medication entities and resolve codes if available.

Example :

from sparknlp.pretrained import PretrainedPipeline

medication_pipeline = PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models")

text = """The patient was prescribed Mycobutn 150 MG, Salagen 5 MG oral tablet,
The other patient is given Lescol 40 MG and Lidoderm 0.05 MG/MG, triazolam 0.125 MG Oral Tablet, metformin hydrochloride 1000 MG Oral Tablet"""

Results :

|---------------------------------------------|----------------|---------------------|--------------------------------------------|----------|-------------|---------------|---------------|----------|
| ner_chunk                                   |   RxNorm_Chunk | Action              | Treatment                                  | UMLS     | SNOMED_CT   | NDC_Product   | NDC_Package   | entity   |
|---------------------------------------------|----------------|---------------------|--------------------------------------------|----------|-------------|---------------|---------------|----------|
| Mycobutn 150 MG                             |         103899 | Antimiycobacterials | Infection                                  | C0353536 | -           | 00013-5301    | 00013-5301-17 | DRUG     |
| Salagen 5 MG oral tablet                    |        1000915 | Antiglaucomatous    | Cancer                                     | C0361693 | -           | 59212-0705    | 59212-0705-10 | DRUG     |
| Lescol 40 MG                                |         103919 | Hypocholesterolemic | Heterozygous Familial Hypercholesterolemia | C0353573 | -           | 00078-0234    | 00078-0234-05 | DRUG     |
| Lidoderm 0.05 MG/MG                         |        1011705 | Anesthetic          | Pain                                       | C0875706 | -           | 00247-2129    | 00247-2129-30 | DRUG     |
| triazolam 0.125 MG Oral Tablet              |         198317 | -                   | -                                          | C0690642 | 373981005   | 00054-4858    | 00054-4858-25 | DRUG     |
| metformin hydrochloride 1000 MG Oral Tablet |         861004 | -                   | -                                          | C0978482 | 376701008   | 00093-7214    | 00185-0221-01 | DRUG     |
|---------------------------------------------|----------------|---------------------|--------------------------------------------|----------|-------------|---------------|---------------|----------|

New Disease NER Model for Spanish Language

We are releasing a new MedicalBertForTokenClassifier model to extract disease entities from social media text in Spanish.

  • bert_token_classifier_disease_mentions_tweet: This model can extract disease entities in Spanish tweets and label them as ENFERMEDAD (disease).

See Models Hub Page for more details.

Example :

...
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_disease_mentions_tweet", "es", "clinical/models")\
  .setInputCols("token", "sentence")\
  .setOutputCol("label")\
  .setCaseSensitive(True)
...
example_text = """El diagnóstico fueron varios. Principal: Neumonía en el pulmón derecho. Sinusitis de caballo, Faringitis aguda e infección de orina, también elevada. Gripe No. Estuvo hablando conmigo, sin exagerar, mas de media hora, dándome ánimo y fuerza y que sabe, porque ha visto"""

Results :

+---------------------+----------+
|chunk                |ner_label |
+---------------------+----------+
|Neumonía en el pulmón|ENFERMEDAD|
|Sinusitis            |ENFERMEDAD|
|Faringitis aguda     |ENFERMEDAD|
|infección de orina   |ENFERMEDAD|
|Gripe                |ENFERMEDAD|
+---------------------+----------+

5 new Chunk Mapper Models to Convert Clinical Entities to Relevant Medical Terminology (UMLS)

We are releasing 5 new ChunkMapperModel models to map clinical entities with their corresponding UMLS CUI codes.

Mapper Name Source Target
umls_clinical_drugs_mapper Drugs UMLS CUI
umls_clinical_findings_mapper Clinical Findings UMLS CUI
umls_disease_syndrome_mapper Disease and Syndromes UMLS CUI
umls_major_concepts_mapper Clinical Major Concepts UMLS CUI
umls_drug_substance_mapper Drug Substances UMLS CUI

Example :

...
ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
     .setInputCols(["sentence", "token", "embeddings"])\
     .setOutputCol("clinical_ner")

ner_model_converter = NerConverterInternal()\
     .setInputCols("sentence", "token", "clinical_ner")\
     .setOutputCol("ner_chunk")

chunkerMapper = ChunkMapperModel.pretrained("umls_drug_substance_mapper", "en", "clinical/models")\
       .setInputCols(["ner_chunk"])\
       .setOutputCol("mappings")\
       .setRels(["umls_code"])\
       .setLowerCase(True)
...

example_text = """The patient was given  metformin, lenvatinib and lavender 700 ml/ml"""

Results :

+------------------+---------+---------+
|         ner_chunk|ner_label|umls_code|
+------------------+---------+---------+
|         metformin|     DRUG| C0025598|
|        lenvatinib|     DRUG| C2986924|
|lavender 700 ml/ml|     DRUG| C0772360|
+------------------+---------+---------+

5 new Pretrained Resolver Pipelines to Convert Clinical Entities to Relevant Medical Terminology (UMLS)

We now have 5 new resolver PretrainedPipeline to convert clinical entities to their UMLS CUI codes. You just need to feed your text and it will return the corresponding UMLS codes.

Pipeline Name Entity Target
umls_drug_resolver_pipeline Drugs UMLS CUI
umls_clinical_findings_resolver_pipeline Clinical Findings UMLS CUI
umls_disease_syndrome_resolver_pipeline Disease and Syndromes UMLS CUI
umls_major_concepts_resolver_pipeline Clinical Major Concepts UMLS CUI
umls_drug_substance_resolver_pipeline Drug Substances UMLS CUI

Example :

from sparknlp.pretrained import PretrainedPipeline

pipeline= PretrainedPipeline("umls_clinical_findings_resolver_pipeline", "en", "clinical/models")

sample_text = "HTG-induced pancreatitis associated with an acute hepatitis, and obesity"

Results :

+-------------------------+---------+---------+
|chunk                    |ner_label|umls_code|
+-------------------------+---------+---------+
|HTG-induced pancreatitis |PROBLEM  |C1963198 |
|an acute hepatitis       |PROBLEM  |C4750596 |
|obesity                  |PROBLEM  |C1963185 |
+-------------------------+---------+---------+

New Relation Extraction Model to Detect Drug and ADE relations

We are releasing new re_ade_conversational model that can extract relations between DRUG and ADE entities from conversational texts and tag the relations as is_related and not_related.

See Models Hub Page for more details.

Example :

...
re_model = RelationExtractionModel().pretrained("re_ade_conversational", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setRelationPairs(["ade-drug", "drug-ade"])
...

sample_text = "E19.32 day 20 rivaroxaban diary. still residual aches and pains; only had 4 paracetamol today."

Results :

|--------------------------|----------|-------------|---------|-------------|
| chunk1                   | entitiy1 | chunk2      | entity2 | relation    |
|--------------------------|----------|-------------|---------|-------------|
| residual aches and pains | ADE      | rivaroxaban | DRUG    | is_related  |
| residual aches and pains | ADE      | paracetamol | DRUG    | not_related |
|--------------------------|----------|-------------|---------|-------------|

New Module for Converting Annotation Lab (ALAB) Exports Into Suitable Formats for Training New Models

We have a new sparknlp_jsl.alab module with functions for converting ALAB JSON exports into suitable formats for training NER, Assertion and Relation Extraction models.

Example :

from sparknlp_jsl.alab import get_conll_data, get_assertion_data, get_relation_extraction_data

get_conll_data(spark=spark, input_json_path="alab_demo.json", output_name="conll_demo")

assertion_df = get_assertion_data(spark=spark, input_json_path = 'alab_demo.json', assertion_labels = ['ABSENT'], relevant_ner_labels = ['PROBLEM', 'TREATMENT'])

relation_df = get_relation_extraction_data(spark=spark, input_json_path='alab_demo.json')

These functions contain over 10 arguments each which give you all the flexibility you need to convert your annotations to trainable formats. These include parameters controlling tokenization, ground truth selections, negative annotations, negative annotation weights, task exclusions, and many more. To find out how to make best use of these functions, head over to this repository.

Updated De-identification Pretrained Pipelines

We have updated de-identification pretrained pipelines to provide better performance than ever before. This includes an update to the clinical_deidentification pretrained pipeline and a new light-weight version clinical_deidentification_slim.

Example :

from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")
slim_deid_pipeline = PretrainedPipeline("clinical_deidentification_slim", "en", "clinical/models")

sample_text = "Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435"

Results :

Name : <PATIENT>, Record date: <DATE>, <MEDICALRECORD>
Name : [**************], Record date: [********], [****]
Name : ****, Record date: ****, ****
Name : Alexia Mcgill, Record date: 2093-02-19, Y138038

New setBlackList() Parameter in ChunkFilterer() Annotator

We are releasing a new setBlackList() parameter in the ChunkFilterer() annotator. ChunkFilterer() lets through every chunk except those that match the list of phrases or regex rules in the setBlackList() parameter.

Example :

...
chunk_filterer = ChunkFilterer()\
    .setInputCols("sentence","ner_chunk")\
    .setOutputCol("chunk_filtered")\
    .setCriteria("isin")\
    .setBlackList(['severe fever', 'severe cough'])
...

example_text= """Patient with severe fever, severe cough, sore throat, stomach pain, and a headache."""

Results :

+-------------------------------------------------------------------+---------------------------------------+
|ner_chunk                                                          |chunk_filtered                         |
+-------------------------------------------------------------------+---------------------------------------+
|[severe fever, severe cough, sore throat, stomach pain, a headache]|[sore throat, stomach pain, a headache]|
+-------------------------------------------------------------------+---------------------------------------+

New Doc2ChunkInternal() Annotator

We are releasing a Doc2ChunkInternal() annotator. This is a licensed version of the open source Doc2Chunk() annotator. You can now customize the tokenization step within Doc2Chunk(). This will be quite handy when it comes to training custom assertion models.

Example :

...
doc2ChunkInternal = Doc2ChunkInternal()\
.setInputCols("document","token")\
.setStartCol("start")\
.setChunkCol("target")\
.setOutputCol("doc2chunkInternal")

...

df= spark.createDataFrame([
    ["The mass measures 4 x 3.5cm in size more.",8,"size"],
    ["The mass measures 4 x 3.5cm in size more.",9,"size"]]).toDF("sentence","start", "target")

Results :

+-----------------------------------------+-----+------+--------------------------------------------------------+-----------------------------------------------------------+
|                                 sentence|start|target|                                       doc2chunkInternal|                                                  doc2chunk|
+-----------------------------------------+-----+------+--------------------------------------------------------+-----------------------------------------------------------+
|The mass measures 4 x 3.5cm in size more.|    8|  size|[{chunk, 31, 34, size, {sentence -> 0, chunk -> 0}, []}]|[{chunk, 31, 34, size, {sentence -> 0, chunk -> 0}, []}]   |
|The mass measures 4 x 3.5cm in size more.|    9|  size|                                                      []|[{chunk, 31, 34, size, {sentence -> 0, chunk -> 0}, []}]   |
+-----------------------------------------+-----+------+--------------------------------------------------------+-----------------------------------------------------------+

Listing Pretrained Clinical Models and Pipelines with One-Liner

We have new returnPrivatePipelines() and returnPrivateModels() features under InternalResourceDownloader package to return licensed models and pretrained pipelines as a list.

Example :

from sparknlp_jsl.pretrained import InternalResourceDownloader

# pipelines = InternalResourceDownloader.returnPrivatePipelines()
assertion_models = InternalResourceDownloader.returnPrivateModels("AssertionDLModel")

Results :

[['assertion_ml', 'en', '2.0.2'],
 ['assertion_dl', 'en', '2.0.2'],
 ['assertion_dl_healthcare', 'en', '2.7.2'],
 ['assertion_dl_biobert', 'en', '2.7.2'],
 ['assertion_dl', 'en', '2.7.2'],
 ['assertion_dl_radiology', 'en', '2.7.4'],
 ['assertion_jsl_large', 'en', '3.1.2'],
 ['assertion_jsl', 'en', '3.1.2'],
 ['assertion_dl_scope_L10R10', 'en', '3.4.2'],
 ['assertion_dl_biobert_scope_L10R10', 'en', '3.4.2'],
 ['assertion_oncology_treatment_binary_wip', 'en', '3.5.0']]

Bug Fixes

  • ZeroShotRelationExtractionModel: Fixed the issue that blocks the use of this annotator.
  • AnnotationToolJsonReader: Fixed the issue with custom pipeline usage in this annotator.
  • RelationExtractionApproach: Fixed issues related to training logs and inference.

New and Updated Notebooks

List of Recently Updated and Added Models and Pretrained Pipelines

  • bert_token_classifier_ner_anatem
  • bert_token_classifier_ner_bc2gm_gene
  • bert_token_classifier_ner_bc4chemd_chemicals
  • bert_token_classifier_ner_bc5cdr_chemicals
  • bert_token_classifier_ner_bc5cdr_disease
  • bert_token_classifier_ner_jnlpba_cellular
  • bert_token_classifier_ner_linnaeus_species
  • bert_token_classifier_ner_ncbi_disease
  • bert_token_classifier_ner_species
  • bert_sequence_classifier_ade_augmented
  • bert_sequence_classifier_health_mandates_stance_tweet
  • bert_sequence_classifier_health_mandates_premise_tweet
  • bert_sequence_classifier_treatement_changes_sentiment_tweet
  • bert_sequence_classifier_drug_reviews_webmd
  • bert_sequence_classifier_self_reported_age_tweet
  • bert_sequence_classifier_self_reported_symptoms_tweet => es
  • bert_sequence_classifier_self_reported_vaccine_status_tweet
  • bert_sequence_classifier_self_reported_partner_violence_tweet
  • bert_sequence_classifier_exact_age_reddit
  • bert_sequence_classifier_self_reported_stress_tweet
  • bert_token_classifier_disease_mentions_tweet => es
  • bert_token_classifier_ner_ade_tweet_binary
  • bert_token_classifier_ner_pathogen
  • clinical_deidentification
  • clinical_deidentification_slim
  • umls_clinical_drugs_mapper
  • umls_clinical_findings_mapper
  • umls_disease_syndrome_mapper
  • umls_major_concepts_mapper
  • umls_drug_substance_mapper
  • umls_drug_resolver_pipeline
  • umls_clinical_findings_resolver_pipeline
  • umls_disease_syndrome_resolver_pipeline
  • umls_major_concepts_resolver_pipeline
  • umls_drug_substance_resolver_pipeline
  • classifierdl_health_mentions
  • bert_sequence_classifier_health_mentions
  • ner_medication_pipeline
  • bert_sequence_classifier_vaccine_sentiment
  • classifierdl_vaccine_sentiment
  • bert_sequence_classifier_stressor
  • re_ade_conversational
  • medication_resolver_pipeline

Versions

Last updated