Spark NLP for Healthcare Release Notes 3.4.1

 

3.4.1

We are glad to announce that Spark NLP Healthcare 3.4.1 has been released!

Highlights

  • Brand new Spanish deidentification NER models
  • Brand new Spanish deidentification pretrained pipeline
  • New clinical NER model to detect supplements
  • New RxNorm sentence entity resolver model
  • New EntityChunkEmbeddings annotator
  • New MedicalBertForSequenceClassification annotator
  • New MedicalDistilBertForSequenceClassification annotator
  • New MedicalDistilBertForSequenceClassification and MedicalBertForSequenceClassification models
  • Redesign of the ContextualParserApproach annotator
  • getClasses method in RelationExtractionModel and RelationExtractionDLModel annotators
  • Label customization feature for RelationExtractionModel and RelationExtractionDL models
  • useBestModel parameter in MedicalNerApproach annotator
  • Early stopping feature in MedicalNerApproach annotator
  • Multi-Language support for faker and regex lists of Deidentification annotator
  • Spark 3.2.0 compatibility for the entire library
  • Saving visualization feature in spark-nlp-display library
  • Deploying a custom Spark NLP image (for opensource, healthcare, and Spark OCR) to an enterprise version of Kubernetes: OpenShift
  • New speed benchmarks table on databricks
  • New & Updated Notebooks
  • List of recently updated or added models

Brand New Spanish Deidentification NER Models

We trained two new NER models to find PHI data (protected health information) that may need to be deidentified in Spanish. ner_deid_generic and ner_deid_subentity models are trained with in-house annotations. Both also are available for using Roberta Spanish Clinical Embeddings and sciwiki 300d.

  • ner_deid_generic : Detects 7 PHI entities in Spanish (DATE, NAME, LOCATION, PROFESSION, CONTACT, AGE, ID).

  • ner_deid_subentity : Detects 13 PHI sub-entities in Spanish (PATIENT, HOSPITAL, DATE, ORGANIZATION, E-MAIL, USERNAME, LOCATION, ZIP, MEDICALRECORD, PROFESSION, PHONE, DOCTOR, AGE).

Example :

...
embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

deid_ner = MedicalNerModel.pretrained("ner_deid_generic", "es", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

deid_sub_entity_ner = MedicalNerModel.pretrained("ner_deid_subentity", "es", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_sub_entity")
...

text = """Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020
y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.."""
result = model.transform(spark.createDataFrame([[text]], ["text"]))

Results :

| chunk              | ner_deid_generic_chunk | ner_deid_subentity_chunk |
|--------------------|------------------------|--------------------------|
| Antonio Pérez Juan | NAME                   | PATIENT                  |
| Cádiz              | LOCATION               | LOCATION                 |
| España             | LOCATION               | LOCATION                 |
| 14/03/2022         | DATE                   | DATE                     |
| Clínica San Carlos | LOCATION               | HOSPITAL                 |

Brand New Spanish Deidentification Pretrained Pipeline

We developed a clinical deidentification pretrained pipeline that can be used to deidentify PHI information from Spanish medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake or obfuscate the following entities: AGE, DATE, PROFESSION, E-MAIL, USERNAME, LOCATION, DOCTOR, HOSPITAL, PATIENT, URL, IP, MEDICALRECORD, IDNUM, ORGANIZATION, PHONE, ZIP, ACCOUNT, SSN, PLATE, SEX and IPADDR.

from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models")

sample_text = """Datos del paciente. Nombre:  Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910."""

result = deid_pipe.annotate(text)

print("\n".join(result['masked']))
print("\n".join(result['masked_with_chars']))
print("\n".join(result['masked_fixed_length_chars']))
print("\n".join(result['obfuscated']))

Results:

Masked with entity labels
------------------------------
Datos del paciente. Nombre:  <PATIENT> . Apellidos: <PATIENT>. NHC: <SSN>. NASS: <SSN> <SSN>

Masked with chars
------------------------------
Datos del paciente. Nombre:  [**] . Apellidos: [*************]. NHC: [*****]. NASS: [**] [******]

Masked with fixed length chars
------------------------------
Datos del paciente. Nombre:  **** . Apellidos: ****. NHC: ****. NASS: **** ****

Obfuscated
------------------------------
Datos del paciente. Nombre:  Sr. Lerma . Apellidos: Aristides Gonzalez Gelabert. NHC: BBBBBBBBQR648597. NASS: 041010000011 RZRM020101906017 04.

New Clinical NER Model to Detect Supplements

We are releasing ner_supplement_clinical model that can extract benefits of using drugs for certain conditions. It can label detected entities as CONDITION and BENEFIT. Also this model is trained on the dataset that is released by Spacy in their HealthSea product. Here is the benchmark comparison of both versions:

Entity Spark NLP Spacy-HealthSea
BENEFIT 0.8729641 0.8330684
CONDITION 0.8339274 0.8333333

Example :

...
clinical_ner = MedicalNerModel.pretrained("ner_supplement_clinical", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner_tags")
...

results = ner_model.transform(spark.createDataFrame([["Excellent!. The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth."]], ["text"]))

Results :

+------------------------+---------------+
| chunk                  | ner_label     |
+------------------------+---------------+
| nervousness            | CONDITION     |
| night sleep improves   | BENEFIT       |
| hair                   | BENEFIT       |
| nail                   | BENEFIT       |
+------------------------+---------------+

New RxNorm Sentence Entity Resolver Model

sbiobertresolve_rxnorm_augmented_re : This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes without specifying the relations between the entities (relations are calculated on the fly inside the annotator) using sbiobert_base_cased_mli Sentence Bert Embeddings (EntityChunkEmbeddings).

Example :

...
rxnorm_resolver = SentenceEntityResolverModel\
      .pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models")\
      .setInputCols(["entity_chunk_embeddings"])\
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")
...

New EntityChunkEmbeddings Annotator

We have a new EntityChunkEmbeddings annotator to compute a weighted average vector representing entity-related vectors. The model’s input usually consists of chunks of recognized named entities produced by MedicalNerModel. We can specify relations between the entities by the setTargetEntities() parameter, and the internal Relation Extraction model finds related entities and creates a chunk. Embedding for the chunk is calculated according to the weights specified in the setEntityWeights() parameter.

For instance, the chunk warfarin sodium 5 MG Oral Tablet has DRUG, STRENGTH, ROUTE, and FORM entity types. Since DRUG label is the most prominent label for resolver models, now we can assign weight to prioritize DRUG label (i.e {"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2} as shown below). In other words, embeddings of these labels are multipled by the assigned weights such as DRUG by 0.8.

For more details and examples, please check Sentence Entity Resolvers with EntityChunkEmbeddings Notebook in the Spark NLP workshop repo.

Example :

...

drug_chunk_embeddings = EntityChunkEmbeddings()\
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("drug_chunk_embeddings")\
    .setMaxSyntacticDistance(3)\
    .setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})\
    .setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2})

rxnorm_resolver = SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models")\
    .setInputCols(["drug_chunk_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

rxnorm_weighted_pipeline_re = Pipeline(
    stages = [
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        drug_chunk_embeddings,
        rxnorm_resolver])

sampleText = ["The patient was given metformin 500 mg, 2.5 mg of coumadin and then ibuprofen.",
              "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG"]

data_df = spark.createDataFrame(sample_df)
results = rxnorm_weighted_pipeline_re.fit(data_df).transform(data_df)

The internal relation extraction creates the chunks here, and the embedding is computed according to the weights.

Results :

+-----+----------------+--------------------------+--------------------------------------------------+
|index|           chunk|rxnorm_code_weighted_08_re|                                      Concept_Name|
+-----+----------------+--------------------------+--------------------------------------------------+
|    0|metformin 500 mg|                    860974|metformin hydrochloride 500 MG:::metformin 500 ...|
|    0| 2.5 mg coumadin|                    855313|warfarin sodium 2.5 MG [Coumadin]:::warfarin so...|
|    0|       ibuprofen|                   1747293|ibuprofen Injection:::ibuprofen Pill:::ibuprofe...|
|    1|metformin 400 mg|                    332809|metformin 400 MG:::metformin 250 MG Oral Tablet...|
|    1|   coumadin 5 mg|                    855333|warfarin sodium 5 MG [Coumadin]:::warfarin sodi...|
|    1|        coumadin|                    202421|Coumadin:::warfarin sodium 2 MG/ML Injectable S...|
|    1|amlodipine 10 MG|                    308135|amlodipine 10 MG Oral Tablet:::amlodipine 10 MG...|
+-----+----------------+--------------------------+--------------------------------------------------+

New MedicalBertForSequenceClassification Annotator

We developed a new annotator called MedicalBertForSequenceClassification. It can load BERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

New MedicalDistilBertForSequenceClassification Annotator

We developed a new annotator called MedicalDistilBertForSequenceClassification. It can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

New MedicalDistilBertForSequenceClassification and MedicalBertForSequenceClassification Models

We are releasing a new MedicalDistilBertForSequenceClassification model and three new MedicalBertForSequenceClassification models.

  • bert_sequence_classifier_ade_biobert: a classifier for detecting if a sentence is talking about a possible ADE (TRUE, FALSE)

  • bert_sequence_classifier_gender_biobert: a classifier for detecting the gender of the main subject of the sentence (MALE, FEMALE, UNKNOWN)

  • bert_sequence_classifier_pico_biobert: a classifier for detecting the class of a sentence according to PICO framework (CONCLUSIONS, DESIGN_SETTING,INTERVENTION, PARTICIPANTS, FINDINGS, MEASUREMENTS, AIMS)

Example :

...
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_pico", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("class")
...

sample_text = "To compare the results of recording enamel opacities using the TF and modified DDE indices."

result = sequence_clf_model.transform(spark.createDataFrame([[sample_text]]).toDF("text"))

Results :

+-------------------------------------------------------------------------------------------+-----+
|text                                                                                       |label|
+-------------------------------------------------------------------------------------------+-----+
|To compare the results of recording enamel opacities using the TF and modified DDE indices.|AIMS |
+-------------------------------------------------------------------------------------------+-----+
  • distilbert_sequence_classifier_ade : This model is a DistilBertForSequenceClassification model for classifying clinical texts whether they contain ADE (TRUE, FALSE).

Example :

...
sequenceClassifier = MedicalDistilBertForSequenceClassification\
      .pretrained('distilbert_sequence_classifier_ade', 'en', 'clinical/models') \
      .setInputCols(['token', 'document']) \
      .setOutputCol('class')
...

sample_text = "I felt a bit drowsy and had blurred vision after taking Aspirin."

result = sequence_clf_model.transform(spark.createDataFrame([[sample_text]]).toDF("text"))

Results :

+----------------------------------------------------------------+-----+
|text                                                            |label|
+----------------------------------------------------------------+-----+
|I felt a bit drowsy and had blurred vision after taking Aspirin.| True|
+----------------------------------------------------------------+-----+

Redesign of the ContextualParserApproach Annotator

  • We’ve dropped the annotator’s contextMatch parameter and removed the need for a context field when feeding a JSON configuration file to the annotator. Context information can now be fully defined using the prefix, suffix and contextLength fields in the JSON configuration file.
  • We’ve also fixed issues with the contextException field in the JSON configuration file - it was mismatching values in documents with several sentences and ignoring exceptions situated to the right of a word/token.
  • The ruleScope field in the JSON configuration file can now be set to document instead of sentence. This allows you to match multi-word entities like “New York” or “Salt Lake City”. You can do this by setting "ruleScope" : "document" in the JSON configuration file and feeding a dictionary (csv or tsv) to the annotator with its setDictionary parameter. These changes also mean that we’ve dropped the updateTokenizer parameter since the new capabilities of ruleScope improve the user experience for matching multi-word entities.
  • You can now feed in a dictionary in your chosen format - either vertical or horizontal. You can set that with the following parameter: setDictionary("dictionary.csv", options={"orientation":"vertical"})
  • Lastly, there was an improvement made to the confidence value calculation process to better measure successful hits.

For more explanation and examples, please check this Contextual Parser medium article and Contextual Parser Notebook.

getClasses Method in RelationExtractionModel and RelationExtractionDLModel Annotators

Now you can use getClasses() method for checking the relation labels of RE models (RelationExtractionModel and RelationExtractionDLModel) like MedicalNerModel().

Example :

clinical_re_Model = RelationExtractionModel()\
    .pretrained("re_temporal_events_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\

clinical_re_Model.getClasses()

Output :

['OVERLAP', 'BEFORE', 'AFTER']

Label Customization Feature for RelationExtractionModel and RelationExtractionDL Models

We are releasing label customization feature for Relation Extraction and Relation Extraction DL models by using .setCustomLabels() parameter.

Example :

...
reModel = RelationExtractionModel.pretrained("re_ade_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade, ade-drug"])\
    .setCustomLabels({"1": "is_related", "0": "not_related"})

redl_model = RelationExtractionDLModel.pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setOutputCol("relations")\
    .setCustomLabels({"1": "is_related", "0": "not_related"})
...

sample_text = "I experienced fatigue and muscle cramps after taking Lipitor but no more adverse after passing Zocor."
result = model.transform(spark.createDataFrame([[sample_text]]).toDF('text'))

Results :

+-----------+-------+-------------+-------+-------+----------+
|   relation|entity1|       chunk1|entity2| chunk2|confidence|
+-----------+-------+-------------+-------+-------+----------+
| is_related|    ADE|      fatigue|   DRUG|Lipitor| 0.9999825|
|not_related|    ADE|      fatigue|   DRUG|  Zocor| 0.9960077|
| is_related|    ADE|muscle cramps|   DRUG|Lipitor|       1.0|
|not_related|    ADE|muscle cramps|   DRUG|  Zocor|   0.94971|
+-----------+-------+-------------+-------+-------+----------+

useBestModel Parameter in MedicalNerApproach Annotator

Introducing useBestModel param in MedicalNerApproach annotator. This param preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the training.

Example :

med_ner = MedicalNerApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    ...
    ...
    .setUseBestModel(True)\

Early Stopping Feature in MedicalNerApproach Annotator

Introducing earlyStopping feature for MedicalNerApproach(). You can stop training at the point when the perforfmance on test/validation dataset starts to degrage. Two params are added to MedicalNerApproach() in order to use this feature:

  • earlyStoppingCriterion : (float) This is used set the minimal improvement of the test metric to terminate training. The metric monitored is the same as the metrics used in useBestModel (macro F1 when using test/validation set, loss otherwise). Default is 0 which means no early stopping is applied.

  • earlyStoppingPatience: (int), the number of epoch without improvement which will be tolerated. Default is 0, which means that early stopping will occur at the first time when performance in the current epoch is no better than in the previous epoch.

Example :

med_ner = MedicalNerApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    ...
    ...
    .setTestDataset(test_data_parquet_path)\
    .setEarlyStoppingCriterion(0.01)\
    .setEarlyStoppingPatience(3)\

Multi-Language Support for Faker and Regex Lists of Deidentification Annotator

We have a new .setLanguage() parameter in order to use internal Faker and Regex list for multi-language texts. When you are working with German and Spanish texts for a Deidentification, you can set this parameter to de for German and es for Spanish. Default value of this parameter is en.

Example :

deid_obfuscated = DeIdentification()\
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("obfuscated") \
      .setMode("obfuscate")\
      .setLanguage('de')\
      .setObfuscateRefSource("faker")\

Spark 3.2.0 Compatibility for the Entire Library

Now we can use the Spark 3.2.0 version for Spark NLP for Healthcare by setting spark32=True in sparknlp_jsl.start() function.

! pip install --ignore-installed -q pyspark==3.2.0
import sparknlp_jsl

spark = sparknlp_jsl.start(SECRET, spark32=True)

Saving Visualization Feature in spark-nlp-display Library

We have a new save_path parameter in spark-nlp-display library for saving any visualization results in Spark NLP.

Example :

from sparknlp_display import NerVisualizer

visualiser = NerVisualizer()

visualiser.display(light_result[0], label_col='ner_chunk', document_col='document', save_path="display_result.html")

Deploying a Custom Spark NLP Image (for opensource, healthcare, and Spark OCR) to an Enterprise Version of Kubernetes: OpenShift

Spark NLP for opensource, healthcare, and SPARK OCR is now available for Openshift - enterprise version of Kubernetes. For deployment, please refer to:

Github Link: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/platforms/openshift

Youtube: https://www.youtube.com/watch?v=FBes-6ylFrM&ab_channel=JohnSnowLabs

New Speed Benchmarks Table on Databricks

We prepared a speed benchmark table by running a clinical BERT For Token Classification model pipeline on various number of repartitioning and writing the results to parquet or delta formats. You can find the details here : Clinical Bert For Token Classification Benchmark Experiment.

New & Updated Notebooks

To see more, please check : Spark NLP Healthcare Workshop Repo

List of Recently Updated or Added Models

  • bert_sequence_classifier_ade_en
  • bert_sequence_classifier_gender_biobert_en
  • bert_sequence_classifier_pico_biobert_en
  • distilbert_sequence_classifier_ade_en
  • bert_token_classifier_ner_supplement_en
  • deid_pipeline_es
  • ner_deid_generic_es
  • ner_deid_generic_roberta_es
  • ner_deid_subentity_es
  • ner_deid_subentity_roberta_es
  • ner_nature_nero_clinical_en
  • ner_supplement_clinical_en
  • sbiobertresolve_clinical_abbreviation_acronym_en
  • sbiobertresolve_rxnorm_augmented_re

For all Spark NLP for healthcare models, please check : Models Hub Page

Versions

Last updated