Spark NLP for Healthcare Release Notes 4.2.1

 

4.2.1

Highlights

  • Creating new chunks with NerConverterInternal by merging chunks by skipping stopwords in between.
  • Adding relation direction to RelationExtraction models to make the relations direction-aware.
  • Using proper regional date formats in the DeIdentification module.
  • Being able to play with different date formats in DateNormalizer output.
  • New Replacer annotator to replace chunks with their normalized versions (DateNormalizer) in documents.
  • New ModelTracer helper class to generate and add model UID and timestamps of the stages in a pipeline
  • Added entity source and labels to the AssertionFilterer metadata
  • New chunk mapper and sentence entity resolver models and a pipeline for CVX
  • Updated clinical NER models with new labels
  • New Certification Training notebooks for the johnsnowlabs library
  • New and updated notebooks
  • 6 new clinical models and pipelines added & updated in total

Creating New Chunks with NerConverterInternal by Merging Chunks by Skipping Stopwords in Between

NerConverterInternal’s new setIgnoreStopWords parameter merges chunks that have the same label when only stopwords or punctuation lie between them.

txt = """ The qualified manufacturers for this starting material are:
Alpha Chemicals Pvt LTD
17, R K Industry House, Walbhat Rd, Goregaon – 400063
Mumbai, Maharashtra, India
Beta Chemical Co., Ltd
Huan Cheng Xi Lu 3111hao Hai Guan Da Ting
Shanghai, China """

Example for default:

NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid"])\
    .setOutputCol("chunk_deid")\
    .setGreedyMode(True)\
    .setWhiteList(['LOCATION'])

Results:

| chunks                   | entities | begin | end |
|:-------------------------|:---------|------:|----:|
| R K Industry House       | LOCATION |    90 | 107 |
| Walbhat                  | LOCATION |   110 | 116 |
| Mumbai                   | LOCATION |   141 | 146 |
| Maharashtra              | LOCATION |   149 | 159 |
| India                    | LOCATION |   162 | 166 |
| Huan Cheng Xi Lu 3111hao | LOCATION |   191 | 214 |
| Shanghai                 | LOCATION |   234 | 241 |
| China                    | LOCATION |   244 | 248 |

Example for setting setIgnoreStopWords parameter:

NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid"])\
    .setOutputCol("chunk_deid")\
    .setGreedyMode(True)\
    .setWhiteList(['LOCATION'])\
    .setIgnoreStopWords(['\n', ',', "and", 'or', '.'])

Results:

| chunks                     | entities | begin | end |
|:---------------------------|:---------|------:|----:|
| R K Industry House Walbhat | LOCATION |    90 | 116 |
| Mumbai Maharashtra India   | LOCATION |   141 | 166 |
| Huan Cheng Xi Lu 3111hao   | LOCATION |   191 | 214 |
| Shanghai China             | LOCATION |   234 | 248 |
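
The merging idea can be sketched in plain Python. This is only an illustration of the behaviour shown above, not the NerConverterInternal implementation; the merge_chunks and locate helpers and the shortened sample text are assumptions made for this sketch.

```python
import re


def merge_chunks(text, chunks, ignore):
    """Merge neighbouring chunks that share a label when the text between
    them contains only ignored tokens.

    chunks: (begin, end, label) tuples, end inclusive, sorted by begin.
    """
    ignore = set(ignore)
    merged = []  # each item: [begin, end, label, list_of_chunk_texts]
    for begin, end, label in chunks:
        piece = text[begin:end + 1]
        if merged:
            prev = merged[-1]
            # Split the gap into newlines, words and punctuation marks.
            gap_tokens = re.findall(r"\n|\w+|[^\w\s]", text[prev[1] + 1:begin])
            # Note: a gap made only of spaces has no tokens and also merges.
            if prev[2] == label and all(t in ignore for t in gap_tokens):
                prev[1] = end
                prev[3].append(piece)
                continue
        merged.append([begin, end, label, [piece]])
    return [(" ".join(parts), label, b, e) for b, e, label, parts in merged]


def locate(text, phrase, label):
    """Hypothetical helper: build a (begin, end, label) tuple for a phrase."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase) - 1, label)


text = "Mumbai, Maharashtra, India\nBeta Chemical Co., Ltd\nShanghai, China"
chunks = [locate(text, p, "LOCATION")
          for p in ["Mumbai", "Maharashtra", "India", "Shanghai", "China"]]

for row in merge_chunks(text, chunks, ignore=["\n", ",", "and", "or", "."]):
    print(row)
```

In this sketch "India" and "Shanghai" stay separate because the company name between them is not in the ignore list.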

Adding Relation Direction to RelationExtraction Models to Make the Relations Direction-aware

The new setRelationDirectionCol parameter is used during training to point to a separate column that specifies relationship directions. The column should contain one of the following values:

  • rightwards: The first entity in the text is the first argument of the relation, and the second entity in the text is the second argument. In other words, the relation arguments are ordered left to right in the text.
  • leftwards: The first entity in the text is the second argument of the relation (and the second entity in the text is the first argument).
  • both: Order doesn’t matter (relation is symmetric).

In our test cases, accuracy increased significantly when we added only the setRelationDirectionCol parameter and kept the other parameters as they were.

Example:

+--------------------+---------+---------+--------------------+----+----------+
|              chunk1|   label1|   label2|              chunk2| rel|   rel_dir|
+--------------------+---------+---------+--------------------+----+----------+
|expected long ter...|treatment|treatment|         a picc line|   O|      both|
|    light-headedness|  problem|  problem|         diaphoresis| PIP|rightwards|
| po pain medications|treatment|  problem|            his pain|TrAP| leftwards|
|bilateral pleural...|  problem|  problem|increased work of...| PIP|rightwards|
|    her urine output|     test|  problem|           decreased|TeRP|rightwards|
|his psychiatric i...|  problem|  problem|his neurologic in...| PIP|rightwards|
|   white blood cells|     test|     test|     red blood cells|   O|      both|
|            chloride|     test|     test|                 bun|   O|      both|
|     further work-up|     test|  problem|his neurologic co...|TeCP|rightwards|
|         four liters|treatment|     test|      blood pressure|   O|      both|
+--------------------+---------+---------+--------------------+----+----------+

re_approach_with_dir = RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("rel")\
    ...
    .setRelationDirectionCol("rel_dir")
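
When preparing training data, the direction value can be derived from where each relation argument starts in the text. The following is a minimal sketch of the three definitions above; the relation_direction function and its parameter names are hypothetical and not part of the library.

```python
def relation_direction(arg1_begin, arg2_begin, symmetric=False):
    """Return the direction label for one training row.

    arg1_begin / arg2_begin: character offsets of the first and second
    relation arguments (hypothetical inputs for this sketch).
    symmetric: True when the order of the arguments does not matter.
    """
    if symmetric:
        return "both"
    # The first argument appears first in the text: left-to-right order.
    if arg1_begin < arg2_begin:
        return "rightwards"
    # The first argument appears after the second one in the text.
    return "leftwards"


print(relation_direction(10, 42))                  # arguments in text order
print(relation_direction(57, 12))                  # arguments reversed in the text
print(relation_direction(5, 30, symmetric=True))   # order is irrelevant
```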

Using Proper Regional Date Formats in DeIdentification Module

With setRegion, you can specify the regional format of the date entities that will be shifted to a new date or converted to a year.

de_identification = DeIdentification() \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei_id") \
    .setRegion('us') # 'eu' for Europe
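
To see why the region matters for date shifting, consider an ambiguous date such as 02/03/2021. The sketch below is not the DeIdentification implementation; it assumes that 'us' means MM/DD/YYYY and 'eu' means DD/MM/YYYY, and the shift_date helper is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed mapping from region code to date layout (an assumption of this sketch).
REGION_FORMATS = {"us": "%m/%d/%Y", "eu": "%d/%m/%Y"}


def shift_date(date_text, days, region):
    """Parse a date with the regional layout, shift it, and print it back
    in the same layout."""
    fmt = REGION_FORMATS[region]
    shifted = datetime.strptime(date_text, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


# The same string is a different calendar day in each region,
# so the shifted result differs as well.
print(shift_date("02/03/2021", 10, "us"))
print(shift_date("02/03/2021", 10, "eu"))
```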

Being Able to Play With Different Date Formats in DateNormalizer Output

The new setOutputDateformat parameter customizes the format of the normalized dates in the DateNormalizer output. It accepts two values: us for MM/DD/YYYY and eu for DD/MM/YYYY.

Example:

date_normalizer_us = DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date_us')\
    .setOutputDateformat('us')

date_normalizer_eu = DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date_eu')\
    .setOutputDateformat('eu')

sample_text = ['She was last seen in the clinic on Jan 30, 2018, by Dr. Y.',
               'Chris Brown was discharged on 12Mar2021',
               'We reviewed the pathology obtained on 13.04.1999.']

Results:

+----------------------------------------------------------+------------+------------------+------------------+
|text                                                      |date_chunk  |normalized_date_eu|normalized_date_us|
+----------------------------------------------------------+------------+------------------+------------------+
|She was last seen in the clinic on Jan 30, 2018, by Dr. Y.|Jan 30, 2018|30/01/2018        |01/30/2018        |
|Chris Brown was discharged on 12Mar2021                   |12Mar2021   |12/03/2021        |03/12/2021        |
|We reviewed the pathology obtained on 13.04.1999.         |13.04.1999  |13/04/1999        |04/13/1999        |
+----------------------------------------------------------+------------+------------------+------------------+
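
The two output layouts can be reproduced with the standard library for the three sample chunks. This is a rough sketch rather than the way DateNormalizer parses dates; the list of input patterns and the normalize_date helper are assumptions.

```python
from datetime import datetime

# Input patterns that cover the three sample chunks (assumed for this sketch).
INPUT_FORMATS = ["%b %d, %Y", "%d%b%Y", "%d.%m.%Y"]
OUTPUT_FORMATS = {"us": "%m/%d/%Y", "eu": "%d/%m/%Y"}


def normalize_date(chunk, output_format):
    """Try each known input pattern and render the date in the requested layout."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(chunk, fmt)
        except ValueError:
            continue  # pattern did not match, try the next one
        return parsed.strftime(OUTPUT_FORMATS[output_format])
    return chunk  # leave unparsed chunks unchanged


for chunk in ["Jan 30, 2018", "12Mar2021", "13.04.1999"]:
    print(chunk, normalize_date(chunk, "eu"), normalize_date(chunk, "us"))
```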

New Replacer Annotator To Replace Chunks With Their Normalized Versions (DateNormalizer) In Documents

The new Replacer annotator returns the original document with each original chunk replaced by its normalized version.

Example:

date_normalizer = DateNormalizer()\
    .setInputCols('date_chunk')\
    .setOutputCol('normalized_date')

replacer = Replacer()\
    .setInputCols(["normalized_date","document"])\
    .setOutputCol("replaced_document")

sample_text = ['She was last seen in the clinic on Jan 30, 2018, by Dr. Y.',
               'Chris Brown was discharged on 12Mar2021',
               'We reviewed the pathology obtained on 13.04.1999.']

Results:

+----------------------------------------------------------+---------------+--------------------------------------------------------+
|text                                                      |normalized_date|replaced_document                                       |
+----------------------------------------------------------+---------------+--------------------------------------------------------+
|She was last seen in the clinic on Jan 30, 2018, by Dr. Y.|2018/01/30     |She was last seen in the clinic on 2018/01/30, by Dr. Y.|
|Chris Brown was discharged on 12Mar2021                   |2021/03/12     |Chris Brown was discharged on 2021/03/12                |
|We reviewed the pathology obtained on 13.04.1999.         |1999/04/13     |We reviewed the pathology obtained on 1999/04/13.       |
+----------------------------------------------------------+---------------+--------------------------------------------------------+
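
Conceptually, the replacement is a span substitution driven by the chunk offsets. A minimal sketch, assuming inclusive end offsets and a hypothetical replace_chunks helper (this is not the Replacer source code):

```python
def replace_chunks(text, replacements):
    """Replace character spans in text.

    replacements: (begin, end, new_text) tuples with inclusive end offsets.
    Spans are applied from right to left so that earlier offsets stay valid
    even when the new text has a different length.
    """
    out = text
    for begin, end, new_text in sorted(replacements, reverse=True):
        out = out[:begin] + new_text + out[end + 1:]
    return out


text = "She was last seen in the clinic on Jan 30, 2018, by Dr. Y."
chunk = "Jan 30, 2018"
begin = text.index(chunk)

print(replace_chunks(text, [(begin, begin + len(chunk) - 1, "2018/01/30")]))
```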

New ModelTracer Helper Class to Generate and Add Model UID and Timestamps of the Stages in a Pipeline

ModelTracer lets you track the UID and timestamp of each stage in a pipeline.

Example:

from sparknlp_jsl.modelTracer import ModelTracer
...
pipeline = Pipeline(
    stages=[
        documentAssembler,
        tokenizer,
        tokenClassifier,
        ])

df = pipeline.fit(data).transform(data)

result = ModelTracer().addUidCols(pipeline = pipeline, df = df)
result.show(truncate=False)

Results:

+----+--------+-----+---+----------------------------------------------------------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------+
|text|document|token|ner|documentassembler_model_uid                                           |tokenizer_model_uid                                           |bert_for_token_classification_model_uid                                           |
+----+--------+-----+---+----------------------------------------------------------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------+
|... |...     |...  |...|{uid -> DocumentAssembler_a666efd1d789, timestamp -> 2022-10-21_11:34}|{uid -> Tokenizer_01fbad79f069, timestamp -> 2022-10-21_11:34}|{uid -> BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89, timestamp -> 2022-10-21_11:34}|
+----+--------+-----+---+----------------------------------------------------------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------+
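
The shape of the added columns can be imitated in plain Python. The sketch below only mirrors the output shown above and is not how ModelTracer works internally; trace_stages is a hypothetical helper, and the stages are stand-in objects that expose a uid attribute.

```python
from datetime import datetime
from types import SimpleNamespace


def trace_stages(stages, now=None):
    """Build one {uid, timestamp} record per stage, keyed by a column name
    derived from the stage uid."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H:%M")
    traced = {}
    for stage in stages:
        # "Tokenizer_01fbad79f069" -> "tokenizer"
        prefix = stage.uid.rsplit("_", 1)[0].lower()
        traced[f"{prefix}_model_uid"] = {"uid": stage.uid, "timestamp": stamp}
    return traced


# Stand-ins for fitted pipeline stages (only the uid attribute is used).
stages = [
    SimpleNamespace(uid="DocumentAssembler_a666efd1d789"),
    SimpleNamespace(uid="Tokenizer_01fbad79f069"),
    SimpleNamespace(uid="BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89"),
]

for column, record in trace_stages(stages).items():
    print(column, record)
```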

Added Entity Source and Labels to the AssertionFilterer Metadata

Now the AssertionFilterer annotator returns the entity source and assertion labels in the metadata.

Example:

assertionFilterer = AssertionFilterer() \
    .setInputCols(["sentence","ner_chunk","assertion"]) \
    .setOutputCol("filtered") \
    .setCriteria("assertion") \
    .setWhiteList(["Absent"])

text = "Patient has a headache for the last 2 weeks, no alopecia noted."

Results:

# before v4.2.1
+-----------------------------------------------------------------------------------------------------+
|filtered                                                                                             |
+-----------------------------------------------------------------------------------------------------+
|[{chunk, 48, 55, alopecia, {entity -> PROBLEM, sentence -> 0, chunk -> 1, confidence -> 0.9988}, []}]|
+-----------------------------------------------------------------------------------------------------+

# v4.2.1
+---------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered                                                                                                                                           |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 48, 55, alopecia, {chunk -> 1, confidence -> 0.9987, ner_source -> ner_chunk, assertion -> Absent, entity -> PROBLEM, sentence -> 0}, []}]|
+---------------------------------------------------------------------------------------------------------------------------------------------------+
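
The effect of the extra metadata can be sketched with plain dictionaries. This is an illustration only: filter_by_assertion is a hypothetical helper, and the "Present" label for the first chunk is an assumed value that does not come from the output above.

```python
def filter_by_assertion(chunks, assertions, white_list, ner_source="ner_chunk"):
    """Keep chunks whose assertion label is white-listed and record the
    assertion label and the source column in the chunk metadata."""
    allowed = set(white_list)
    kept = []
    for chunk, assertion in zip(chunks, assertions):
        if assertion in allowed:
            metadata = dict(chunk["metadata"], ner_source=ner_source, assertion=assertion)
            kept.append({**chunk, "metadata": metadata})
    return kept


text = "Patient has a headache for the last 2 weeks, no alopecia noted."


def make_chunk(phrase, entity):
    """Hypothetical helper: build a chunk record with inclusive offsets."""
    begin = text.index(phrase)
    return {"begin": begin, "end": begin + len(phrase) - 1,
            "result": phrase, "metadata": {"entity": entity}}


chunks = [make_chunk("headache", "PROBLEM"), make_chunk("alopecia", "PROBLEM")]
assertions = ["Present", "Absent"]  # "Present" is an assumed label

for kept in filter_by_assertion(chunks, assertions, white_list=["Absent"]):
    print(kept)
```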

New Chunk Mapper and Sentence Entity Resolver Models And A Pipeline for CVX

  • We are releasing 2 new chunk mapper models that map entities to their corresponding CVX codes, vaccine names and CPT codes. Three types of vaccine names are mapped: short_name, full_name and trade_name.

| model name      | description                                                                             |
|:----------------|:----------------------------------------------------------------------------------------|
| cvx_name_mapper | Mapping vaccine products to their corresponding CVX codes, vaccine names and CPT codes. |
| cvx_code_mapper | Mapping CVX codes to their corresponding vaccine names and CPT codes.                   |

Example:

chunkerMapper = ChunkMapperModel\
    .pretrained("cvx_name_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["cvx_code", "short_name", "full_name", "trade_name", "cpt_code"])

data = spark.createDataFrame([['DTaP'], ['MYCOBAX'], ['cholera, live attenuated']]).toDF('text')

Results:

+--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+
|chunk                     |cvx_code|short_name                |full_name                                                    |trade_name  |cpt_code|
+--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+
|[DTaP]                    |[20]    |[DTaP]                    |[diphtheria, tetanus toxoids and acellular pertussis vaccine]|[ACEL-IMUNE]|[90700] |
|[MYCOBAX]                 |[19]    |[BCG]                     |[Bacillus Calmette-Guerin vaccine]                           |[MYCOBAX]   |[90585] |
|[cholera, live attenuated]|[174]   |[cholera, live attenuated]|[cholera, live attenuated]                                   |[VAXCHORA]  |[90625] |
+--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+

  • sbiobertresolve_cvx: This sentence entity resolver model maps vaccine entities to CVX codes using sbiobert_base_cased_mli Sentence Bert Embeddings. Additionally, this model returns the status of the vaccine (Active/Inactive/Pending/Non-US) in the all_k_aux_labels column.

Example:

cvx_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cvx", "en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("cvx_code")\
    .setDistanceFunction("EUCLIDEAN")

result = light_model.fullAnnotate(["Sinovac", "Moderna", "BIOTHRAX"])

Results:

+----------+--------+-------------------------------------------------------+--------+
|ner_chunk |cvx_code|resolved_text                                          |Status  |
+----------+--------+-------------------------------------------------------+--------+
|Sinovac   |511     |COVID-19 IV Non-US Vaccine (CoronaVac, Sinovac)        |Non-US  |
|Moderna   |227     |COVID-19, mRNA, LNP-S, PF, pediatric 50 mcg/0.5 mL dose|Inactive|
|BIOTHRAX  |24      |anthrax                                                |Active  |
+----------+--------+-------------------------------------------------------+--------+
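
Resolution with the EUCLIDEAN distance function amounts to a nearest-neighbour search over embeddings. The toy sketch below shows that idea only; the 2-D vectors are invented for illustration (real sentence embeddings have far more dimensions), and resolve is a hypothetical helper, not the library API.

```python
import math

# Toy index: CVX codes with made-up 2-D vectors (illustrative values only).
index = [
    {"code": "511", "vector": (1.0, 0.0)},
    {"code": "227", "vector": (0.0, 1.0)},
    {"code": "24", "vector": (-1.0, 0.0)},
]


def resolve(query, index, k=1):
    """Return the k codes whose vectors are closest to the query vector,
    ranked by Euclidean distance."""
    ranked = sorted(index, key=lambda item: math.dist(query, item["vector"]))
    return [item["code"] for item in ranked[:k]]


print(resolve((0.9, 0.2), index))        # closest code only
print(resolve((0.9, 0.2), index, k=2))   # two closest codes, nearest first
```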

  • cvx_resolver_pipeline: This pretrained pipeline maps entities to their corresponding CVX codes.

Example:

from sparknlp.pretrained import PretrainedPipeline

resolver_pipeline = PretrainedPipeline("cvx_resolver_pipeline", "en", "clinical/models")

text = "The patient has a history of influenza vaccine, tetanus and DTaP"
result = resolver_pipeline.fullAnnotate(text)

Results:

+-----------------+---------+--------+
|chunk            |ner_chunk|cvx_code|
+-----------------+---------+--------+
|influenza vaccine|Vaccine  |160     |
|tetanus          |Vaccine  |35      |
|DTaP             |Vaccine  |20      |
+-----------------+---------+--------+

Updated Clinical NER Models With New Labels

ner_jsl and ner_covid_trials models were updated with the new label called “Vaccine_Name”.

Example:

...
jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")
...

sample_text = """The patient is a 21-day-old Caucasian male here for 2 days, there is no side effect observed after the influenza vaccine"""

Results:

|chunks            |   begin |   end | entities       |
|------------------|--------:|------:|:---------------|
|21-day-old        |      18 |    27 | Age            |
|Caucasian         |      29 |    37 | Race_Ethnicity |
|male              |      39 |    42 | Gender         |
|for 2 days        |      49 |    58 | Duration       |
|influenza vaccine |     100 |   116 | Vaccine_Name   |

New Certification Training Notebooks for the johnsnowlabs Library

We now have 46 new Healthcare Certification Training notebooks for users who want to use the new johnsnowlabs library.

New and Updated Notebooks

6 New Clinical Models and Pipelines Added & Updated in Total

  • cvx_code_mapper
  • cvx_name_mapper
  • sbiobertresolve_cvx
  • cvx_resolver_pipeline
  • ner_jsl
  • ner_covid_trials
