4.2.1
Highlights
- Creating new chunks with `NerConverterInternal` by merging chunks and skipping stopwords in between
- Adding relation direction to `RelationExtraction` models to make the relations direction-aware
- Using proper regional date formats in the `DeIdentification` module
- Being able to play with different date formats in the `DateNormalizer` output
- New `Replacer` annotator to replace chunks with their normalized versions (`DateNormalizer`) in documents
- New `ModelTracer` helper class to generate and add model UIDs and timestamps of the stages in a pipeline
- Added entity source and labels to the `AssertionFilterer` metadata
- New chunk mapper and sentence entity resolver models and a pipeline for CVX
- Updated clinical NER models with new labels
- New Certification Training notebooks for the `johnsnowlabs` library
- New and updated notebooks
- 6 new clinical models and pipelines added & updated in total
Creating New Chunks with NerConverterInternal by Merging Chunks and Skipping Stopwords in Between
NerConverterInternal's new setIgnoreStopWords parameter allows merging adjacent chunks that have the same label, ignoring any stopwords and punctuation between them.
txt = """ The qualified manufacturers for this starting material are:
Alpha Chemicals Pvt LTD
17, R K Industry House, Walbhat Rd, Goregaon – 400063
Mumbai, Maharashtra, India
Beta Chemical Co., Ltd
Huan Cheng Xi Lu 3111hao Hai Guan Da Ting
Shanghai, China """
Example for default:
NerConverterInternal()\
.setInputCols(["sentence", "token", "ner_deid"])\
.setOutputCol("chunk_deid")\
.setGreedyMode(True)\
.setWhiteList(['LOCATION'])
Results:
| chunks | entities | begin | end |
|:-------------------------|:---------|------:|----:|
| R K Industry House | LOCATION | 90 | 107 |
| Walbhat | LOCATION | 110 | 116 |
| Mumbai | LOCATION | 141 | 146 |
| Maharashtra | LOCATION | 149 | 159 |
| India | LOCATION | 162 | 166 |
| Huan Cheng Xi Lu 3111hao | LOCATION | 191 | 214 |
| Shanghai | LOCATION | 234 | 241 |
| China | LOCATION | 244 | 248 |
Example with the setIgnoreStopWords parameter set:
NerConverterInternal()\
.setInputCols(["sentence", "token", "ner_deid"])\
.setOutputCol("chunk_deid")\
.setGreedyMode(True)\
.setWhiteList(['LOCATION'])\
.setIgnoreStopWords(['\n', ',', "and", 'or', '.'])
Results:
| chunks | entities | begin | end |
|:---------------------------|:---------|------:|----:|
| R K Industry House Walbhat | LOCATION | 90 | 116 |
| Mumbai Maharashtra India | LOCATION | 141 | 166 |
| Huan Cheng Xi Lu 3111hao | LOCATION | 191 | 214 |
| Shanghai China | LOCATION | 234 | 248 |
Adding Relation Direction to RelationExtraction Models to Make the Relations Direction-Aware
The new setRelationDirectionCol parameter lets you train with a separate column that specifies the direction of each relation. The column should contain one of the following values:
- rightwards: The first entity in the text is the first argument of the relation (and the second entity in the text is the second argument); the relation arguments are ordered left to right in the text.
- leftwards: The first entity in the text is the second argument of the relation (and the second entity in the text is the first argument).
- both: Order doesn't matter (the relation is symmetric).
In our test cases, accuracy increased significantly just by adding the setRelationDirectionCol parameter and keeping all other parameters as they were.
Example:
+--------------------+---------+---------+--------------------+----+----------+
| chunk1| label1| label2| chunk2| rel| rel_dir|
+--------------------+---------+---------+--------------------+----+----------+
|expected long ter...|treatment|treatment| a picc line| O| both|
| light-headedness| problem| problem| diaphoresis| PIP|rightwards|
| po pain medications|treatment| problem| his pain|TrAP| leftwards|
|bilateral pleural...| problem| problem|increased work of...| PIP|rightwards|
| her urine output| test| problem| decreased|TeRP|rightwards|
|his psychiatric i...| problem| problem|his neurologic in...| PIP|rightwards|
| white blood cells| test| test| red blood cells| O| both|
| chloride| test| test| bun| O| both|
| further work-up| test| problem|his neurologic co...|TeCP|rightwards|
| four liters|treatment| test| blood pressure| O| both|
+--------------------+---------+---------+--------------------+----+----------+
re_approach_with_dir = RelationExtractionApproach()\
.setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setLabelColumn("rel")\
...
.setRelationDirectionCol("rel_dir")
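The direction column is an ordinary string column on the training DataFrame. A minimal sketch of attaching one (train_data and gold_direction are hypothetical names standing in for your own data and annotation columns):
from pyspark.sql import functions as F

# "rel_dir" matches the name passed to setRelationDirectionCol above
train_data = train_data.withColumn(
    "rel_dir",
    F.when(F.col("rel") == "O", F.lit("both"))  # unrelated pairs are symmetric
     .otherwise(F.col("gold_direction"))        # "rightwards" / "leftwards" / "both"
)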
Using Proper Regional Date Formats in the DeIdentification Module
With the new setRegion parameter, you can specify the regional format used for date entities that will be shifted to a new date or converted to a year.
de_identification = DeIdentification() \
.setInputCols(["ner_chunk", "token", "sentence"]) \
.setOutputCol("dei_id") \
.setRegion('us') # 'eu' for Europe
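A fuller sketch combining the new setRegion with DeIdentification's existing obfuscation parameters (setMode, setObfuscateDate, and setDays; the values here are illustrative):
de_identification = (
    DeIdentification()
    .setInputCols(["ner_chunk", "token", "sentence"])
    .setOutputCol("dei_id")
    .setMode("obfuscate")     # replace entities with fake values instead of masking
    .setObfuscateDate(True)   # obfuscate date entities as well
    .setDays(5)               # shift dates by 5 days
    .setRegion("eu")          # parse dates day-first (DD/MM/YYYY)
)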
Being Able to Play with Different Date Formats in DateNormalizer Output
Now we can customize the normalized date format in the output of DateNormalizer with the new setOutputDateformat parameter. There are two options: us for the MM/DD/YYYY format and eu for the DD/MM/YYYY format.
Example:
date_normalizer_us = DateNormalizer()\
.setInputCols('date_chunk')\
.setOutputCol('normalized_date_us')\
.setOutputDateformat('us')
date_normalizer_eu = DateNormalizer()\
.setInputCols('date_chunk')\
.setOutputCol('normalized_date_eu')\
.setOutputDateformat('eu')
sample_text = ['She was last seen in the clinic on Jan 30, 2018, by Dr. Y.',
'Chris Brown was discharged on 12Mar2021',
'We reviewed the pathology obtained on 13.04.1999.']
Results:
+----------------------------------------------------------+------------+------------------+------------------+
|text |date_chunk |normalized_date_eu|normalized_date_us|
+----------------------------------------------------------+------------+------------------+------------------+
|She was last seen in the clinic on Jan 30, 2018, by Dr. Y.|Jan 30, 2018|30/01/2018 |01/30/2018 |
|Chris Brown was discharged on 12Mar2021                   |12Mar2021   |12/03/2021        |03/12/2021        |
|We reviewed the pathology obtained on 13.04.1999. |13.04.1999 |13/04/1999 |04/13/1999 |
+----------------------------------------------------------+------------+------------------+------------------+
New Replacer Annotator to Replace Chunks with Their Normalized Versions (DateNormalizer) in Documents
The new Replacer annotator returns the original document with the original chunks replaced by their normalized versions.
Example:
date_normalizer = DateNormalizer()\
.setInputCols('date_chunk')\
.setOutputCol('normalized_date')
replacer = Replacer()\
.setInputCols(["normalized_date","document"])\
.setOutputCol("replaced_document")
sample_text = ['She was last seen in the clinic on Jan 30, 2018, by Dr. Y.',
'Chris Brown was discharged on 12Mar2021',
'We reviewed the pathology obtained on 13.04.1999.']
Results:
+----------------------------------------------------------+---------------+--------------------------------------------------------+
|text |normalized_date|replaced_document |
+----------------------------------------------------------+---------------+--------------------------------------------------------+
|She was last seen in the clinic on Jan 30, 2018, by Dr. Y.|2018/01/30 |She was last seen in the clinic on 2018/01/30, by Dr. Y.|
|Chris Brown was discharged on 12Mar2021 |2021/03/12 |Chris Brown was discharged on 2021/03/12 |
|We reviewed the pathology obtained on 13.04.1999. |1999/04/13 |We reviewed the pathology obtained on 1999/04/13. |
+----------------------------------------------------------+---------------+--------------------------------------------------------+
New ModelTracer Helper Class to Generate and Add Model UID and Timestamps of the Stages in a Pipeline
ModelTracer lets you track the UID and timestamp of each stage in a pipeline.
Example:
from sparknlp_jsl.modelTracer import ModelTracer
...
pipeline = Pipeline(
stages=[
documentAssembler,
tokenizer,
tokenClassifier,
])
df = pipeline.fit(data).transform(data)
result = ModelTracer().addUidCols(pipeline = pipeline, df = df)
result.show(truncate=False)
Results:
+----+--------+-----+---+----------------------------------------------------------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------+
|text|document|token|ner|documentassembler_model_uid |tokenizer_model_uid |bert_for_token_classification_model_uid |
+----+--------+-----+---+----------------------------------------------------------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------+
|... |... |... |...|{uid -> DocumentAssembler_a666efd1d789, timestamp -> 2022-10-21_11:34}|{uid -> Tokenizer_01fbad79f069, timestamp -> 2022-10-21_11:34}|{uid -> BERT_FOR_TOKEN_CLASSIFICATION_675a6a750b89, timestamp -> 2022-10-21_11:34}|
+----+--------+-----+---+----------------------------------------------------------------------+--------------------------------------------------------------+----------------------------------------------------------------------------------+
Added Entity Source and Labels to the AssertionFilterer Metadata
Now the AssertionFilterer annotator returns the entity source and assertion labels in the metadata.
Example:
assertionFilterer = AssertionFilterer() \
.setInputCols(["sentence","ner_chunk","assertion"]) \
.setOutputCol("filtered") \
.setCriteria("assertion") \
.setWhiteList(["Absent"])
text = "Patient has a headache for the last 2 weeks, no alopecia noted."
Results:
# before v4.2.1
+-----------------------------------------------------------------------------------------------------+
|filtered |
+-----------------------------------------------------------------------------------------------------+
|[{chunk, 48, 55, alopecia, {entity -> PROBLEM, sentence -> 0, chunk -> 1, confidence -> 0.9988}, []}]|
+-----------------------------------------------------------------------------------------------------+
# v4.2.1
+---------------------------------------------------------------------------------------------------------------------------------------------------+
|filtered |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
|[{chunk, 48, 55, alopecia, {chunk -> 1, confidence -> 0.9987, ner_source -> ner_chunk, assertion -> Absent, entity -> PROBLEM, sentence -> 0}, []}]|
+---------------------------------------------------------------------------------------------------------------------------------------------------+
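To pull the new metadata fields out into plain columns, something like the following works (a sketch; result is assumed to be the DataFrame holding the filterer's output):
import pyspark.sql.functions as F

result.select(F.explode("filtered").alias("f")) \
      .select(F.col("f.result").alias("chunk"),
              F.col("f.metadata")["entity"].alias("entity"),
              F.col("f.metadata")["assertion"].alias("assertion"),
              F.col("f.metadata")["ner_source"].alias("ner_source")) \
      .show(truncate=False)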
New Chunk Mapper and Sentence Entity Resolver Models and a Pipeline for CVX
- We are releasing 2 new chunk mapper models to map entities to their corresponding CVX codes, vaccine names, and CPT codes. There are 3 types of vaccine names mapped: short_name, full_name, and trade_name.
| model name | description |
|---|---|
| cvx_name_mapper | Mapping vaccine products to their corresponding CVX codes, vaccine names and CPT codes. |
| cvx_code_mapper | Mapping CVX codes to their corresponding vaccine names and CPT codes. |
Example:
chunkerMapper = ChunkMapperModel\
.pretrained("cvx_name_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setRels(["cvx_code", "short_name", "full_name", "trade_name", "cpt_code"])
data = spark.createDataFrame([['DTaP'], ['MYCOBAX'], ['cholera, live attenuated']]).toDF('text')
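The mapper expects CHUNK-type input, so a minimal way to run the snippet above end to end is to feed each text in as a single chunk via Doc2Chunk (a sketch, assuming the common standalone-mapper pattern):
from sparknlp.base import DocumentAssembler, Doc2Chunk
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
doc2chunk = Doc2Chunk().setInputCols(["document"]).setOutputCol("ner_chunk")

mapper_pipeline = Pipeline(stages=[document_assembler, doc2chunk, chunkerMapper])
result = mapper_pipeline.fit(data).transform(data)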
Results:
+--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+
|chunk |cvx_code|short_name |full_name |trade_name |cpt_code|
+--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+
|[DTaP] |[20] |[DTaP] |[diphtheria, tetanus toxoids and acellular pertussis vaccine]|[ACEL-IMUNE]|[90700] |
|[MYCOBAX] |[19] |[BCG] |[Bacillus Calmette-Guerin vaccine] |[MYCOBAX] |[90585] |
|[cholera, live attenuated]|[174] |[cholera, live attenuated]|[cholera, live attenuated] |[VAXCHORA] |[90625] |
+--------------------------+--------+--------------------------+-------------------------------------------------------------+------------+--------+
- sbiobertresolve_cvx: This sentence entity resolver model maps vaccine entities to CVX codes using sbiobert_base_cased_mli Sentence Bert Embeddings. Additionally, this model returns the status of the vaccine (Active/Inactive/Pending/Non-US) in the all_k_aux_labels column.
Example:
cvx_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cvx", "en", "clinical/models")\
.setInputCols(["sbert_embeddings"])\
.setOutputCol("cvx_code")\
.setDistanceFunction("EUCLIDEAN")
result = light_model.fullAnnotate(["Sinovac", "Moderna", "BIOTHRAX"])
Results:
+----------+--------+-------------------------------------------------------+--------+
|ner_chunk |cvx_code|resolved_text |Status |
+----------+--------+-------------------------------------------------------+--------+
|Sinovac |511 |COVID-19 IV Non-US Vaccine (CoronaVac, Sinovac) |Non-US |
|Moderna |227 |COVID-19, mRNA, LNP-S, PF, pediatric 50 mcg/0.5 mL dose|Inactive|
|BIOTHRAX |24 |anthrax |Active |
+----------+--------+-------------------------------------------------------+--------+
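The resolver snippet above assumes a fitted pipeline wrapped in a LightPipeline. A minimal sketch of that setup, following the usual sentence resolver pattern (here the input strings are treated directly as chunks):
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import BertSentenceEmbeddings
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("sbert_embeddings")

resolver_pipeline = Pipeline(stages=[document_assembler, sbert_embedder, cvx_resolver])
light_model = LightPipeline(resolver_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))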
- cvx_resolver_pipeline: This pretrained pipeline maps vaccine entities to their corresponding CVX codes.
Example:
from sparknlp.pretrained import PretrainedPipeline
resolver_pipeline = PretrainedPipeline("cvx_resolver_pipeline", "en", "clinical/models")
text= "The patient has a history of influenza vaccine, tetanus and DTaP"
result = resolver_pipeline.fullAnnotate(text)
Results:
+-----------------+---------+--------+
|chunk |ner_chunk|cvx_code|
+-----------------+---------+--------+
|influenza vaccine|Vaccine |160 |
|tetanus |Vaccine |35 |
|DTaP |Vaccine |20 |
+-----------------+---------+--------+
Updated Clinical NER Models With New Labels
The ner_jsl and ner_covid_trials models were updated with the new "Vaccine_Name" label.
Example:
...
jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("jsl_ner")
...
sample_text= """The patient is a 21-day-old Caucasian male here for 2 days, there is no side effect observed after the influenza vaccine"""
Results:
|chunks | begin | end | entities |
|------------------|--------:|------:|:---------------|
|21-day-old | 18 | 27 | Age |
|Caucasian | 29 | 37 | Race_Ethnicity |
|male | 39 | 42 | Gender |
|for 2 days | 49 | 58 | Duration |
|influenza vaccine | 100 | 116 | Vaccine_Name |
New Certification Training Notebooks for the johnsnowlabs Library
Now we have 46 new Healthcare Certification Training notebooks for users who want to use the new johnsnowlabs library.
New and Updated Notebooks
- New Coreference Resolution notebook to find other references of clinical entities in a document.
- Updated Clinical Named Entity Recognition Model notebook with the new setIgnoreStopWords parameter and the ModelTracer module.
- Updated Clinical Assertion Model notebook with the new AssertionFilterer improvements.
- Updated Clinical Deidentification notebook with the new setRegion parameter in DeIdentification.
- Updated Clinical Relation Extraction notebook with the new setRelationDirectionCol parameter in RelationExtractionApproach.
- Updated Date Normalizer notebook with the new setOutputDateformat parameter in DateNormalizer and the Replacer annotator.
- Updated 25 Certification Training Public notebooks and 47 Certification Training Healthcare notebooks with the latest updates in the libraries.
- Updated 6 Databricks Public notebooks and 14 Databricks Healthcare notebooks with the latest updates in the libraries, and created 4 new Databricks notebooks.
6 New Clinical Models and Pipelines Added & Updated in Total
- cvx_code_mapper
- cvx_name_mapper
- sbiobertresolve_cvx
- cvx_resolver_pipeline
- ner_jsl
- ner_covid_trials