3.4.1
We are glad to announce that Spark NLP Healthcare 3.4.1 has been released!
Highlights
- Brand new Spanish deidentification NER models
- Brand new Spanish deidentification pretrained pipeline
- New clinical NER model to detect supplements
- New RxNorm sentence entity resolver model
- New
EntityChunkEmbeddings
annotator - New
MedicalBertForSequenceClassification
annotator - New
MedicalDistilBertForSequenceClassification
annotator - New
MedicalDistilBertForSequenceClassification
andMedicalBertForSequenceClassification
models - Redesign of the
ContextualParserApproach
annotator getClasses
method inRelationExtractionModel
andRelationExtractionDLModel
annotators- Label customization feature for
RelationExtractionModel
andRelationExtractionDL
models useBestModel
parameter inMedicalNerApproach
annotator- Early stopping feature in
MedicalNerApproach
annotator - Multi-Language support for faker and regex lists of
Deidentification
annotator - Spark 3.2.0 compatibility for the entire library
- Saving visualization feature in
spark-nlp-display
library - Deploying a custom Spark NLP image (for opensource, healthcare, and Spark OCR) to an enterprise version of Kubernetes: OpenShift
- New speed benchmarks table on databricks
- New & Updated Notebooks
- List of recently updated or added models
Brand New Spanish Deidentification NER Models
We trained two new NER models to find PHI data (protected health information) that may need to be deidentified in Spanish. ner_deid_generic
and ner_deid_subentity
models are trained with in-house annotations. Both also are available for using Roberta Spanish Clinical Embeddings and sciwiki 300d.
-
ner_deid_generic
: Detects 7 PHI entities in Spanish (DATE
,NAME
,LOCATION
,PROFESSION
,CONTACT
,AGE
,ID
). -
ner_deid_subentity
: Detects 13 PHI sub-entities in Spanish (PATIENT
,HOSPITAL
,DATE
,ORGANIZATION
,E-MAIL
,USERNAME
,LOCATION
,ZIP
,MEDICALRECORD
,PROFESSION
,PHONE
,DOCTOR
,AGE
).
Example :
...
embeddings = WordEmbeddingsModel.pretrained("embeddings_sciwiki_300d","es","clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
deid_ner = MedicalNerModel.pretrained("ner_deid_generic", "es", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
deid_sub_entity_ner = MedicalNerModel.pretrained("ner_deid_subentity", "es", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner_sub_entity")
...
text = """Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14/03/2020
y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.."""
result = model.transform(spark.createDataFrame([[text]], ["text"]))
Results :
| chunk | ner_deid_generic_chunk | ner_deid_subentity_chunk |
|--------------------|------------------------|--------------------------|
| Antonio Pérez Juan | NAME | PATIENT |
| Cádiz | LOCATION | LOCATION |
| España | LOCATION | LOCATION |
| 14/03/2022 | DATE | DATE |
| Clínica San Carlos | LOCATION | HOSPITAL |
Brand New Spanish Deidentification Pretrained Pipeline
We developed a clinical deidentification pretrained pipeline that can be used to deidentify PHI information from Spanish medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask, fake or obfuscate the following entities: AGE
, DATE
, PROFESSION
, E-MAIL
, USERNAME
, LOCATION
, DOCTOR
, HOSPITAL
, PATIENT
, URL
, IP
, MEDICALRECORD
, IDNUM
, ORGANIZATION
, PHONE
, ZIP
, ACCOUNT
, SSN
, PLATE
, SEX
and IPADDR
.
from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification", "es", "clinical/models")
sample_text = """Datos del paciente. Nombre: Jose . Apellidos: Aranda Martinez. NHC: 2748903. NASS: 26 37482910."""
result = deid_pipe.annotate(text)
print("\n".join(result['masked']))
print("\n".join(result['masked_with_chars']))
print("\n".join(result['masked_fixed_length_chars']))
print("\n".join(result['obfuscated']))
Results:
Masked with entity labels
------------------------------
Datos del paciente. Nombre: <PATIENT> . Apellidos: <PATIENT>. NHC: <SSN>. NASS: <SSN> <SSN>
Masked with chars
------------------------------
Datos del paciente. Nombre: [**] . Apellidos: [*************]. NHC: [*****]. NASS: [**] [******]
Masked with fixed length chars
------------------------------
Datos del paciente. Nombre: **** . Apellidos: ****. NHC: ****. NASS: **** ****
Obfuscated
------------------------------
Datos del paciente. Nombre: Sr. Lerma . Apellidos: Aristides Gonzalez Gelabert. NHC: BBBBBBBBQR648597. NASS: 041010000011 RZRM020101906017 04.
New Clinical NER Model to Detect Supplements
We are releasing ner_supplement_clinical
model that can extract benefits of using drugs for certain conditions. It can label detected entities as CONDITION
and BENEFIT
. Also this model is trained on the dataset that is released by Spacy in their HealthSea product. Here is the benchmark comparison of both versions:
Entity | Spark NLP | Spacy-HealthSea |
---|---|---|
BENEFIT | 0.8729641 | 0.8330684 |
CONDITION | 0.8339274 | 0.8333333 |
Example :
...
clinical_ner = MedicalNerModel.pretrained("ner_supplement_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner_tags")
...
results = ner_model.transform(spark.createDataFrame([["Excellent!. The state of health improves, nervousness disappears, and night sleep improves. It also promotes hair and nail growth."]], ["text"]))
Results :
+------------------------+---------------+
| chunk | ner_label |
+------------------------+---------------+
| nervousness | CONDITION |
| night sleep improves | BENEFIT |
| hair | BENEFIT |
| nail | BENEFIT |
+------------------------+---------------+
New RxNorm Sentence Entity Resolver Model
sbiobertresolve_rxnorm_augmented_re
: This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes without specifying the relations between the entities (relations are calculated on the fly inside the annotator) using sbiobert_base_cased_mli Sentence Bert Embeddings (EntityChunkEmbeddings).
Example :
...
rxnorm_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models")\
.setInputCols(["entity_chunk_embeddings"])\
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
...
New EntityChunkEmbeddings
Annotator
We have a new EntityChunkEmbeddings
annotator to compute a weighted average vector representing entity-related vectors. The model’s input usually consists of chunks of recognized named entities produced by MedicalNerModel. We can specify relations between the entities by the setTargetEntities()
parameter, and the internal Relation Extraction model finds related entities and creates a chunk. Embedding for the chunk is calculated according to the weights specified in the setEntityWeights()
parameter.
For instance, the chunk warfarin sodium 5 MG Oral Tablet
has DRUG
, STRENGTH
, ROUTE
, and FORM
entity types. Since DRUG label is the most prominent label for resolver models, now we can assign weight to prioritize DRUG label (i.e {"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2}
as shown below). In other words, embeddings of these labels are multipled by the assigned weights such as DRUG
by 0.8
.
For more details and examples, please check Sentence Entity Resolvers with EntityChunkEmbeddings Notebook in the Spark NLP workshop repo.
Example :
...
drug_chunk_embeddings = EntityChunkEmbeddings()\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunks", "dependencies"])\
.setOutputCol("drug_chunk_embeddings")\
.setMaxSyntacticDistance(3)\
.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})\
.setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "ROUTE": 0.2, "FORM": 0.2})
rxnorm_resolver = SentenceEntityResolverModel\
.pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models")\
.setInputCols(["drug_chunk_embeddings"])\
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
rxnorm_weighted_pipeline_re = Pipeline(
stages = [
documenter,
sentence_detector,
tokenizer,
embeddings,
posology_ner_model,
ner_converter,
pos_tager,
dependency_parser,
drug_chunk_embeddings,
rxnorm_resolver])
sampleText = ["The patient was given metformin 500 mg, 2.5 mg of coumadin and then ibuprofen.",
"The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG"]
data_df = spark.createDataFrame(sample_df)
results = rxnorm_weighted_pipeline_re.fit(data_df).transform(data_df)
The internal relation extraction creates the chunks here, and the embedding is computed according to the weights.
Results :
+-----+----------------+--------------------------+--------------------------------------------------+
|index| chunk|rxnorm_code_weighted_08_re| Concept_Name|
+-----+----------------+--------------------------+--------------------------------------------------+
| 0|metformin 500 mg| 860974|metformin hydrochloride 500 MG:::metformin 500 ...|
| 0| 2.5 mg coumadin| 855313|warfarin sodium 2.5 MG [Coumadin]:::warfarin so...|
| 0| ibuprofen| 1747293|ibuprofen Injection:::ibuprofen Pill:::ibuprofe...|
| 1|metformin 400 mg| 332809|metformin 400 MG:::metformin 250 MG Oral Tablet...|
| 1| coumadin 5 mg| 855333|warfarin sodium 5 MG [Coumadin]:::warfarin sodi...|
| 1| coumadin| 202421|Coumadin:::warfarin sodium 2 MG/ML Injectable S...|
| 1|amlodipine 10 MG| 308135|amlodipine 10 MG Oral Tablet:::amlodipine 10 MG...|
+-----+----------------+--------------------------+--------------------------------------------------+
New MedicalBertForSequenceClassification
Annotator
We developed a new annotator called MedicalBertForSequenceClassification
. It can load BERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.
New MedicalDistilBertForSequenceClassification
Annotator
We developed a new annotator called MedicalDistilBertForSequenceClassification
. It can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.
New MedicalDistilBertForSequenceClassification
and MedicalBertForSequenceClassification
Models
We are releasing a new MedicalDistilBertForSequenceClassification
model and three new MedicalBertForSequenceClassification
models.
-
bert_sequence_classifier_ade_biobert
: a classifier for detecting if a sentence is talking about a possible ADE (TRUE
,FALSE
) -
bert_sequence_classifier_gender_biobert
: a classifier for detecting the gender of the main subject of the sentence (MALE
,FEMALE
,UNKNOWN
) -
bert_sequence_classifier_pico_biobert
: a classifier for detecting the class of a sentence according to PICO framework (CONCLUSIONS
,DESIGN_SETTING
,INTERVENTION
,PARTICIPANTS
,FINDINGS
,MEASUREMENTS
,AIMS
)
Example :
...
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_pico", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("class")
...
sample_text = "To compare the results of recording enamel opacities using the TF and modified DDE indices."
result = sequence_clf_model.transform(spark.createDataFrame([[sample_text]]).toDF("text"))
Results :
+-------------------------------------------------------------------------------------------+-----+
|text |label|
+-------------------------------------------------------------------------------------------+-----+
|To compare the results of recording enamel opacities using the TF and modified DDE indices.|AIMS |
+-------------------------------------------------------------------------------------------+-----+
distilbert_sequence_classifier_ade
: This model is a DistilBertForSequenceClassification model for classifying clinical texts whether they contain ADE (TRUE
,FALSE
).
Example :
...
sequenceClassifier = MedicalDistilBertForSequenceClassification\
.pretrained('distilbert_sequence_classifier_ade', 'en', 'clinical/models') \
.setInputCols(['token', 'document']) \
.setOutputCol('class')
...
sample_text = "I felt a bit drowsy and had blurred vision after taking Aspirin."
result = sequence_clf_model.transform(spark.createDataFrame([[sample_text]]).toDF("text"))
Results :
+----------------------------------------------------------------+-----+
|text |label|
+----------------------------------------------------------------+-----+
|I felt a bit drowsy and had blurred vision after taking Aspirin.| True|
+----------------------------------------------------------------+-----+
Redesign of the ContextualParserApproach
Annotator
- We’ve dropped the annotator’s
contextMatch
parameter and removed the need for acontext
field when feeding a JSON configuration file to the annotator. Context information can now be fully defined using theprefix
,suffix
andcontextLength
fields in the JSON configuration file. - We’ve also fixed issues with the
contextException
field in the JSON configuration file - it was mismatching values in documents with several sentences and ignoring exceptions situated to the right of a word/token. - The
ruleScope
field in the JSON configuration file can now be set todocument
instead ofsentence
. This allows you to match multi-word entities like “New York” or “Salt Lake City”. You can do this by setting"ruleScope" : "document"
in the JSON configuration file and feeding a dictionary (csv or tsv) to the annotator with itssetDictionary
parameter. These changes also mean that we’ve dropped theupdateTokenizer
parameter since the new capabilities ofruleScope
improve the user experience for matching multi-word entities. - You can now feed in a dictionary in your chosen format - either vertical or horizontal. You can set that with the following parameter:
setDictionary("dictionary.csv", options={"orientation":"vertical"})
- Lastly, there was an improvement made to the confidence value calculation process to better measure successful hits.
For more explanation and examples, please check this Contextual Parser medium article and Contextual Parser Notebook.
getClasses
Method in RelationExtractionModel
and RelationExtractionDLModel
Annotators
Now you can use getClasses()
method for checking the relation labels of RE models (RelationExtractionModel and RelationExtractionDLModel) like MedicalNerModel().
Example :
clinical_re_Model = RelationExtractionModel()\
.pretrained("re_temporal_events_clinical", "en", 'clinical/models')\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
clinical_re_Model.getClasses()
Output :
['OVERLAP', 'BEFORE', 'AFTER']
Label Customization Feature for RelationExtractionModel
and RelationExtractionDL
Models
We are releasing label customization feature for Relation Extraction and Relation Extraction DL models by using .setCustomLabels()
parameter.
Example :
...
reModel = RelationExtractionModel.pretrained("re_ade_clinical", "en", 'clinical/models')\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["drug-ade, ade-drug"])\
.setCustomLabels({"1": "is_related", "0": "not_related"})
redl_model = RelationExtractionDLModel.pretrained('redl_ade_biobert', 'en', "clinical/models") \
.setPredictionThreshold(0.5)\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations")\
.setCustomLabels({"1": "is_related", "0": "not_related"})
...
sample_text = "I experienced fatigue and muscle cramps after taking Lipitor but no more adverse after passing Zocor."
result = model.transform(spark.createDataFrame([[sample_text]]).toDF('text'))
Results :
+-----------+-------+-------------+-------+-------+----------+
| relation|entity1| chunk1|entity2| chunk2|confidence|
+-----------+-------+-------------+-------+-------+----------+
| is_related| ADE| fatigue| DRUG|Lipitor| 0.9999825|
|not_related| ADE| fatigue| DRUG| Zocor| 0.9960077|
| is_related| ADE|muscle cramps| DRUG|Lipitor| 1.0|
|not_related| ADE|muscle cramps| DRUG| Zocor| 0.94971|
+-----------+-------+-------------+-------+-------+----------+
useBestModel
Parameter in MedicalNerApproach
Annotator
Introducing useBestModel
param in MedicalNerApproach annotator. This param preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the training.
Example :
med_ner = MedicalNerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
...
...
.setUseBestModel(True)\
Early Stopping Feature in MedicalNerApproach
Annotator
Introducing earlyStopping
feature for MedicalNerApproach(). You can stop training at the point when the perforfmance on test/validation dataset starts to degrage. Two params are added to MedicalNerApproach() in order to use this feature:
-
earlyStoppingCriterion
: (float) This is used set the minimal improvement of the test metric to terminate training. The metric monitored is the same as the metrics used inuseBestModel
(macro F1 when using test/validation set, loss otherwise). Default is 0 which means no early stopping is applied. -
earlyStoppingPatience
: (int), the number of epoch without improvement which will be tolerated. Default is 0, which means that early stopping will occur at the first time when performance in the current epoch is no better than in the previous epoch.
Example :
med_ner = MedicalNerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
...
...
.setTestDataset(test_data_parquet_path)\
.setEarlyStoppingCriterion(0.01)\
.setEarlyStoppingPatience(3)\
Multi-Language Support for Faker and Regex Lists of Deidentification
Annotator
We have a new .setLanguage()
parameter in order to use internal Faker and Regex list for multi-language texts. When you are working with German and Spanish texts for a Deidentification, you can set this parameter to de
for German and es
for Spanish. Default value of this parameter is en
.
Example :
deid_obfuscated = DeIdentification()\
.setInputCols(["sentence", "token", "ner_chunk"]) \
.setOutputCol("obfuscated") \
.setMode("obfuscate")\
.setLanguage('de')\
.setObfuscateRefSource("faker")\
Spark 3.2.0 Compatibility for the Entire Library
Now we can use the Spark 3.2.0 version for Spark NLP for Healthcare by setting spark32=True
in sparknlp_jsl.start()
function.
! pip install --ignore-installed -q pyspark==3.2.0
import sparknlp_jsl
spark = sparknlp_jsl.start(SECRET, spark32=True)
Saving Visualization Feature in spark-nlp-display
Library
We have a new save_path
parameter in spark-nlp-display
library for saving any visualization results in Spark NLP.
Example :
from sparknlp_display import NerVisualizer
visualiser = NerVisualizer()
visualiser.display(light_result[0], label_col='ner_chunk', document_col='document', save_path="display_result.html")
Deploying a Custom Spark NLP Image (for opensource, healthcare, and Spark OCR) to an Enterprise Version of Kubernetes: OpenShift
Spark NLP for opensource, healthcare, and SPARK OCR is now available for Openshift - enterprise version of Kubernetes. For deployment, please refer to:
Github Link: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/platforms/openshift
Youtube: https://www.youtube.com/watch?v=FBes-6ylFrM&ab_channel=JohnSnowLabs
New Speed Benchmarks Table on Databricks
We prepared a speed benchmark table by running a clinical BERT For Token Classification model pipeline on various number of repartitioning and writing the results to parquet or delta formats. You can find the details here : Clinical Bert For Token Classification Benchmark Experiment.
New & Updated Notebooks
- We have updated our existing workshop notebooks with v3.4.0 by adding new features and functionalities.
- You can find the workshop notebooks updated with previous versions in the branches named with the relevant version.
- We have updated the ContextualParser Notebook with the new updates in this version.
- We have a new Sentence Entity Resolvers with EntityChunkEmbeddings Notebook for the new
EntityChunkEmbeddings
annotator.
To see more, please check : Spark NLP Healthcare Workshop Repo
List of Recently Updated or Added Models
bert_sequence_classifier_ade_en
bert_sequence_classifier_gender_biobert_en
bert_sequence_classifier_pico_biobert_en
distilbert_sequence_classifier_ade_en
bert_token_classifier_ner_supplement_en
deid_pipeline_es
ner_deid_generic_es
ner_deid_generic_roberta_es
ner_deid_subentity_es
ner_deid_subentity_roberta_es
ner_nature_nero_clinical_en
ner_supplement_clinical_en
sbiobertresolve_clinical_abbreviation_acronym_en
sbiobertresolve_rxnorm_augmented_re
For all Spark NLP for healthcare models, please check : Models Hub Page
Versions
- 5.5.2
- 5.5.1
- 5.5.0
- 5.4.1
- 5.4.0
- 5.3.3
- 5.3.2
- 5.3.1
- 5.3.0
- 5.2.1
- 5.2.0
- 5.1.4
- 5.1.3
- 5.1.2
- 5.1.1
- 5.1.0
- 5.0.2
- 5.0.1
- 5.0.0
- 4.4.4
- 4.4.3
- 4.4.2
- 4.4.1
- 4.4.0
- 4.3.2
- 4.3.1
- 4.3.0
- 4.2.8
- 4.2.4
- 4.2.3
- 4.2.2
- 4.2.1
- 4.2.0
- 4.1.0
- 4.0.2
- 4.0.0
- 3.5.3
- 3.5.2
- 3.5.1
- 3.5.0
- 3.4.2
- 3.4.1
- 3.4.0
- 3.3.4
- 3.3.2
- 3.3.1
- 3.3.0
- 3.2.3
- 3.2.2
- 3.2.1
- 3.2.0
- 3.1.3
- 3.1.2
- 3.1.1
- 3.1.0
- 3.0.3
- 3.0.2
- 3.0.1
- 3.0.0
- 2.7.6
- 2.7.5
- 2.7.4
- 2.7.3
- 2.7.2
- 2.7.1
- 2.7.0
- 2.6.2
- 2.6.0
- 2.5.5
- 2.5.3
- 2.5.2
- 2.5.0
- 2.4.6
- 2.4.5
- 2.4.2
- 2.4.1
- 2.4.0