4.2.8
Highlights
- 4 new clinical named entity recognition models (3 oncology, 1 other)
- 5 new Social Determinant of Health text classification models
- New `DocumentMLClassifierApproach` annotator for training text classification models with SVM and Logistic Regression using TfIdf
- New `Resolution2Chunk` annotator to map entity resolver outputs (terminology codes) to other clinical terminologies
- New `DocMapperModel` annotator that allows using any mapper model with `DOCUMENT` type input
- Option to return `Deidentification` output as a single document
- Inter-Annotator Agreement (IAA) metrics module that works seamlessly with NLP Lab
- Assertion dataset preparation module now supports chunk start and end indices, rather than token indices
- Added `ner_source` in the `ChunkConverter` metadata
- Core improvements and bug fixes
  - Added chunk confidence score in the `RelationExtractionModel` metadata
  - Added confidence score in the `DocumentLogRegClassifierApproach` metadata
  - Fixed non-deterministic Relation Extraction DL models (30+ models updated in the Models Hub)
  - Fixed PretrainedPipelines that were incompatible with PySpark v3.2.x and v3.3.x
  - Fixed `ZIP` label issue in `faker` mode with the `setZipCodeTag` parameter in `Deidentification`
  - Fixed obfuscated numbers so they have the same number of characters as the original ones
  - Fixed name obfuscation hashes in `Deidentification` for the Romanian language
  - Fixed LightPipeline validation parameter for internal annotators
  - LightPipeline support for `GenericClassifier` (`FeaturesAssembler`)
- New and updated notebooks
- New and updated demos
  - SOCIAL DETERMINANT demo
- 9 new clinical models and pipelines added & updated in total
4 New Clinical Named Entity Recognition Models (3 Oncology, 1 Other)
- We are releasing 3 new oncology NER models that were trained with the `embeddings_healthcare_100d` embeddings model.
model name | description | predicted entities |
---|---|---|
ner_oncology_anatomy_general_healthcare | Extracts anatomical entities using an unspecific label | Anatomical_Site Direction |
ner_oncology_biomarker_healthcare | Extracts mentions of biomarkers and biomarker results in oncological texts. | Biomarker_Result Biomarker |
ner_oncology_unspecific_posology_healthcare | Extracts mentions of treatments and posology information using unspecific labels (low granularity). | Posology_Information Cancer_Therapy |
Example:
...
word_embeddings = WordEmbeddingsModel()\
.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel\
.pretrained("ner_oncology_anatomy_general_healthcare", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
text = "The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."
Result:
+------------------+----------------+
|chunk |ner_label |
+------------------+----------------+
|left |Direction |
|breast |Anatomical_Site |
|lungs |Anatomical_Site |
|liver |Anatomical_Site |
+------------------+----------------+
- We are also releasing a new clinical NER model, `ner_eu_clinical_case`, trained on data provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
Example:
...
ner = MedicalNerModel.pretrained('ner_eu_clinical_case', "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
text = """A 3-year-old boy with autistic disorder on hospital of pediatric ward A at university hospital. He has no family history of illness or autistic spectrum disorder."""
Result:
+------------------------------+------------------+
|chunk |ner_label |
+------------------------------+------------------+
|A 3-year-old boy |patient |
|autistic disorder |clinical_condition|
|He |patient |
|illness |clinical_event |
|autistic spectrum disorder |clinical_condition|
+------------------------------+------------------+
5 New Social Determinant of Health Text Classification Models
We are releasing 5 new models that can be used in Social Determinant of Health (SDOH) related classification tasks.
model name | description | predicted entities |
---|---|---|
genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli | This model detects alcohol use in clinical notes and was trained with the GenericClassifierApproach annotator. | Present, Past, Never, None |
genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli | This model detects alcohol use in clinical notes and was trained with the GenericClassifierApproach annotator. | Present, Never, None |
genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli | This model detects tobacco use in clinical notes and was trained with the GenericClassifierApproach annotator. | Present, Past, Never, None |
genericclassifier_sdoh_economics_binary_sbiobert_cased_mli | This model classifies clinical documents according to socioeconomic status and was trained with the GenericClassifierApproach annotator. | True, False |
genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli | This model detects substance use in clinical notes and was trained with the GenericClassifierApproach annotator. | Present, None |
Example:
...
features_asm = FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier_tobacco = GenericClassifierModel.pretrained("genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("class_tobacco")
generic_classifier_alcohol = GenericClassifierModel.pretrained("genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("class_alcohol")
text = ["Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes",
"The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use.",
"The patient denies any history of smoking or alcohol abuse. She lives with her one daughter.",
"She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not reported by patient, but there is apparently a history of alochol abuse."
]
Result:
+----------------------------------------------------------------------------------------------------+---------+---------+
| text| tobacco| alcohol|
+----------------------------------------------------------------------------------------------------+---------+---------+
|Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 2...|[Present]|[Present]|
|The patient quit smoking approximately two years ago with an approximately a 40 pack year history...| [Past]| [None]|
| The patient denies any history of smoking or alcohol abuse. She lives with her one daughter.| [Never]| [Never]|
|She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not report...| [None]| [Past]|
+----------------------------------------------------------------------------------------------------+---------+---------+
New DocumentMLClassifierApproach Annotator For Training Text Classification Models Using SVM And Logistic Regression With TfIdf
We have a new `DocumentMLClassifierApproach` annotator that can be used for training text classification models with Logistic Regression and SVM algorithms. The training data requires only "text" and "label" columns, and the trained model will be a `DocumentMLClassifierModel()`.
Input types: TOKEN
Output type: CATEGORY
Parameters | Description |
---|---|
labels | Array to output the label in the original form. |
labelCol | Column with the value result we are trying to predict. |
maxIter | Maximum number of iterations. |
tol | Convergence tolerance after each iteration. |
fitIntercept | Whether to fit an intercept term (default is true). |
maxTokenNgram | The maximum number of tokens for n-grams. |
minTokenNgram | The minimum number of tokens for n-grams. |
vectorizationModelPath | Specify the vectorization model if it has already been trained. |
classificationModelPath | Specify the classification model if it has already been trained. |
classificationModelClass | Specify the Spark ML classification class; possible values are `logreg` and `svm`. |
Example:
...
classifier_svm = DocumentMLClassifierApproach() \
.setInputCols("token") \
.setLabelCol("category") \
.setOutputCol("prediction") \
.setMaxTokenNgram(1)\
.setClassificationModelClass("svm") #or "logreg"
model_svm = Pipeline(stages=[document, token, classifier_svm]).fit(trainingData)
text = [
["This 1-year-old child had a gastrostomy placed due to feeding difficulties."],
["He is a pleasant young man who has a diagnosis of bulbar cerebral palsy and hypotonia."],
["The patient is a 45-year-old female whose symptoms are pain in the left shoulder and some neck pain."],
["The patient is a 61-year-old female with history of recurrent uroseptic stones."]
]
Result:
+----------------------------------------------------------------------------------------------------+----------------+
|text |prediction |
+----------------------------------------------------------------------------------------------------+----------------+
|He is a pleasant young man who has a diagnosis of bulbar cerebral palsy and hypotonia. |Neurology |
|This 1-year-old child had a gastrostomy placed due to feeding difficulties. |Gastroenterology|
|The patient is a 61-year-old female with history of recurrent uroseptic stones. |Urology |
|The patient is a 45-year-old female whose symptoms are pain in the left shoulder and some neck pain.|Orthopedic |
+----------------------------------------------------------------------------------------------------+----------------+
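The transform step that produces the table above is not shown in the snippet; here is a minimal sketch of how such predictions could be obtained from the fitted `model_svm` (the DataFrame construction and column selection are illustrative, not part of the release example):

# Wrap the sample sentences in a DataFrame with a "text" column
# and run the fitted pipeline over it.
test_df = spark.createDataFrame(text, ["text"])

predictions = model_svm.transform(test_df)
predictions.select("text", "prediction.result").show(truncate=False)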
Option To Return Deidentification Output As a Single Document
We can now return `Deidentification()` output as a single document by setting the new `setOutputAsDocument` parameter to `True`. If it is `False`, the output will be a list of sentences, as before.
Example:
deid_obfuscated = DeIdentification()\
.setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
.setOutputCol("obfuscated") \
.setMode("obfuscate")\
.setObfuscateDate(True)\
.setObfuscateRefFile('obfuscate.txt')\
.setObfuscateRefSource("file")\
.setUnnormalizedDateMode("obfuscate")\
.setOutputAsDocument(True) # or False for sentence level result
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR # 7194334 Date : 01/13/93 . Patient : Oliveira, 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street
'''
Result of .setOutputAsDocument(True):
'obfuscated': ['Record date : 2093-01-14 , Beer-Karge , M.D . , Name : Hasan Jacobi Jäckel MR # <MEDICALRECORD> Date : 01-31-1991 . Patient : Herr Anselm Trüb, 51 years-old , Record date : 2080-01-08 . Klinik St. Hedwig . <MEDICALRECORD> Keats Street']
Result of .setOutputAsDocument(False):
'obfuscated': ['Record date : 2093-02-19 , Kaul , M.D . , Name : Frauke Oestrovsky MR # <MEDICALRECORD> Date : 05-08-1971 .',
'Patient : Lars Bloch, 33 years-old , Record date : 2079-11-11 .',
'University Hospital of Düsseldorf . <MEDICALRECORD> Keats Street']
New Resolution2Chunk Annotator To Map Entity Resolver Outputs (terminology codes) To Other Clinical Terminologies
We have a new `Resolution2Chunk` annotator that maps entity resolver outputs to other clinical terminologies.
Example:
icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN")
resolver2chunk = Resolution2Chunk()\
.setInputCols(["icd10cm_code"]) \
.setOutputCol("resolver2chunk")
chunkerMapper = ChunkMapperModel.pretrained("icd10cm_snomed_mapper", "en", "clinical/models")\
.setInputCols(["resolver2chunk"])\
.setOutputCol("mappings")\
.setRels(["snomed_code"])
sample_text = """Diabetes Mellitus"""
Result:
+-----------------+-----------------+------------+-----------+
|text |ner_chunk |icd10cm_code|snomed_code|
+-----------------+-----------------+------------+-----------+
|Diabetes Mellitus|Diabetes Mellitus|E109 |170756003 |
+-----------------+-----------------+------------+-----------+
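The example above starts at the resolver stage; the stages that produce the `sentence_embeddings` column are not shown. Below is a minimal sketch of one way to feed it, assuming the term is resolved directly from the input text (as the matching `text` and `ner_chunk` values in the result suggest); the column names and the `sbiobert_base_cased_mli` embeddings choice are assumptions, not part of the example:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings

# Assumed upstream stages: the raw term is treated as a chunk-level document
# and embedded with the sentence embeddings expected by the resolver.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sentence_embeddings")

resolver_pipeline = Pipeline(stages=[
    documentAssembler, sbert_embedder, icd_resolver, resolver2chunk, chunkerMapper])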
New DocMapperModel Annotator Allows Using Any Mapper Model With DOCUMENT Type
Any `ChunkMapperModel` can be used with this new `DocMapperModel` annotator. As its name suggests, it maps short strings coming directly from `DocumentAssembler`, without needing any intermediate annotator to convert the strings into the `CHUNK` type that `ChunkMapperModel` expects.
Example:
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
model = DocMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
.setInputCols("document")\
.setOutputCol("mappings")
sample_text = "ZYVOX"
Result:
| Brand_Name | Strenth_NDC |
|:-----------|:--------------------------|
| ZYVOX | 600 mg/300mL \| 0009-4992 |
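A minimal sketch of how the two stages above could be run end to end; the pipeline wiring and LightPipeline call are standard Spark NLP usage added here for illustration, not part of the release example:

from pyspark.ml import Pipeline
from sparknlp.base import LightPipeline

# Two-stage pipeline: DocumentAssembler output feeds DocMapperModel directly,
# with no chunk-conversion annotator in between.
mapper_pipeline = Pipeline(stages=[documentAssembler, model])

empty_df = spark.createDataFrame([[""]]).toDF("text")
light_model = LightPipeline(mapper_pipeline.fit(empty_df))

# Map the brand name to its NDC information
result = light_model.fullAnnotate(sample_text)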
Inter-Annotator Agreement (IAA) metrics module that works with NLP Lab seamlessly
We added a new `get_IAA_metrics()` method to the ALAB module. This method allows you to compare and evaluate the annotations in the seed corpus, where all annotators annotate the same documents at the beginning of an annotation project. It returns all the results as CSV files. Here are the parameters:
- `spark`: SparkSession.
- `conll_dir` (str): Path to the folder that the CoNLL files are in.
- `annotator_names` (list): List of annotator names.
- `set_ref_annotator` (str): Reference annotator name. If present, all comparisons are made with respect to it; if it is `None`, all annotators are compared with each other. Default is `None`.
- `return_NerDLMetrics` (boolean): If `True`, we get the `full_chunk` and `partial_chunk_per_token` IAA metrics by using NerDLMetrics. If `False`, we get the chunk-based metrics using the `evaluate` method of the `training_log_parser` module and the token-based metrics using classification reports, then write the results to the "eval_metric_files" folder. Default is `False`.
- `save_dir` (str): Path to save the token-based results dataframes. Default is "results_token_based".
For more details and examples, please check ALAB Notebook.
Example:
alab.get_IAA_metrics(spark, conll_dir = path_to_conll_folder, annotator_names = ["annotator_1","annotator_2","annotator_3","annotator_4"], set_ref_annotator = "annotator_1", return_NerDLMetrics = False, save_dir = "./token_based_results")
Assertion dataset preparation module now supports chunk start and end indices, rather than token indices.
Here are the new features in `get_assertion_data()`:
- It now returns the `char_begin` and `char_end` indices of the chunks. These columns can be used in the `AssertionDLApproach()` annotator instead of the `token_begin` and `token_end` columns for training an Assertion Status Detection model (see the sketch after this list).
- Added the `included_task_ids` parameter that allows you to prepare the assertion model training dataframe with only the included tasks. Default is `None`.
- Added the `seed` parameter that allows you to get the same training dataframe each time when you set `unannotated_label_strategy`. Default is `None`.
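A minimal sketch of how these new options could be used; the call below assumes the existing `get_assertion_data()` interface, and every argument other than `included_task_ids` and `seed` (including the file path and label values) is illustrative rather than taken from this release:

from sparknlp_jsl.annotator import AssertionDLApproach

# Hypothetical call using the new parameters; argument names other than
# included_task_ids and seed are illustrative placeholders.
assertion_df = alab.get_assertion_data(
    spark=spark,
    input_json_path="./exported_annotations.json",  # placeholder path
    assertion_labels=["Absent", "Past"],            # placeholder labels
    relevant_ner_labels=["PROBLEM", "TREATMENT"],   # placeholder labels
    included_task_ids=[1, 2, 3],                    # new: use only these tasks
    seed=42                                         # new: reproducible dataframe when unannotated_label_strategy is set
)

# The returned dataframe now also contains char_begin / char_end columns,
# which can be fed to AssertionDLApproach instead of token indices.
assertion_trainer = AssertionDLApproach()\
    .setLabelCol("label")\
    .setInputCols(["document", "chunk", "embeddings"])\
    .setOutputCol("assertion")\
    .setStartCol("char_begin")\
    .setEndCol("char_end")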
For more details and examples, please check ALAB Notebook.
Added ner_source in the ChunkConverter Metadata
We added `ner_source` to the metadata of the `ChunkConverter` output. In this way, the sources of the chunks can be seen when multiple components produce the same NER label in the same pipeline.
Example:
...
age_contextual_parser = ContextualParserApproach() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("age_cp") \
.setJsonPath("age.json") \
.setCaseSensitive(False) \
.setPrefixAndSuffixMatch(False)
chunks_age = ChunkConverter()\
.setInputCols("age_cp")\
.setOutputCol("age_chunk")
...
sample_text = """The patient is a 28 years old female with a history of gestational diabetes mellitus was diagnosed in April 2002 in County Baptist Hospital ."""
Result:
[Annotation(chunk, 17, 18, 28, {'tokenIndex': '4', 'entity': 'Age', 'field': 'Age', 'ner_source': 'age_chunk', 'chunk': '0', 'normalized': '', 'sentence': '0', 'confidenceValue': '0.74'})]
Core Improvements and Bug Fixes
- Added chunk confidence score in the `RelationExtractionModel` metadata
- Added confidence score in the `DocumentLogRegClassifierApproach` metadata
- Fixed non-deterministic Relation Extraction DL models (30+ models updated in the Models Hub)
- Fixed PretrainedPipelines that were incompatible with PySpark v3.2.x and v3.3.x
- Fixed `ZIP` label issue in `faker` mode with the `setZipCodeTag` parameter in `Deidentification`
- Fixed obfuscated numbers so they have the same number of characters as the original ones
- Fixed name obfuscation hashes in `Deidentification` for the Romanian language
- Fixed LightPipeline validation parameter for internal annotators
- LightPipeline support for `GenericClassifier` (`FeaturesAssembler`)
New and Updated Notebooks
- New Clinical Text Classification with Spark_NLP Notebook shows how to classify medical text with ClassifierDL, MultiClassifierDL, GenericClassifier, and DocumentLogRegClassifier
- New Clinical Text Classification with DocumentMLClassifier Notebook shows how to classify medical text with DocumentMLClassifier
- Updated ALAB Notebook with the changes in `get_assertion_data()` and the new `get_IAA_metrics()` method
New and Updated Demos
- SOCIAL DETERMINANT demo
9 New Clinical Models and Pipelines Added & Updated in Total
ner_oncology_anatomy_general_healthcare
ner_oncology_biomarker_healthcare
ner_oncology_unspecific_posology_healthcare
ner_eu_clinical_case
genericclassifier_sdoh_economics_binary_sbiobert_cased_mli
genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli
genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli
genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli
genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli
For all Spark NLP for Healthcare models, please check: Models Hub Page