4.2.8
Highlights
- 4 new clinical named entity recognition models (3 oncology, 1 other)
- 5 new Social Determinant of Health text classification models
- New DocumentMLClassifierApproach annotator for training text classification models using SVM and Logistic Regression with TfIdf
- New Resolution2Chunk annotator to map entity resolver outputs (terminology codes) to other clinical terminologies
- New DocMapperModel annotator that allows using any mapper model with DOCUMENT type input
- Option to return Deidentification output as a single document
- Inter-Annotator Agreement (IAA) metrics module that works seamlessly with NLP Lab
- Assertion dataset preparation module now supports chunk start and end indices, rather than token indices
- Added ner_source in the ChunkConverter metadata
- Core improvements and bug fixes
- Added chunk confidence scores in the RelationExtractionModel metadata
- Added confidence scores in the DocumentLogRegClassifierApproach metadata
- Fixed non-deterministic Relation Extraction DL models (30+ models updated in the Models Hub)
- Fixed PretrainedPipelines that were incompatible with PySpark v3.2.x and v3.3.x
- Fixed the ZIP label issue in faker mode with the setZipCodeTag parameter in Deidentification
- Fixed obfuscated numbers so that they have the same number of characters as the original ones
- Fixed name obfuscation hashes in Deidentification for the Romanian language
- Fixed the LightPipeline validation parameter for internal annotators
- LightPipeline support for GenericClassifier (FeaturesAssembler)
- New and updated notebooks
- New and updated demos
- SOCIAL DETERMINANT demo
- 9 new clinical models and pipelines added & updated in total
4 New Clinical Named Entity Recognition Models (3 Oncology, 1 Other)
- We are releasing 3 new oncology NER models that were trained using the embeddings_healthcare_100d embeddings model.
| model name | description | predicted entities |
|---|---|---|
| ner_oncology_anatomy_general_healthcare | Extracts anatomical entities using an unspecific label. | Anatomical_Site, Direction |
| ner_oncology_biomarker_healthcare | Extracts mentions of biomarkers and biomarker results in oncological texts. | Biomarker_Result, Biomarker |
| ner_oncology_unspecific_posology_healthcare | Extracts mentions of treatments and posology information using unspecific labels (low granularity). | Posology_Information, Cancer_Therapy |
Example:
...
word_embeddings = WordEmbeddingsModel()\
.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel\
.pretrained("ner_oncology_anatomy_general_healthcare", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
text = "The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."
Result:
+------------------+----------------+
|chunk |ner_label |
+------------------+----------------+
|left |Direction |
|breast |Anatomical_Site |
|lungs |Anatomical_Site |
|liver |Anatomical_Site |
+------------------+----------------+
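For reference, a full pipeline around this model could be assembled as sketched below. The DocumentAssembler, SentenceDetectorDLModel, Tokenizer and NerConverterInternal stages stand in for the parts elided by `...` above and are illustrative choices rather than part of the original snippet; `word_embeddings` and `ner` are the stages defined above.

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer
from sparknlp_jsl.annotator import NerConverterInternal

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Group the IOB tags produced by the NER model into labeled chunks
ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer,
                            word_embeddings, ner, ner_converter])

data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```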
- We are also releasing a new clinical NER model, ner_eu_clinical_case. The data used for model training is provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
Example:
...
ner = MedicalNerModel.pretrained('ner_eu_clinical_case', "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
text = """A 3-year-old boy with autistic disorder on hospital of pediatric ward A at university hospital. He has no family history of illness or autistic spectrum disorder."""
Result:
+------------------------------+------------------+
|chunk |ner_label |
+------------------------------+------------------+
|A 3-year-old boy |patient |
|autistic disorder |clinical_condition|
|He |patient |
|illness |clinical_event |
|autistic spectrum disorder |clinical_condition|
+------------------------------+------------------+
5 New Social Determinant of Health Text Classification Models
We are releasing 5 new models that can be used in Social Determinant of Health related classification tasks.
| model name | description | predicted entities |
|---|---|---|
| genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli | This model detects alcohol use in clinical notes and was trained using the GenericClassifierApproach annotator. | Present, Past, Never, None |
| genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli | This model detects alcohol use in clinical notes and was trained using the GenericClassifierApproach annotator. | Present, Never, None |
| genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli | This model detects tobacco use in clinical notes and was trained using the GenericClassifierApproach annotator. | Present, Past, Never, None |
| genericclassifier_sdoh_economics_binary_sbiobert_cased_mli | This model classifies clinical documents with respect to socioeconomic status and was trained using the GenericClassifierApproach annotator. | True, False |
| genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli | This model detects substance use in clinical notes and was trained using the GenericClassifierApproach annotator. | Present, None |
Example:
...
features_asm = FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier_tobacco = GenericClassifierModel.pretrained("genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("class_tobacco")
generic_classifier_alcohol = GenericClassifierModel.pretrained("genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("class_alcohol")
text = ["Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes",
"The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use.",
"The patient denies any history of smoking or alcohol abuse. She lives with her one daughter.",
"She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not reported by patient, but there is apparently a history of alochol abuse."
]
Result:
+----------------------------------------------------------------------------------------------------+---------+---------+
| text| tobacco| alcohol|
+----------------------------------------------------------------------------------------------------+---------+---------+
|Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 2...|[Present]|[Present]|
|The patient quit smoking approximately two years ago with an approximately a 40 pack year history...| [Past]| [None]|
| The patient denies any history of smoking or alcohol abuse. She lives with her one daughter.| [Never]| [Never]|
|She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not report...| [None]| [Past]|
+----------------------------------------------------------------------------------------------------+---------+---------+
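The `...` above elides the stages that produce the `sentence_embeddings` column consumed by the FeaturesAssembler. A minimal sketch, assuming the sbiobert_base_cased_mli sentence embeddings that the model names suggest:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_embeddings = BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

pipeline = Pipeline(stages=[document_assembler, sentence_embeddings, features_asm,
                            generic_classifier_tobacco, generic_classifier_alcohol])

data = spark.createDataFrame([[t] for t in text]).toDF("text")
result = pipeline.fit(data).transform(data)
```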
New DocumentMLClassifierApproach Annotator for Training Text Classification Models Using SVM and Logistic Regression with TfIdf
We have a new DocumentMLClassifierApproach annotator that can be used for training text classification models with Logistic Regression and SVM algorithms. The training data requires only "text" and "label" columns, and the trained model will be a DocumentMLClassifierModel().
Input types: TOKEN
Output type: CATEGORY
| Parameters | Description |
|---|---|
| labels | array to output the label in the original form. |
| labelCol | column with the value result we are trying to predict. |
| maxIter | maximum number of iterations. |
| tol | convergence tolerance after each iteration. |
| fitIntercept | whether to fit an intercept term, default is true. |
| maxTokenNgram | the max number of tokens for Ngrams |
| minTokenNgram | the min number of tokens for Ngrams |
| vectorizationModelPath | specify the vectorization model if it has been already trained. |
| classificationModelPath | specify the classification model if it has been already trained. |
| classificationModelClass | specify the SparkML classification class; possible values are logreg, svm |
Example:
...
classifier_svm = DocumentMLClassifierApproach() \
.setInputCols("token") \
.setLabelCol("category") \
.setOutputCol("prediction") \
.setMaxTokenNgram(1)\
.setClassificationModelClass("svm") #or "logreg"
model_svm = Pipeline(stages=[document, token, classifier_svm]).fit(trainingData)
text = [
["This 1-year-old child had a gastrostomy placed due to feeding difficulties."],
["He is a pleasant young man who has a diagnosis of bulbar cerebral palsy and hypotonia."],
["The patient is a 45-year-old female whose symptoms are pain in the left shoulder and some neck pain."],
["The patient is a 61-year-old female with history of recurrent uroseptic stones."]
]
Result:
+----------------------------------------------------------------------------------------------------+----------------+
|text |prediction |
+----------------------------------------------------------------------------------------------------+----------------+
|He is a pleasant young man who has a diagnosis of bulbar cerebral palsy and hypotonia. |Neurology |
|This 1-year-old child had a gastrostomy placed due to feeding difficulties. |Gastroenterology|
|The patient is a 61-year-old female with history of recurrent uroseptic stones. |Urology |
|The patient is a 45-year-old female whose symptoms are pain in the left shoulder and some neck pain.|Orthopedic |
+----------------------------------------------------------------------------------------------------+----------------+
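Since `model_svm` is a fitted Spark ML PipelineModel, it can be applied to new documents with a regular `transform()` call; a brief sketch using the sample texts above:

```python
# Score new documents with the trained SVM classifier
new_df = spark.createDataFrame(text, ["text"])
predictions = model_svm.transform(new_df)
predictions.select("text", "prediction.result").show(truncate=False)
```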
Option To Return Deidentification Output As a Single Document
We can now return Deidentification() output as a single document by setting the new setOutputAsDocument parameter to True. If it is set to False, the output will be a list of sentences, as it used to be.
Example:
deid_obfuscated = DeIdentification()\
.setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
.setOutputCol("obfuscated") \
.setMode("obfuscate")\
.setObfuscateDate(True)\
.setObfuscateRefFile('obfuscate.txt')\
.setObfuscateRefSource("file")\
.setUnnormalizedDateMode("obfuscate")\
.setOutputAsDocument(True) # or False for sentence level result
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR # 7194334 Date : 01/13/93 . Patient : Oliveira, 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street
'''
Result of .setOutputAsDocument(True):
'obfuscated': ['Record date : 2093-01-14 , Beer-Karge , M.D . , Name : Hasan Jacobi Jäckel MR # <MEDICALRECORD> Date : 01-31-1991 . Patient : Herr Anselm Trüb, 51 years-old , Record date : 2080-01-08 . Klinik St. Hedwig . <MEDICALRECORD> Keats Street']
Result of .setOutputAsDocument(False):
'obfuscated': ['Record date : 2093-02-19 , Kaul , M.D . , Name : Frauke Oestrovsky MR # <MEDICALRECORD> Date : 05-08-1971 .',
'Patient : Lars Bloch, 33 years-old , Record date : 2079-11-11 .',
'University Hospital of Düsseldorf . <MEDICALRECORD> Keats Street']
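For completeness, the `ner_chunk_subentity` column above is produced by an upstream NER and chunking stage that is not shown in the release example. A hedged sketch of such a pipeline; the choice of ner_deid_subentity_augmented and the other stage names are illustrative assumptions:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

word_embeddings = WordEmbeddingsModel\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Subentity-level deidentification NER model (illustrative choice)
deid_ner = MedicalNerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_deid")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_deid"])\
    .setOutputCol("ner_chunk_subentity")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer,
                            word_embeddings, deid_ner, ner_converter, deid_obfuscated])

data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```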
New Resolution2Chunk Annotator To Map Entity Resolver Outputs (terminology codes) To Other Clinical Terminologies
We have a new Resolution2Chunk annotator that maps the entity resolver outputs to other clinical terminologies.
Example:
icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN")
resolver2chunk = Resolution2Chunk()\
.setInputCols(["icd10cm_code"]) \
.setOutputCol("resolver2chunk")
chunkerMapper = ChunkMapperModel.pretrained("icd10cm_snomed_mapper", "en", "clinical/models")\
.setInputCols(["resolver2chunk"])\
.setOutputCol("mappings")\
.setRels(["snomed_code"])
sample_text = """Diabetes Mellitus"""
Result:
+-----------------+-----------------+------------+-----------+
|text |ner_chunk |icd10cm_code|snomed_code|
+-----------------+-----------------+------------+-----------+
|Diabetes Mellitus|Diabetes Mellitus|E109 |170756003 |
+-----------------+-----------------+------------+-----------+
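A sketch of how these stages can be chained end to end for a short input such as the sample text; the DocumentAssembler and BertSentenceEmbeddings stages are assumptions standing in for the parts not shown above:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings

# Treat the incoming short string as the chunk to be resolved
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sentence_embeddings")

pipeline = Pipeline(stages=[document_assembler, sbert_embedder,
                            icd_resolver, resolver2chunk, chunkerMapper])

data = spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)
```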
New DocMapperModel Annotator Allows Using Any Mapper Model with DOCUMENT Type Input
Any mapper model can be used with this new annotator, DocMapperModel. As its name suggests, it maps short strings fed in through DocumentAssembler directly, without requiring any intermediate annotator to convert the strings to the CHUNK type that ChunkMapperModel expects.
Example:
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
model = DocMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
.setInputCols("document")\
.setOutputCol("mappings")
sample_text = "ZYVOX"
Result:
| Brand_Name | Strenth_NDC |
|:-------------|:-------------------------|
| ZYVOX | 600 mg/300mL \| 0009-4992 |
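Because DocMapperModel consumes DOCUMENT type directly, the whole pipeline consists of just the two stages above. A brief sketch of running it through a LightPipeline (fitting on an empty dataframe is the usual idiom for pretrained-only pipelines):

```python
from pyspark.ml import Pipeline
from sparknlp.base import LightPipeline

pipeline = Pipeline(stages=[documentAssembler, model])
empty_df = spark.createDataFrame([[""]]).toDF("text")

light_model = LightPipeline(pipeline.fit(empty_df))
light_model.fullAnnotate(sample_text)  # returns the mapping shown above
```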
Inter-Annotator Agreement (IAA) metrics module that works with NLP Lab seamlessly
We added a new get_IAA_metrics() method to the ALAB module. This method allows you to compare and evaluate the annotations in the seed corpus (the same documents annotated by all annotators at the beginning of an annotation project). It returns all the results as CSV files. Here are the parameters:
- spark: SparkSession.
- conll_dir (str): path to the folder containing the CoNLL files.
- annotator_names (list): list of annotator names.
- set_ref_annotator (str): reference annotator name. If present, all comparisons are made with respect to it; if it is None, all annotators are compared with each other. Default is None.
- return_NerDLMetrics (boolean): If True, the full_chunk and partial_chunk_per_token IAA metrics are computed using NerDLMetrics. If False, the chunk-based metrics are computed with the evaluate method of the training_log_parser module and the token-based metrics with classification reports, and the results are written to the "eval_metric_files" folder. Default is False.
- save_dir (str): path to save the token-based result dataframes; default is "results_token_based".
For more details and examples, please check ALAB Notebook.
Example:
alab.get_IAA_metrics(spark, conll_dir = path_to_conll_folder, annotator_names = ["annotator_1","annotator_2","annotator_3","annotator_4"], set_ref_annotator = "annotator_1", return_NerDLMetrics = False, save_dir = "./token_based_results")
Assertion dataset preparation module now supports chunk start and end indices, rather than token indices
Here are the new features in get_assertion_data():
- It now returns the char_begin and char_end indices of the chunks. These columns can be used in the AssertionDLApproach() annotator, instead of the token_begin and token_end columns, for training an Assertion Status Detection model.
- Added the included_task_ids parameter that allows you to prepare the assertion model training dataframe with only the included tasks. Default is None.
- Added the seed parameter that allows you to get the same training dataframe every time when you set unannotated_label_strategy. Default is None.
For more details and examples, please check ALAB Notebook.
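A hedged sketch of how the new columns and parameters could fit together; apart from included_task_ids and seed, the get_assertion_data() arguments, the export path, and the label column name are illustrative assumptions, and alab refers to the ALAB module helper used in the notebook:

```python
from sparknlp_jsl.annotator import AssertionDLApproach

# Prepare the assertion training dataframe from an NLP Lab (ALAB) export
assertion_df = alab.get_assertion_data(
    spark=spark,
    input_json_path="alab_annotations.json",  # illustrative export file
    assertion_labels=["present", "absent", "possible"],
    relevant_ner_labels=["PROBLEM"],
    included_task_ids=[1, 2, 3],              # use only these tasks
    seed=42                                   # reproducible sampling
)

# Train an assertion model using the new chunk character indices
assertion_trainer = AssertionDLApproach()\
    .setLabelCol("assertion_label")\
    .setInputCols(["sentence", "chunk", "embeddings"])\
    .setOutputCol("assertion")\
    .setStartCol("char_begin")\
    .setEndCol("char_end")
```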
Added ner_source in the ChunkConverter Metadata
We added ner_source to the metadata of the ChunkConverter output. In this way, the sources of the chunks can be traced when multiple components in the same pipeline produce the same NER label.
Example:
...
age_contextual_parser = ContextualParserApproach() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("age_cp") \
.setJsonPath("age.json") \
.setCaseSensitive(False) \
.setPrefixAndSuffixMatch(False)
chunks_age = ChunkConverter()\
.setInputCols("age_cp")\
.setOutputCol("age_chunk")
...
sample_text = """The patient is a 28 years old female with a history of gestational diabetes mellitus was diagnosed in April 2002 in County Baptist Hospital ."""
Result:
[Annotation(chunk, 17, 18, 28, {'tokenIndex': '4', 'entity': 'Age', 'field': 'Age', 'ner_source': 'age_chunk', 'chunk': '0', 'normalized': '', 'sentence': '0', 'confidenceValue': '0.74'})]
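The new metadata is most useful when chunks from several components are merged downstream; a hedged sketch, assuming an additional MedicalNerModel/NerConverterInternal pair in the same pipeline that produces an ner_chunk column alongside the age_chunk column above:

```python
from sparknlp_jsl.annotator import ChunkMergeApproach

# Merge model-based and rule-based chunks; each merged chunk keeps its
# ner_source metadata, so overlapping labels can be traced back to the
# component that produced them.
chunk_merger = ChunkMergeApproach()\
    .setInputCols(["ner_chunk", "age_chunk"])\
    .setOutputCol("merged_chunks")
```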
Core Improvements and Bug Fixes
- Added chunk confidence scores in the RelationExtractionModel metadata
- Added confidence scores in the DocumentLogRegClassifierApproach metadata
- Fixed non-deterministic Relation Extraction DL models (30+ models updated in the Models Hub)
- Fixed PretrainedPipelines that were incompatible with PySpark v3.2.x and v3.3.x
- Fixed the ZIP label issue in faker mode with the setZipCodeTag parameter in Deidentification
- Fixed obfuscated numbers so that they have the same number of characters as the original ones
- Fixed name obfuscation hashes in Deidentification for the Romanian language
- Fixed the LightPipeline validation parameter for internal annotators
- LightPipeline support for GenericClassifier (FeaturesAssembler)
New and Updated Notebooks
- New Clinical Text Classification with Spark_NLP notebook shows how to classify medical text with ClassifierDL, MultiClassifierDL, GenericClassifier, and DocumentLogRegClassifier
- New Clinical Text Classification with DocumentMLClassifier notebook shows how to classify medical text with DocumentMLClassifier
- Updated ALAB Notebook with the changes in get_assertion_data() and the new get_IAA_metrics() method
New and Updated Demos
- SOCIAL DETERMINANT demo
9 New Clinical Models and Pipelines Added & Updated in Total
- ner_oncology_anatomy_general_healthcare
- ner_oncology_biomarker_healthcare
- ner_oncology_unspecific_posology_healthcare
- ner_eu_clinical_case
- genericclassifier_sdoh_economics_binary_sbiobert_cased_mli
- genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli
- genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli
- genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli
- genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli
For all Spark NLP for Healthcare models, please check: Models Hub Page