4.2.8
Highlights
- 4 new clinical named entity recognition models (3 oncology, 1 other)
- 5 new Social Determinant of Health text classification models
- New `DocumentMLClassifierApproach` annotator for training text classification models with SVM and Logistic Regression using TfIdf
- New `Resolution2Chunk` annotator to map entity resolver outputs (terminology codes) to other clinical terminologies
- New `DocMapperModel` annotator that allows using any mapper model with `DOCUMENT` type input
- Option to return `Deidentification` output as a single document
- Inter-Annotator Agreement (IAA) metrics module that works seamlessly with NLP Lab
- Assertion dataset preparation module now supports chunk start and end indices, rather than token indices
- Added `ner_source` in the `ChunkConverter` metadata
- Core improvements and bug fixes
  - Added chunk confidence score in the `RelationExtractionModel` metadata
  - Added confidence score in the `DocumentLogRegClassifierApproach` metadata
  - Fixed non-deterministic Relation Extraction DL models (30+ models updated in the Models Hub)
  - Fixed PretrainedPipelines that were incompatible with PySpark v3.2.x and v3.3.x
  - Fixed `ZIP` label issue in `faker` mode with the `setZipCodeTag` parameter in `Deidentification`
  - Fixed obfuscated numbers so they have the same number of characters as the original ones
  - Fixed name obfuscation hashes in `Deidentification` for the Romanian language
  - Fixed LightPipeline validation parameter for internal annotators
  - LightPipeline support for `GenericClassifier` (`FeaturesAssembler`)
- New and updated notebooks
- New and updated demos
  - SOCIAL DETERMINANT demo
- 9 new clinical models and pipelines added & updated in total
4 New Clinical Named Entity Recognition Models (3 Oncology, 1 Other)
- We are releasing 3 new oncology NER models that were trained with the `embeddings_healthcare_100d` embeddings model.
model name | description | predicted entities |
---|---|---|
ner_oncology_anatomy_general_healthcare | Extracts anatomical entities using an unspecific label | Anatomical_Site Direction |
ner_oncology_biomarker_healthcare | Extracts mentions of biomarkers and biomarker results in oncological texts. | Biomarker_Result Biomarker |
ner_oncology_unspecific_posology_healthcare | Extracts mentions of treatments and posology information using unspecific labels (low granularity). | Posology_Information Cancer_Therapy |
Example:
...
word_embeddings = WordEmbeddingsModel()\
.pretrained("embeddings_healthcare_100d", "en", "clinical/models")\
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner = MedicalNerModel\
.pretrained("ner_oncology_anatomy_general_healthcare", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
text = "The patient presented a mass in her left breast, and a possible metastasis in her lungs and in her liver."
Result:
+------------------+----------------+
|chunk |ner_label |
+------------------+----------------+
|left |Direction |
|breast |Anatomical_Site |
|lungs |Anatomical_Site |
|liver |Anatomical_Site |
+------------------+----------------+
- We are also releasing a new clinical NER model, `ner_eu_clinical_case`, trained on data provided by the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives.
Example:
...
ner = MedicalNerModel.pretrained('ner_eu_clinical_case', "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
text = """A 3-year-old boy with autistic disorder on hospital of pediatric ward A at university hospital. He has no family history of illness or autistic spectrum disorder."""
Result:
+------------------------------+------------------+
|chunk |ner_label |
+------------------------------+------------------+
|A 3-year-old boy |patient |
|autistic disorder |clinical_condition|
|He |patient |
|illness |clinical_event |
|autistic spectrum disorder |clinical_condition|
+------------------------------+------------------+
5 New Social Determinant of Health Text Classification Models
We are releasing 5 new models that can be used in Social Determinant of Health (SDOH) related classification tasks.
model name | description | predicted entities |
---|---|---|
genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli | This model detects alcohol use in clinical notes and was trained with the GenericClassifierApproach annotator. | Present, Past, Never, None |
genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli | This model detects alcohol use in clinical notes and was trained with the GenericClassifierApproach annotator. | Present, Never, None |
genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli | This model detects tobacco use in clinical notes and was trained with the GenericClassifierApproach annotator. | Present, Past, Never, None |
genericclassifier_sdoh_economics_binary_sbiobert_cased_mli | This model classifies clinical documents according to socioeconomic status and was trained with the GenericClassifierApproach annotator. | True, False |
genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli | This model detects substance use in clinical notes and was trained with the GenericClassifierApproach annotator. | Present, None |
Example:
...
features_asm = FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier_tobacco = GenericClassifierModel.pretrained("genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("class_tobacco")
generic_classifier_alcohol = GenericClassifierModel.pretrained("genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("class_alcohol")
text = ["Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. He uses alcohol and cigarettes",
"The patient quit smoking approximately two years ago with an approximately a 40 pack year history, mostly cigar use.",
"The patient denies any history of smoking or alcohol abuse. She lives with her one daughter.",
"She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not reported by patient, but there is apparently a history of alochol abuse."
]
Result:
+----------------------------------------------------------------------------------------------------+---------+---------+
| text| tobacco| alcohol|
+----------------------------------------------------------------------------------------------------+---------+---------+
|Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 2...|[Present]|[Present]|
|The patient quit smoking approximately two years ago with an approximately a 40 pack year history...| [Past]| [None]|
| The patient denies any history of smoking or alcohol abuse. She lives with her one daughter.| [Never]| [Never]|
|She was previously employed as a hairdresser, though says she hasnt worked in 4 years. Not report...| [None]| [Past]|
+----------------------------------------------------------------------------------------------------+---------+---------+
New DocumentMLClassifierApproach Annotator For Training Text Classification Models Using SVM And Logistic Regression With TfIdf
We have a new `DocumentMLClassifierApproach` annotator that can be used for training text classification models with Logistic Regression and SVM algorithms. The training data requires only "text" and "label" columns, and the trained model will be a `DocumentMLClassifierModel()`.
Input types: TOKEN
Output type: CATEGORY
Parameters | Description |
---|---|
labels | Array to output the label in the original form. |
labelCol | Column with the value result we are trying to predict. |
maxIter | Maximum number of iterations. |
tol | Convergence tolerance after each iteration. |
fitIntercept | Whether to fit an intercept term (default is true). |
maxTokenNgram | The maximum number of tokens for n-grams. |
minTokenNgram | The minimum number of tokens for n-grams. |
vectorizationModelPath | Specify the vectorization model if it has already been trained. |
classificationModelPath | Specify the classification model if it has already been trained. |
classificationModelClass | Specify the Spark ML classification class; possible values are `logreg` and `svm`. |
Example:
...
classifier_svm = DocumentMLClassifierApproach() \
.setInputCols("token") \
.setLabelCol("category") \
.setOutputCol("prediction") \
.setMaxTokenNgram(1)\
.setClassificationModelClass("svm") #or "logreg"
model_svm = Pipeline(stages=[document, token, classifier_svm]).fit(trainingData)
text = [
["This 1-year-old child had a gastrostomy placed due to feeding difficulties."],
["He is a pleasant young man who has a diagnosis of bulbar cerebral palsy and hypotonia."],
["The patient is a 45-year-old female whose symptoms are pain in the left shoulder and some neck pain."],
["The patient is a 61-year-old female with history of recurrent uroseptic stones."]
]
Result:
+----------------------------------------------------------------------------------------------------+----------------+
|text |prediction |
+----------------------------------------------------------------------------------------------------+----------------+
|He is a pleasant young man who has a diagnosis of bulbar cerebral palsy and hypotonia. |Neurology |
|This 1-year-old child had a gastrostomy placed due to feeding difficulties. |Gastroenterology|
|The patient is a 61-year-old female with history of recurrent uroseptic stones. |Urology |
|The patient is a 45-year-old female whose symptoms are pain in the left shoulder and some neck pain.|Orthopedic |
+----------------------------------------------------------------------------------------------------+----------------+
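The transform step that produces the table above is not shown in the snippet; here is a minimal sketch of how such predictions could be obtained from the fitted `model_svm` (the DataFrame construction and column selection are illustrative, not part of the release example):

# Wrap the sample sentences in a DataFrame with a "text" column
# and run the fitted pipeline over it.
test_df = spark.createDataFrame(text, ["text"])

predictions = model_svm.transform(test_df)
predictions.select("text", "prediction.result").show(truncate=False)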
Option To Return Deidentification Output As a Single Document
We can now return `Deidentification()` output as a single document by setting the new `setOutputAsDocument` parameter to `True`. If it is `False`, the output will be a list of sentences, as before.
Example:
deid_obfuscated = DeIdentification()\
.setInputCols(["sentence", "token", "ner_chunk_subentity"]) \
.setOutputCol("obfuscated") \
.setMode("obfuscate")\
.setObfuscateDate(True)\
.setObfuscateRefFile('obfuscate.txt')\
.setObfuscateRefSource("file")\
.setUnnormalizedDateMode("obfuscate")\
.setOutputAsDocument(True) # or False for sentence level result
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR # 7194334 Date : 01/13/93 . Patient : Oliveira, 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street
'''
Result of .setOutputAsDocument(True):
'obfuscated': ['Record date : 2093-01-14 , Beer-Karge , M.D . , Name : Hasan Jacobi Jäckel MR # <MEDICALRECORD> Date : 01-31-1991 . Patient : Herr Anselm Trüb, 51 years-old , Record date : 2080-01-08 . Klinik St. Hedwig . <MEDICALRECORD> Keats Street']
Result of .setOutputAsDocument(False):
'obfuscated': ['Record date : 2093-02-19 , Kaul , M.D . , Name : Frauke Oestrovsky MR # <MEDICALRECORD> Date : 05-08-1971 .',
'Patient : Lars Bloch, 33 years-old , Record date : 2079-11-11 .',
'University Hospital of Düsseldorf . <MEDICALRECORD> Keats Street']
New Resolution2Chunk Annotator To Map Entity Resolver Outputs (terminology codes) To Other Clinical Terminologies
We have a new `Resolution2Chunk` annotator that maps entity resolver outputs to other clinical terminologies.
Example:
icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN")
resolver2chunk = Resolution2Chunk()\
.setInputCols(["icd10cm_code"]) \
.setOutputCol("resolver2chunk")
chunkerMapper = ChunkMapperModel.pretrained("icd10cm_snomed_mapper", "en", "clinical/models")\
.setInputCols(["resolver2chunk"])\
.setOutputCol("mappings")\
.setRels(["snomed_code"])
sample_text = """Diabetes Mellitus"""
Result:
+-----------------+-----------------+------------+-----------+
|text |ner_chunk |icd10cm_code|snomed_code|
+-----------------+-----------------+------------+-----------+
|Diabetes Mellitus|Diabetes Mellitus|E109 |170756003 |
+-----------------+-----------------+------------+-----------+
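The example above starts at the resolver stage; the stages that produce the `sentence_embeddings` column are not shown. Below is a minimal sketch of one way to feed it, assuming the term is resolved directly from the input text (as the matching `text` and `ner_chunk` values in the result suggest); the column names and the `sbiobert_base_cased_mli` embeddings choice are assumptions, not part of the example:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings

# Assumed upstream stages: the raw term is treated as a chunk-level document
# and embedded with the sentence embeddings expected by the resolver.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("sentence_embeddings")

resolver_pipeline = Pipeline(stages=[
    documentAssembler, sbert_embedder, icd_resolver, resolver2chunk, chunkerMapper])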
New DocMapperModel Annotator Allows Using Any Mapper Model With DOCUMENT Type
Any `ChunkMapperModel` can be used with this new `DocMapperModel` annotator. As its name suggests, it maps short strings coming directly from `DocumentAssembler`, without needing any intermediate annotator to convert the strings into the `CHUNK` type that `ChunkMapperModel` expects.
Example:
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
model = DocMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
.setInputCols("document")\
.setOutputCol("mappings")
sample_text = "ZYVOX"
Result:
| Brand_Name | Strenth_NDC |
|:-----------|:--------------------------|
| ZYVOX | 600 mg/300mL \| 0009-4992 |
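A minimal sketch of how the two stages above could be run end to end; the pipeline wiring and LightPipeline call are standard Spark NLP usage added here for illustration, not part of the release example:

from pyspark.ml import Pipeline
from sparknlp.base import LightPipeline

# Two-stage pipeline: DocumentAssembler output feeds DocMapperModel directly,
# with no chunk-conversion annotator in between.
mapper_pipeline = Pipeline(stages=[documentAssembler, model])

empty_df = spark.createDataFrame([[""]]).toDF("text")
light_model = LightPipeline(mapper_pipeline.fit(empty_df))

# Map the brand name to its NDC information
result = light_model.fullAnnotate(sample_text)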
Inter-Annotator Agreement (IAA) metrics module that works with NLP Lab seamlessly
We added a new `get_IAA_metrics()` method to the ALAB module. This method allows you to compare and evaluate the annotations in the seed corpus, where all annotators annotate the same documents at the beginning of an annotation project. It returns all the results as CSV files. Here are the parameters:
- `spark`: SparkSession.
- `conll_dir` (str): Path to the folder that the CoNLL files are in.
- `annotator_names` (list): List of annotator names.
- `set_ref_annotator` (str): Reference annotator name. If present, all comparisons are made with respect to it; if it is `None`, all annotators are compared with each other. Default is `None`.
- `return_NerDLMetrics` (boolean): If `True`, we get the `full_chunk` and `partial_chunk_per_token` IAA metrics by using NerDLMetrics. If `False`, we get the chunk-based metrics using the `evaluate` method of the `training_log_parser` module and the token-based metrics using classification reports, then write the results to the "eval_metric_files" folder. Default is `False`.
- `save_dir` (str): Path to save the token-based results dataframes. Default is "results_token_based".
For more details and examples, please check ALAB Notebook.
Example:
alab.get_IAA_metrics(spark, conll_dir = path_to_conll_folder, annotator_names = ["annotator_1","annotator_2","annotator_3","annotator_4"], set_ref_annotator = "annotator_1", return_NerDLMetrics = False, save_dir = "./token_based_results")
Assertion dataset preparation module now supports chunk start and end indices, rather than token indices.
Here are the new features in `get_assertion_data()`:
- It now returns the `char_begin` and `char_end` indices of the chunks. These columns can be used in the `AssertionDLApproach()` annotator instead of the `token_begin` and `token_end` columns for training an Assertion Status Detection model (see the sketch after this list).
- Added the `included_task_ids` parameter that allows you to prepare the assertion model training dataframe with only the included tasks. Default is `None`.
- Added the `seed` parameter that allows you to get the same training dataframe each time when you set `unannotated_label_strategy`. Default is `None`.
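A minimal sketch of how these new options could be used; the call below assumes the existing `get_assertion_data()` interface, and every argument other than `included_task_ids` and `seed` (including the file path and label values) is illustrative rather than taken from this release:

from sparknlp_jsl.annotator import AssertionDLApproach

# Hypothetical call using the new parameters; argument names other than
# included_task_ids and seed are illustrative placeholders.
assertion_df = alab.get_assertion_data(
    spark=spark,
    input_json_path="./exported_annotations.json",  # placeholder path
    assertion_labels=["Absent", "Past"],            # placeholder labels
    relevant_ner_labels=["PROBLEM", "TREATMENT"],   # placeholder labels
    included_task_ids=[1, 2, 3],                    # new: use only these tasks
    seed=42                                         # new: reproducible dataframe when unannotated_label_strategy is set
)

# The returned dataframe now also contains char_begin / char_end columns,
# which can be fed to AssertionDLApproach instead of token indices.
assertion_trainer = AssertionDLApproach()\
    .setLabelCol("label")\
    .setInputCols(["document", "chunk", "embeddings"])\
    .setOutputCol("assertion")\
    .setStartCol("char_begin")\
    .setEndCol("char_end")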
For more details and examples, please check ALAB Notebook.
Added ner_source in the ChunkConverter Metadata
We added `ner_source` to the metadata of the `ChunkConverter` output. In this way, the sources of the chunks can be seen when multiple components produce the same NER label in the same pipeline.
Example:
...
age_contextual_parser = ContextualParserApproach() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("age_cp") \
.setJsonPath("age.json") \
.setCaseSensitive(False) \
.setPrefixAndSuffixMatch(False)
chunks_age = ChunkConverter()\
.setInputCols("age_cp")\
.setOutputCol("age_chunk")
...
sample_text = """The patient is a 28 years old female with a history of gestational diabetes mellitus was diagnosed in April 2002 in County Baptist Hospital ."""
Result:
[Annotation(chunk, 17, 18, 28, {'tokenIndex': '4', 'entity': 'Age', 'field': 'Age', 'ner_source': 'age_chunk', 'chunk': '0', 'normalized': '', 'sentence': '0', 'confidenceValue': '0.74'})]
Core Improvements and Bug Fixes
- Added chunk confidence score in the `RelationExtractionModel` metadata
- Added confidence score in the `DocumentLogRegClassifierApproach` metadata
- Fixed non-deterministic Relation Extraction DL models (30+ models updated in the Models Hub)
- Fixed PretrainedPipelines that were incompatible with PySpark v3.2.x and v3.3.x
- Fixed `ZIP` label issue in `faker` mode with the `setZipCodeTag` parameter in `Deidentification`
- Fixed obfuscated numbers so they have the same number of characters as the original ones
- Fixed name obfuscation hashes in `Deidentification` for the Romanian language
- Fixed LightPipeline validation parameter for internal annotators
- LightPipeline support for `GenericClassifier` (`FeaturesAssembler`)
New and Updated Notebooks
- New Clinical Text Classification with Spark_NLP Notebook shows how to classify medical text with ClassifierDL, MultiClassifierDL, GenericClassifier, and DocumentLogRegClassifier
- New Clinical Text Classification with DocumentMLClassifier Notebook shows how to classify medical text with DocumentMLClassifier
- Updated ALAB Notebook with the changes in `get_assertion_data()` and the new `get_IAA_metrics()` method
New and Updated Demos
- SOCIAL DETERMINANT demo
9 New Clinical Models and Pipelines Added & Updated in Total
ner_oncology_anatomy_general_healthcare
ner_oncology_biomarker_healthcare
ner_oncology_unspecific_posology_healthcare
ner_eu_clinical_case
genericclassifier_sdoh_economics_binary_sbiobert_cased_mli
genericclassifier_sdoh_substance_usage_binary_sbiobert_cased_mli
genericclassifier_sdoh_tobacco_usage_sbiobert_cased_mli
genericclassifier_sdoh_alcohol_usage_sbiobert_cased_mli
genericclassifier_sdoh_alcohol_usage_binary_sbiobert_cased_mli
For all Spark NLP for Healthcare models, please check: Models Hub Page