4.2.3
Highlights
- 3 new chunk mapper models for mapping drugs and diseases from the KEGG database, as well as mapping abbreviations to their categories
- New utility & helper Relation Extraction modules to handle preprocessing
- New utility & helper OCR modules to handle annotations
- New utility & helper NER log parser
- Added flexibility to chunk merger prioritization
- Core improvements and bug fixes
- New and updated notebooks
- 3 new clinical models and pipelines added & updated in total
3 New Chunk Mapper Models for Mapping Drugs and Diseases from the KEGG Database as well as Mapping Abbreviations to Their Categories
`kegg_disease_mapper`: This pretrained model maps diseases to their corresponding `category`, `description`, `icd10_code`, `icd11_code`, `mesh_code`, and hierarchical `brite_code`. This model was trained with data from the KEGG database.
Example:
chunkerMapper = ChunkMapperModel.pretrained("kegg_disease_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setRels(["description", "category", "icd10_code", "icd11_code", "mesh_code", "brite_code"])
text= "A 55-year-old female with a history of myopia, kniest dysplasia and prostate cancer. She was on glipizide , and dapagliflozin for congenital nephrogenic diabetes insipidus."
Result:
+-----------------------------------------+--------------------------------------------------+-----------------------+----------+----------+---------+-----------------------+
| ner_chunk| description| category|icd10_code|icd11_code|mesh_code| brite_code|
+-----------------------------------------+--------------------------------------------------+-----------------------+----------+----------+---------+-----------------------+
| myopia|Myopia is the most common ocular disorder world...| Nervous system disease| H52.1| 9D00.0| D009216| 08402,08403|
| kniest dysplasia|Kniest dysplasia is an autosomal dominant chond...|Congenital malformation| Q77.7| LD24.3| C537207| 08402,08403|
| prostate cancer|Prostate cancer constitutes a major health prob...| Cancer| C61| 2C82| NONE|08402,08403,08442,08441|
|congenital nephrogenic diabetes insipidus|Nephrogenic diabetes insipidus (NDI) is charact...| Urinary system disease| N25.1| GB90.4A| D018500| 08402,08403|
+-----------------------------------------+--------------------------------------------------+-----------------------+----------+----------+---------+-----------------------+
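For context, here is a minimal sketch of the upstream stages that could produce the `ner_chunk` column consumed by the mapper above. The `ner_clinical` NER model, `embeddings_clinical` embeddings, and `sentence_detector_dl_healthcare` sentence detector are illustrative choices rather than part of the original example; any NER model that extracts the relevant disease chunks can be used.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")
ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")
# the mapper defined above is simply appended as the last stage
pipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer,
                            word_embeddings, ner, ner_converter, chunkerMapper])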
`kegg_drug_mapper`: This pretrained model maps drugs to their corresponding `efficacy` and `molecular_weight`, as well as `CAS`, `PubChem`, `ChEBI`, `LigandBox`, `NIKKAJI`, and `PDB-CCD` codes. This model was trained with data from the KEGG database.
Example:
chunkerMapper = ChunkMapperModel.pretrained("kegg_drug_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setRels(["efficacy", "molecular_weight", "CAS", "PubChem", "ChEBI", "LigandBox", "NIKKAJI", "PDB-CCD"])
text= "She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin"
Result:
+-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+
| ner_chunk| efficacy|molecular_weight| CAS| PubChem| ChEBI|LigandBox| NIKKAJI|PDB-CCD|
+-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+
| OxyContin| Analgesic (narcotic), Opioid receptor agonist| 351.8246| 124-90-3| 7847912.0| 7859.0| D00847|J281.239H| NONE|
| folic acid|Anti-anemic, Hematopoietic, Supplement (folic a...| 441.3975| 59-30-3| 7847138.0|27470.0| D00070| J1.392G| FOL|
|levothyroxine| Replenisher (thyroid hormone)| 776.87| 51-48-9|9.6024815E7|18332.0| D08125| J4.118A| T44|
| Norvasc|Antihypertensive, Vasodilator, Calcium channel ...| 408.8759|88150-42-9|5.1091781E7| 2668.0| D07450| J33.383B| NONE|
| aspirin|Analgesic, Anti-inflammatory, Antipyretic, Anti...| 180.1574| 50-78-2| 7847177.0|15365.0| D00109| J2.300K| AIN|
| Neurontin| Anticonvulsant, Antiepileptic| 171.2368|60142-96-3| 7847398.0|42797.0| D00332| J39.388F| GBN|
+-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+
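For quick ad-hoc testing of the mapper output, a `LightPipeline` can be used. This is a minimal sketch; `pipeline_model` is assumed to be a fitted `PipelineModel` whose stages end with the `kegg_drug_mapper` stage configured above, and the `relation` metadata key is an assumption about the mapper's output metadata.
from sparknlp.base import LightPipeline
light_model = LightPipeline(pipeline_model)
result = light_model.fullAnnotate("She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin")[0]
for mapping in result["mappings"]:
    # print the mapped value together with the relation it is assumed to belong to
    print(mapping.result, mapping.metadata.get("relation"))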
`abbreviation_category_mapper`: This pretrained model maps abbreviations and acronyms of medical regulatory activities to their definitions and categories. Predicted categories: `general`, `problem`, `test`, `treatment`, `medical_condition`, `clinical_dept`, `drug`, `nursing`, `internal_organ_or_component`, `hospital_unit`, `drug_frequency`, `employment`, `procedure`.
Example:
chunkerMapper = ChunkMapperModel.pretrained("abbreviation_category_mapper", "en", "clinical/models")\
.setInputCols(["abbr_ner_chunk"])\
.setOutputCol("mappings")\
.setRels(["definition", "category"])\
text = ["""Gravid with estimated fetal weight of 6-6/12 pounds.
LABORATORY DATA: Laboratory tests include a CBC which is normal.
VDRL: Nonreactive
HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]
Result:
| chunk | category | definition |
|:--------|:------------------|:---------------------------------------|
| CBC | general | complete blood count |
| VDRL | clinical_dept | Venereal Disease Research Laboratories |
| HIV | medical_condition | Human immunodeficiency virus |
New Utility & Helper Relation Extraction Modules to Handle Preprocess
This process is standard and training column should be same in all RE trainings. We can simplify this process with helper class. With proposed changes it can be done as follows:
Example:
from sparknlp_jsl.training import REDatasetHelper
# map entity columns to dataset columns
column_map = {
"begin1": "firstCharEnt1",
"end1": "lastCharEnt1",
"begin2": "firstCharEnt2",
"end2": "lastCharEnt2",
"chunk1": "chunk1",
"chunk2": "chunk2",
"label1": "label1",
"label2": "label2"
}
# apply preprocess function to dataframe
data = REDatasetHelper(data).create_annotation_column(
column_map,
ner_column_name="train_ner_chunks" # optional, default train_ner_chunks
)
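The DataFrame returned above (now carrying the `train_ner_chunks` annotation column) can be used directly as the RE training column. The following is a minimal, hedged sketch of the downstream trainer configuration; the label column name `rel`, the hyperparameter values, and the upstream stages producing `embeddings`, `pos_tags`, and `dependencies` are assumptions rather than part of this release.
from sparknlp_jsl.annotator import RelationExtractionApproach
reApproach = RelationExtractionApproach()\
    .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setLabelColumn("rel")\
    .setEpochsNumber(70)\
    .setBatchSize(200)\
    .setFixImbalance(True)
    # a TensorFlow graph is normally provided via setModelFile(); the path is omitted here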
New Utility & Helper OCR Modules to Handle Annotations
This module generates an annotated PDF file from input PDF files.
`style`: The PDF processing style. It has 3 options:
- `black_band`: Black bands over the chunks detected by the NER pipeline.
- `bounding_box`: Colorful bounding boxes around the chunks detected by the NER pipeline. Each color represents a different NER label.
- `highlight`: Colorful highlights over the chunks detected by the NER pipeline. Each color represents a different NER label.
You can check the Spark OCR Utility Module notebook for more examples.
Example:
from sparknlp_jsl.utils.ocr_nlp_processor import ocr_entity_processor
path='/*.pdf'
box = "bounding_box"
ocr_entity_processor(spark=spark, file_path=path, ner_pipeline=nlp_model, chunk_col="merged_chunk", black_list=["AGE", "DATE", "PATIENT"],
                     style=box, save_dir="colored_box", label=True, label_color="red", color_chart_path="label_colors.png", display_result=True)
box = "highlight"
ocr_entity_processor(spark=spark, file_path=path, ner_pipeline=nlp_model, chunk_col="merged_chunk", black_list=["AGE", "DATE", "PATIENT"],
                     style=box, save_dir="colored_box", label=True, label_color="red", color_chart_path="label_colors.png", display_result=True)
box = "black_band"
ocr_entity_processor(spark=spark, file_path=path, ner_pipeline=nlp_model, chunk_col="merged_chunk",
                     style=box, save_dir="black_band", label=True, label_color="red", display_result=True)
Results:
- Bounding box with labels and black list
- Highlight with labels and black_list
- black_band with labels
New Utility & Helper NER Log Parser
`ner_utils`: This new module is used after NER training to calculate chunk-based metrics and plot training logs.
Example:
nerTagger = NerDLApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
...
.setOutputLogsPath('ner_logs')
ner_pipeline = Pipeline(stages=[glove_embeddings,
graph_builder,
nerTagger])
ner_model = ner_pipeline.fit(training_data)
`evaluate`: If verbose, returns overall performance as well as performance per chunk type; otherwise, simply returns overall precision, recall, and F1 scores.
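The `preds_df` used below can be produced by transforming a held-out, pre-labeled test split with the fitted model and flattening the token-level tags into a pandas DataFrame. A minimal sketch, where `test_data` (containing a `label` column, e.g. read with `CoNLL()`) is an assumption:
import pandas as pd
preds = ner_model.transform(test_data)
# flatten the ground-truth ("label") and predicted ("ner") tag arrays token by token
rows = preds.select("label.result", "ner.result").collect()
preds_df = pd.DataFrame({
    "ground_truth": [tag for r in rows for tag in r[0]],
    "prediction":   [tag for r in rows for tag in r[1]],
})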
Example:
from sparknlp_jsl.utils.ner_utils import evaluate
metrics = evaluate(preds_df['ground_truth'].values, preds_df['prediction'].values)
Result:
processed 14133 tokens with 1758 phrases; found: 1779 phrases; correct: 1475.
accuracy: 83.45%; (non-O)
accuracy: 96.67%; precision: 82.91%; recall: 83.90%; FB1: 83.40
LOC: precision: 91.41%; recall: 85.69%; FB1: 88.46 524
MISC: precision: 78.15%; recall: 62.11%; FB1: 69.21 151
ORG: precision: 61.86%; recall: 74.93%; FB1: 67.77 430
PER: precision: 90.80%; recall: 93.58%; FB1: 92.17 674
`loss_plot`: Plots the figure of loss vs. epochs.
Example:
from sparknlp_jsl.utils.ner_utils import loss_plot
loss_plot('./ner_logs/'+log_files[0])
Results: (loss vs. epochs plot)
`get_charts`: Plots the figures of metrics (precision, recall, F1) vs. epochs.
Example:
from sparknlp_jsl.utils.ner_utils import get_charts
get_charts('./ner_logs/'+log_files[0])
Results: (metrics vs. epochs plots)
Adding Flexibility to Chunk Merger Prioritization
`orderingFeatures`: Array of strings specifying the ordering features to use for overlapping entities. Possible values are `ChunkBegin`, `ChunkLength`, `ChunkPrecedence`, and `ChunkConfidence`.
`selectionStrategy`: Whether to select annotations sequentially based on annotation order (`Sequential`) or using another available strategy; currently only `DiverseLonger` is also available.
`defaultConfidence`: When the `ChunkConfidence` ordering feature is included and a given annotation does not have any confidence, the value of this param will be used.
`chunkPrecedence`: When the `ChunkPrecedence` ordering feature is used, this param contains the comma-separated metadata fields that drive prioritization of overlapping annotations. When used by itself (empty `chunkPrecedenceValuePrioritization`), annotations will be prioritized based on the number of metadata fields present. When used together with the `chunkPrecedenceValuePrioritization` param, it will prioritize based on the order of its values.
`chunkPrecedenceValuePrioritization`: When the `ChunkPrecedence` ordering feature is used, this param contains an array of comma-separated values representing the desired order of prioritization for the values in the metadata fields included in `chunkPrecedence`.
Example:
text = """A 63 years old man presents to the hospital with a history of recurrent infections
that include cellulitis, pneumonias, and upper respiratory tract infections..."""
+-------------------------------------------------------------------------------------+
|ner_deid_chunk |
+-------------------------------------------------------------------------------------+
|[{chunk, 2, 3, 63, {entity -> AGE, sentence -> 0, chunk -> 0, confidence -> 0.9997}}]|
+-------------------------------------------------------------------------------------+
+----------------------------------------------------------------------------------------------------+
|jsl_ner_chunk |
+----------------------------------------------------------------------------------------------------+
|[{chunk, 2, 13, 63 years old, {entity -> Age, sentence -> 0, chunk -> 0, confidence -> 0.85873336}}]|
+----------------------------------------------------------------------------------------------------+
- Merging overlapping chunks by considering their length
If we set the `setOrderingFeatures(["ChunkLength"])` and `setSelectionStrategy("DiverseLonger")` parameters, the longest chunk will be prioritized in case of overlap.
Example:
chunk_merger = ChunkMergeApproach()\
.setInputCols('ner_deid_chunk', "jsl_ner_chunk")\
.setOutputCol('merged_ner_chunk')\
.setOrderingFeatures(["ChunkLength"])\
.setSelectionStrategy("DiverseLonger")
Results:
|begin|end| chunk| entity|
+-----+---+-------------+---------------+
| 2| 13| 63 years old| Age|
| 15| 17| man| Gender|
| 35| 42| hospital| Clinical_Dept|
- Merging overlapping chunks by considering custom values that we set
The `setChunkPrecedence()` parameter contains comma-separated values representing the desired order of prioritization for the values in the metadata fields when `setOrderingFeatures(["ChunkPrecedence"])` is used.
Example:
chunk_merger = ChunkMergeApproach()\
.setInputCols('ner_deid_chunk', "jsl_ner_chunk")\
.setOutputCol('merged_ner_chunk')\
.setMergeOverlapping(True) \
.setOrderingFeatures(["ChunkPrecedence"]) \
.setChunkPrecedence('ner_deid_chunk,AGE') \
# .setChunkPrecedenceValuePrioritization(["ner_deid_chunk,AGE", "jsl_ner_chunk,Age"])
Results:
|begin|end| chunk| entity|
+-----+---+-------------+---------------+
| 2| 3| 63| AGE|
| 15| 17| man| Gender|
| 35| 42| hospital| Clinical_Dept|
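For completeness, a hedged sketch of confidence-based prioritization using the `ChunkConfidence` ordering feature together with `defaultConfidence` described above; the `setDefaultConfidence` setter name is inferred from the parameter name, and the values shown are illustrative.
chunk_merger = ChunkMergeApproach()\
    .setInputCols("ner_deid_chunk", "jsl_ner_chunk")\
    .setOutputCol("merged_ner_chunk")\
    .setMergeOverlapping(True)\
    .setOrderingFeatures(["ChunkConfidence", "ChunkLength"])\
    .setDefaultConfidence(0.50)
    # chunks with higher confidence win; annotations without a confidence value fall back to 0.50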
You can check the NER Chunk Merger notebook for more examples.
Core improvements and bug fixes
- The default value of the AssertionDL `IncludeConfidence()` parameter is now set to `True`
- Fixed NaN outputs in RelationExtraction
- Fixed the `loadSavedModel` method used for importing transformers into Spark NLP
- Fixed the Replacer when the `setUseReplacement(True)` parameter is used
- Added an overall confidence score to `MedicalNerModel` when `setIncludeAllConfidenceScore` is `True`
- Fixed `showAvailableAnnotators` in `InternalResourceDownloader`
New and Updated Notebooks
- New Spark OCR Utility Module notebook to help handle the OCR process.
- Updated Clinical Entity Resolvers notebook with an Assertion Filterer example.
- Updated NER Chunk Merger notebook with a flexible chunk merger prioritization example.
- Updated Clinical Relation Extraction notebook with the new `REDatasetHelper` module.
- Updated ALab Module SparkNLP JSL notebook with new updates.
3 New Clinical Models and Pipelines Added & Updated in Total
- `kegg_disease_mapper`
- `kegg_drug_mapper`
- `abbreviation_category_mapper`
For all Spark NLP for healthcare models, please check: Models Hub Page