3.5.2
Highlights
TFGraphBuilder
annotator to create graphs for training NER, Assertion, Relation Extraction, and Generic Classifier models- Default TF graphs added for
AssertionDLApproach
to let users train models without custom graphs - New functionalities in
ContextualParserApproach
- Printing the list of clinical pretrained models and pipelines with one-liner
- New clinical models
- Clinical NER model (
ner_biomedical_bc2gm
) - Clinical
ChunkMapper
models (abbreviation_mapper
,rxnorm_ndc_mapper
,drug_brandname_ndc_mapper
,rxnorm_action_treatment_mapper
)
- Clinical NER model (
- Bug fixes
- New and updated notebooks
- List of recently updated or added models
TFGraphBuilder
annotator to create graphs for Training NER, Assertion, Relation Extraction, and Generic Classifier Models
We have a new annotator used to create graphs in the model training pipeline. TFGraphBuilder
inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7 ) is available. The graph is stored in the defined folder and loaded by the approach.
You can use this builder with MedicalNerApproach
, RelationExtractionApproach
, AssertionDLApproach
, and GenericClassifierApproach
Example:
graph_folder_path = "./medical_graphs"
med_ner_graph_builder = TFGraphBuilder()\
.setModelName("ner_dl")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setLabelColumn("label")\
.setGraphFile("auto")\
.setHiddenUnitsNumber(20)\
.setGraphFolder(graph_folder_path)
med_ner = MedicalNerApproach() \
...
.setGraphFolder(graph_folder)
medner_pipeline = Pipeline()([
...,
med_ner_graph_builder,
med_ner
])
For more examples, please check TFGraph Builder Notebook.
Default TF graphs added for AssertionDLApproach
to let users train models without custom graphs
We added default TF graphs for the AssertionDLApproach
to let users train assertion models without specifying any custom TF graph.
Default Graph Features:
- Feature Sizes: 100, 200, 768
- Number of Classes: 2, 4, 8
New Functionalities in ContextualParserApproach
- Added
.setOptionalContextRules
parameter that allows to output regex matches regardless of context match (prefix, suffix configuration). - Allows sending a JSON string of the configuration file to
setJsonPath
parameter.
Confidence Value Scenarios:
- When there is regex match only, the confidence value will be 0.5.
- When there are regex and prefix matches together, the confidence value will be > 0.5 depending on the distance between target token and the prefix.
- When there are regex and suffix matches together, the confidence value will be > 0.5 depending on the distance between target token and the suffix.
- When there are regex, prefix, and suffix matches all together, the confidence value will be > than the other scenarios.
Example:
jsonString = {
"entity": "CarId",
"ruleScope": "sentence",
"completeMatchRegex": "false",
"regex": "\\d+",
"prefix": ["red"],
"contextLength": 100
}
with open("jsonString.json", "w") as f:
json.dump(jsonString, f)
contextual_parser = ContextualParserApproach()\
.setInputCols(["sentence", "token"])\
.setOutputCol("entity")\
.setJsonPath("jsonString.json")\
.setCaseSensitive(True)\
.setOptionalContextRules(True)
Printing the List of Clinical Pretrained Models and Pipelines with One-Liner
Now we can check what the clinical model names are of a specific annotator and the names of clinical pretrained pipelines in a language.
- Listing Clinical Model Names:
Example:
from sparknlp_jsl.pretrained import InternalResourceDownloader
InternalResourceDownloader.showPrivateModels("AssertionDLModel")
Results:
+-----------------------------------+------+---------+
| Model | lang | version |
+-----------------------------------+------+---------+
| assertion_ml | en | 2.0.2 |
| assertion_dl | en | 2.0.2 |
| assertion_dl_healthcare | en | 2.7.2 |
| assertion_dl_biobert | en | 2.7.2 |
| assertion_dl | en | 2.7.2 |
| assertion_dl_radiology | en | 2.7.4 |
| assertion_jsl_large | en | 3.1.2 |
| assertion_jsl | en | 3.1.2 |
| assertion_dl_scope_L10R10 | en | 3.4.2 |
| assertion_dl_biobert_scope_L10R10 | en | 3.4.2 |
+-----------------------------------+------+---------+
- Listing Clinical Pretrained Pipelines:
from sparknlp_jsl.pretrained import InternalResourceDownloader
InternalResourceDownloader.showPrivatePipelines("en")
+--------------------------------------------------------+------+---------+
| Pipeline | lang | version |
+--------------------------------------------------------+------+---------+
| clinical_analysis | en | 2.4.0 |
| clinical_ner_assertion | en | 2.4.0 |
| clinical_deidentification | en | 2.4.0 |
| clinical_analysis | en | 2.4.0 |
| explain_clinical_doc_ade | en | 2.7.3 |
| icd10cm_snomed_mapping | en | 2.7.5 |
| recognize_entities_posology | en | 3.0.0 |
| explain_clinical_doc_carp | en | 3.0.0 |
| recognize_entities_posology | en | 3.0.0 |
| explain_clinical_doc_ade | en | 3.0.0 |
| explain_clinical_doc_era | en | 3.0.0 |
| icd10cm_snomed_mapping | en | 3.0.2 |
| snomed_icd10cm_mapping | en | 3.0.2 |
| icd10cm_umls_mapping | en | 3.0.2 |
| snomed_umls_mapping | en | 3.0.2 |
| ... | ... | ... |
+--------------------------------------------------------+------+---------+
New ner_biomedical_bc2gm
NER Model
This model has been trained to extract genes/proteins from a medical text.
See Model Card for more details.
Example :
...
ner = MedicalNerModel.pretrained("ner_biomedical_bc2gm", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
...
text = spark.createDataFrame([["Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections."]]).toDF("text")
result = model.transform(text)
Results :
+-----------+------------+
|chunk |ner_label |
+-----------+------------+
|S-100 |GENE_PROTEIN|
|HMB-45 |GENE_PROTEIN|
|cytokeratin|GENE_PROTEIN|
+-----------+------------+
New Clinical ChunkMapper
Models
We have 4 new ChunkMapper
models and a new Chunk Mapping Notebook for showing their examples.
drug_brandname_ndc_mapper
: This model maps drug brand names to corresponding National Drug Codes (NDC). Product NDCs for each strength are returned in result and metadata.
See Model Card for more details.
Example :
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("chunk")
chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
.setInputCols(["chunk"])\
.setOutputCol("ndc")\
.setRel("Strength_NDC")
model = PipelineModel(stages=[document_assembler,
chunkerMapper])
light_model = LightPipeline(model)
res = light_model.fullAnnotate(["zytiga", "ZYVOX", "ZYTIGA"])
Results :
+-------------+--------------------------+-----------------------------------------------------------+
| Brandname | Strenth_NDC | Other_NDSs |
+-------------+--------------------------+-----------------------------------------------------------+
| zytiga | 500 mg/1 | 57894-195 | ['250 mg/1 | 57894-150'] |
| ZYVOX | 600 mg/300mL | 0009-4992 | ['600 mg/300mL | 66298-7807', '600 mg/300mL | 0009-7807'] |
| ZYTIGA | 500 mg/1 | 57894-195 | ['250 mg/1 | 57894-150'] |
+-------------+--------------------------+-----------------------------------------------------------+
abbreviation_mapper
: This model maps abbreviations and acronyms of medical regulatory activities with their definitions.
See Model Card for details.
Example:
input = ["""Gravid with estimated fetal weight of 6-6/12 pounds.
LABORATORY DATA: Laboratory tests include a CBC which is normal.
HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]
>> output:
+------------+----------------------------+
|Abbreviation|Definition |
+------------+----------------------------+
|CBC |complete blood count |
|HIV |human immunodeficiency virus|
+------------+----------------------------+
rxnorm_action_treatment_mapper
: RxNorm and RxNorm Extension codes with their corresponding action and treatment. Action refers to the function of the drug in various body systems; treatment refers to which disease the drug is used to treat.
See Model Card for more details.
Example:
input = ['Sinequan 150 MG', 'Zonalon 50 mg']
>> output:
+---------------+------------+---------------+
|chunk |rxnorm_code |Action |
+---------------+------------+---------------+
|Sinequan 150 MG|1000067 |Antidepressant |
|Zonalon 50 mg |103971 |Analgesic |
+---------------+------------+---------------+
rxnorm_ndc_mapper
: This pretrained model maps RxNorm and RxNorm Extension codes with corresponding National Drug Codes (NDC).
See Model Card for more details.
Example:
input = ['doxepin hydrochloride 50 MG/ML', 'macadamia nut 100 MG/ML']
>> output:
+------------------------------+------------+------------+
|chunk |rxnorm_code |Product NDC |
+------------------------------+------------+------------+
|doxepin hydrochloride 50 MG/ML|1000091 |00378-8117 |
|macadamia nut 100 MG/ML |212433 |00064-2120 |
+------------------------------+------------+------------+
Bug Fixes
We fixed some issues in DrugNormalizer
, DateNormalizer
and ContextualParserApproach
annotators.
DateNormalizer
: We fixed some relative date issues and alsoDateNormalizer
takes account the Leap years now.DrugNormalizer
: Fixed some formats.ContextualParserApproach
:- Computing the right distance for prefix.
- Extracting the right content for suffix.
- Handling special characters in prefix and suffix.
New and Updated Notebooks
- We prepared Spark NLP for Healthcare 3hr Notebook to cover mostly used components of Spark NLP in ODSC East 2022-3 hours hands-on workshop on ‘Modular Approach to Solve Problems at Scale in Healthcare NLP’. You can also find its Databricks version here.
- New Chunk Mapping Notebook for showing the examples of Chunk Mapper models.
- Updated healthcare tutorial notebooks for Databricks with
sparknlp_jsl
v3.5.1 - We have a new Databricks healthcare tutorials folder in which you can find all Spark NLP for Healthcare Databricks tutorial notebooks.
- Updated Graph Builder Notebook by adding the examples of new
TFGraphBuilder
annotator.
List of Recently Updated or Added Models
sbiobertresolve_rxnorm_action_treatment
ner_biomedical_bc2gm
abbreviation_mapper
rxnorm_ndc_mapper
drug_brandname_ndc_mapper
sbiobertresolve_cpt_procedures_measurements_augmented
sbiobertresolve_icd10cm_slim_billable_hcc
sbiobertresolve_icd10cm_slim_normalized
For all Spark NLP for healthcare models, please check : Models Hub Page
Versions
- 5.5.1
- 5.5.0
- 5.4.1
- 5.4.0
- 5.3.3
- 5.3.2
- 5.3.1
- 5.3.0
- 5.2.1
- 5.2.0
- 5.1.4
- 5.1.3
- 5.1.2
- 5.1.1
- 5.1.0
- 5.0.2
- 5.0.1
- 5.0.0
- 4.4.4
- 4.4.3
- 4.4.2
- 4.4.1
- 4.4.0
- 4.3.2
- 4.3.1
- 4.3.0
- 4.2.8
- 4.2.4
- 4.2.3
- 4.2.2
- 4.2.1
- 4.2.0
- 4.1.0
- 4.0.2
- 4.0.0
- 3.5.3
- 3.5.2
- 3.5.1
- 3.5.0
- 3.4.2
- 3.4.1
- 3.4.0
- 3.3.4
- 3.3.2
- 3.3.1
- 3.3.0
- 3.2.3
- 3.2.2
- 3.2.1
- 3.2.0
- 3.1.3
- 3.1.2
- 3.1.1
- 3.1.0
- 3.0.3
- 3.0.2
- 3.0.1
- 3.0.0
- 2.7.6
- 2.7.5
- 2.7.4
- 2.7.3
- 2.7.2
- 2.7.1
- 2.7.0
- 2.6.2
- 2.6.0
- 2.5.5
- 2.5.3
- 2.5.2
- 2.5.0
- 2.4.6
- 2.4.5
- 2.4.2
- 2.4.1
- 2.4.0