Healthcare NLP v4.0.0 Release Notes

4.0.0

Highlights

8 new chunk mapper models and 9 new pretrained chunk mapper pipelines to convert one medical terminology to another (Snomed to ICD10, RxNorm to UMLS etc.)
2 new medical NER models (ner_clinical_trials_abstracts and ner_pathogen) and pretrained NER pipelines
20 new biomedical NER models based on the LivingNER corpus in 8 languages (English, Spanish, French, Italian, Portuguese, Romanian, Catalan and Galician)
2 new medical NER models for Romanian language (ner_clinical, ner_clinical_bert)
Deidentification support for Romanian language (ner_deid_subentity, ner_deid_subentity_bert and a pretrained deidentification pipeline)
The first public health model: Emotional stress classifier (bert_sequence_classifier_stress)
ResolverMerger annotator to merge the results of ChunkMapperModel and SentenceEntityResolverModel annotators
New Shortest Context Match and Token Index Features in ContextualParserApproach
Prettified relational categories in ZeroShotRelationExtractionModel annotator
Create graphs for open source NerDLApproach with the TFGraphBuilder
Spark NLP for Healthcare library installation with Poetry (dependency management and packaging tool)
Bug fixes
Updated notebooks
List of recently updated or added models (50+ new medical models and pipelines)

8 New Chunk Mapper Models and 9 New Pretrained Chunk Mapper Pipelines to Convert One Medical Terminology to Another (Snomed to ICD10, RxNorm to UMLS etc.)

We are releasing 8 new ChunkMapperModel models and 9 new pretrained pipelines for mapping clinical codes with their corresponding.

Mapper Models:

Mapper Name	Source	Target
snomed_icd10cm_mapper	SNOMED CT	ICD-10-CM
icd10cm_snomed_mapper	ICD-10-CM	SNOMED CT
snomed_icdo_mapper	SNOMED CT	ICD-O
icdo_snomed_mapper	ICD-O	SNOMED CT
rxnorm_umls_mapper	RxNorm	UMLS
icd10cm_umls_mapper	ICD-10-CM	UMLS
mesh_umls_mapper	MeSH	UMLS
snomed_umls_mapper	SNOMED CT	UMLS

Example:

...
snomed_resolver = SentenceEntityResolverModel.pretrained("sbertresolve_snomed_conditions", "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("snomed_code")\
    .setDistanceFunction("EUCLIDEAN")

chunkerMapper = ChunkMapperModel.pretrained("snomed_icd10cm_mapper", "en", "clinical/models")\
    .setInputCols(["snomed_code"])\
    .setOutputCol("icd10cm_mappings")\
    .setRels(["icd10cm_code"])

pipeline = PipelineModel(
    stages = [
        documentAssembler,
        sbert_embedder,
        snomed_resolver,
        chunkerMapper
        ])

light_pipeline= LightPipeline(pipeline)

result = light_pipeline.fullAnnotate("Radiating chest pain")

Results :

|    | ner_chunk            |   snomed_code | icd10cm_mappings   |
|---:|:---------------------|--------------:|:-------------------|
|  0 | Radiating chest pain |      10000006 | R07.9              |

Pretrained Pipelines:

Pipeline Name	Source	Target
icd10cm_snomed_mapping	ICD-10-CM	SNOMED CT
snomed_icd10cm_mapping	SNOMED CT	ICD-10-CM
icdo_snomed_mapping	ICD-O	SNOMED CT
snomed_icdo_mapping	SNOMED CT	ICD-O
rxnorm_ndc_mapping	RxNorm	NDC
icd10cm_umls_mapping	ICD-10-CM	UMLS
mesh_umls_mapping	MeSH	UMLS
rxnorm_umls_mapping	RxNorm	UMLS
snomed_umls_mapping	SOMED CT	UMLS

Example:

from sparknlp.pretrained import PretrainedPipeline

pipeline= PretrainedPipeline("rxnorm_umls_mapping", "en", "clinical/models")
result= pipeline.annotate("1161611 315677")

Results :

{'document': ['1161611 315677'],
 'rxnorm_code': ['1161611', '315677'],
 'umls_code': ['C3215948', 'C0984912']}

2 New Medical NER Models (`ner_clinical_trials_abstracts` and `ner_pathogene`) and Pretrained NER Pipelines

ner_clinical_trials_abstracts: This model can extract concepts related to clinical trial design, diseases, drugs, population, statistics and publication. It can detect Age, AllocationRatio, Author, BioAndMedicalUnit, CTAnalysisApproach, CTDesign, Confidence, Country, DisorderOrSyndrome, DoseValue, Drug, DrugTime, Duration, Journal, NumberPatients, PMID, PValue, PercentagePatients, PublicationYear, TimePoint, Value entities.

See Model Hub Page for details.

Example :

...
clinical_ner = MedicalNerModel.pretrained("ner_clinical_trials_abstracts", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner_tags")
...

sample_text = "A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes."

bert_token_classifier_ner_clinical_trials_abstracts: This model is the BERT-based version of ner_clinical_trials_abstracts model and it can detect Age, AllocationRatio, Author, BioAndMedicalUnit, CTAnalysisApproach, CTDesign, Confidence, Country, DisorderOrSyndrome, DoseValue, Drug, DrugTime, Duration, Journal, NumberPatients, PMID, PValue, PercentagePatients, PublicationYear, TimePoint, Value entities.

See Model Hub Page for details.

Example :

...
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical_trials_abstracts", "en", "clinical/models")\
       .setInputCols("token", "sentence")\
       .setOutputCol("ner")\
       .setCaseSensitive(True)
...

sample_text = "A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes."

ner_clinical_trials_abstracts_pipeline: This pretrained pipeline is build upon the ner_clinical_trials_abstracts model and it can extract Age, AllocationRatio, Author, BioAndMedicalUnit, CTAnalysisApproach, CTDesign, Confidence, Country, DisorderOrSyndrome, DoseValue, Drug, DrugTime, Duration, Journal, NumberPatients, PMID, PValue, PercentagePatients, PublicationYear, TimePoint, Value entities.

See Model Hub Page for details.

Example :

pipeline = PretrainedPipeline("ner_clinical_trials_abstracts_pipeline", "en", "clinical/models")

result = pipeline.fullAnnotate("A one-year, randomised, multicentre trial comparing insulin glargine with NPH insulin in combination with oral agents in patients with type 2 diabetes.")

Results :

+----------------+------------------+
|           chunk|         ner_label|
+----------------+------------------+
|      randomised|          CTDesign|
|     multicentre|          CTDesign|
|insulin glargine|              Drug|
|     NPH insulin|              Drug|
| type 2 diabetes|DisorderOrSyndrome|
+----------------+------------------+

ner_pathogen: This model is trained for detecting medical conditions (influenza, headache, malaria, etc), medicine (aspirin, penicillin, methotrexate) and pathogenes (Corona Virus, Zika Virus, E. Coli, etc) in clinical texts. It can extract Pathogen, MedicalCondition, Medicine entities.

See Model Hub Page for details.

Example :

...
clinical_ner = MedicalNerModel.pretrained("ner_pathogen", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner")
...

sample_text = "Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions."

ner_pathogen_pipeline: This pretrained pipeline is build upon the ner_pathogen model and it can extract Pathogen, MedicalCondition, Medicine entities.

See Model Hub Page for details.

Example :

pipeline = PretrainedPipeline("ner_pathogen_pipeline", "en", "clinical/models")

result = pipeline.fullAnnotate("Racecadotril is an antisecretory medication and it has better tolerability than loperamide. Diarrhea is the condition of having loose, liquid or watery bowel movements each day. Signs of dehydration often begin with loss of the normal stretchiness of the skin. While it has been speculated that rabies virus, Lyssavirus and Ephemerovirus could be transmitted through aerosols, studies have concluded that this is only feasible in limited conditions.")

Results :

+---------------+----------------+
|chunk          |ner_label       |
+---------------+----------------+
|Racecadotril   |Medicine        |
|loperamide     |Medicine        |
|Diarrhea       |MedicalCondition|
|dehydration    |MedicalCondition|
|rabies virus   |Pathogen        |
|Lyssavirus     |Pathogen        |
|Ephemerovirus  |Pathogen        |
+---------------+----------------+

ner_biomedical_bc2gm_pipeline : This pretrained pipeline can extract genes/proteins from medical texts by labelling them as GENE_PROTEIN.

See Model Hub Page for details.

Example :

pipeline = PretrainedPipeline("ner_biomedical_bc2gm_pipeline", "en", "clinical/models")

result = pipeline.fullAnnotate("""Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections.""")

Results :

+-----------+------------+
|chunk      |ner_label   |
+-----------+------------+
|S-100      |GENE_PROTEIN|
|HMB-45     |GENE_PROTEIN|
|cytokeratin|GENE_PROTEIN|
+-----------+------------+

20 New Biomedical NER Models Based on the [LivingNER corpus] in 8 Languages

We are releasing 20 new NER and MedicalBertForTokenClassifier models for *English, French, Italian, Portuguese, Romanian, Catalan and Galician languages that are trained on the LivingNER multilingual corpus and for Spanish that is trained on LivingNER corpus is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. These models can detect living species as HUMAN and SPECIES entities in clinical texts.

Here is the list of model names and their embeddings used while training:

Language	Annotator	Embeddings	Model Name
es	MedicalBertForTokenClassification		bert_token_classifier_ner_living_species
es	MedicalNerModel	bert_base_cased_es	ner_living_species_bert
es	MedicalNerModel	roberta_base_biomedical_es	ner_living_species_roberta
es	MedicalNerModel	embeddings_scielo_300d_es	ner_living_species_300
es	MedicalNerModel	w2v_cc_300d_es	ner_living_species
en	MedicalBertForTokenClassification		bert_token_classifier_ner_living_species
en	MedicalNerModel	embeddings_clinical_en	ner_living_species
en	MedicalNerModel	biobert_pubmed_base_cased_en	ner_living_species_biobert
fr	MedicalNerModel	w2v_cc_300d_fr	ner_living_species
fr	MedicalNerModel	bert_embeddings_bert_base_fr_cased	ner_living_species_bert
pt	MedicalBertForTokenClassification		bert_token_classifier_ner_living_species
pt	MedicalNerModel	w2v_cc_300d_pt	ner_living_species
pt	MedicalNerModel	roberta_embeddings_BR_BERTo_pt	ner_living_species_roberta
pt	MedicalNerModel	biobert_embeddings_biomedical_pt	ner_living_species_bert
it	MedicalBertForTokenClassification		bert_token_classifier_ner_living_species
it	MedicalNerModel	bert_base_italian_xxl_cased_it	ner_living_species_bert
it	MedicalNerModel	w2v_cc_300d_it	ner_living_species
ro	MedicalNerModel	bert_base_cased_ro	ner_living_species_bert
cat	MedicalNerModel	w2v_cc_300d_cat	ner_living_species
gal	MedicalNerModel	w2v_cc_300d_gal	ner_living_species

Example :

...
clinical_ner = MedicalNerModel.pretrained("ner_living_species", "en", "clinical/models") \
      .setInputCols(["sentence", "token", "embeddings"]) \
      .setOutputCol("ner_tags")
...

results = ner_model.transform(spark.createDataFrame([["""Patient aged 61 years; no known drug allergies, smoker of 63 packs/year, significant active alcoholism, recently diagnosed hypertension. He came to the emergency department approximately 4 days ago with a frontal headache coinciding with a diagnosis of hypertension, for which he was started on antihypertensive treatment. The family reported that they found him "slower" accompanied by behavioural alterations; with no other accompanying symptoms.Physical examination: Glasgow Glasgow 15; neurological examination without focality except for bradypsychia and disorientation in time, person and space. Afebrile. BP: 159/92; heart rate 70 and O2 Sat: 93%; abdominal examination revealed hepatomegaly of two finger widths with no other noteworthy findings. CBC: Legionella antigen and pneumococcus in urine negative."""]], ["text"]))

Results :

+------------+-------+
|ner_chunk   |label  |
+------------+-------+
|Patient     |HUMAN  |
|family      |HUMAN  |
|person      |HUMAN  |
|Legionella  |SPECIES|
|pneumococcus|SPECIES|
+------------+-------+

2 New Medical NER Models for Romanian Language

We trained ner_clinical and ner_clinical_bert models that can detect Measurements, Form, Symptom, Route, Procedure, Disease_Syndrome_Disorder, Score, Drug_Ingredient, Pulse, Frequency, Date, Body_Part, Drug_Brand_Name, Time, Direction, Dosage, Medical_Device, Imaging_Technique, Test, Imaging_Findings, Imaging_Test, Test_Result, Weight, Clinical_Dept and Units entities in Romanian clinical texts.

ner_clinical: This model is trained with w2v_cc_300d embeddings model.

Example :

...
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical", "ro", "clinical/models")\
        .setInputCols(["sentence","token","word_embeddings"])\
        .setOutputCol("ner")
...

sample_text = "Aorta ascendenta inlocuita cu proteza de Dacron de la nivelul anulusului pana pe segmentul ascendent distal pe o lungime aproximativa de 75 mm."

ner_clinical_bert: This model is trained with bert_base_cased embeddings model.

Example :

 ...
 embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical_bert", "ro", "clinical/models")\
        .setInputCols(["sentence","token","word_embeddings"])\
        .setOutputCol("ner")
...

sample_text = "Aorta ascendenta inlocuita cu proteza de Dacron de la nivelul anulusului pana pe segmentul ascendent distal pe o lungime aproximativa de 75 mm."

Results :

+-------------------+--------------+
|             chunks|      entities|
+-------------------+--------------+
|   Aorta ascendenta|     Body_Part|
|  proteza de Dacron|Medical_Device|
|         anulusului|     Body_Part|
|segmentul ascendent|     Body_Part|
|             distal|     Direction|
|                 75|  Measurements|
|                 mm|         Units|
+-------------------+--------------+

Deidentification Support for Romanian Language (`ner_deid_subentity`, `ner_deid_subentity_bert` and a Pretrained Deidentification Pipeline)

We trained two new NER models to find PHI data (protected health information) that may need to be deidentified in Romanian. ner_deid_subentity and ner_deid_subentity_bert models are trained with in-house annotations and can detect 17 different entities (AGE, CITY, COUNTRY, DATE, DOCTOR, EMAIL, FAX, HOSPITAL, IDNUM, LOCATION-OTHER, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STREET, ZIP).

ner_deid_subentity: This model is trained with w2v_cc_300d embeddings model.

See Model Hub Page for details.

Example :

...
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","ro")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ro", "clinical/models")\
        .setInputCols(["sentence","token","word_embeddings"])\
        .setOutputCol("ner")
...

sample_text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""

ner_deid_subentity_bert: This model is trained with bert_base_cased embeddings model.

See Model Hub Page for details.

Example :

 ...
 embeddings = BertEmbeddings.pretrained("bert_base_cased", "ro")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_bert", "ro", "clinical/models")\
        .setInputCols(["sentence","token","word_embeddings"])\
        .setOutputCol("ner")
...

text = """
Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România
Tel: +40(235)413773
Data setului de analize: 25 May 2022 15:36:00
Nume si Prenume : BUREAN MARIA, Varsta: 77
Medic : Agota Evelyn Tımar
C.N.P : 2450502264401"""

Results :

+----------------------------+---------+
|chunk                       |ner_label|
+----------------------------+---------+
|Spitalul Pentru Ochi de Deal|HOSPITAL |
|Drumul Oprea Nr             |STREET   |
|Vaslui                      |CITY     |
|737405                      |ZIP      |
|+40(235)413773              |PHONE    |
|25 May 2022                 |DATE     |
|BUREAN MARIA                |PATIENT  |
|77                          |AGE      |
|Agota Evelyn Tımar          |DOCTOR   |
|2450502264401               |IDNUM    |
+----------------------------+---------+

clinical_deidentification: This pretrained pipeline that can be used to deidentify PHI information from Romanian medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate ACCOUNT, PLATE, LICENSE, AGE, CITY, COUNTRY, DATE, DOCTOR, EMAIL, FAX, HOSPITAL, IDNUM, LOCATION-OTHER, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STREET, ZIP entities.

See Model Hub Page for details.

Example :

from sparknlp.pretrained import PretrainedPipeline
deid_pipeline = PretrainedPipeline("clinical_deidentification", "ro", "clinical/models")

text = "Varsta : 77, Nume si Prenume : BUREAN MARIA, Data setului de analize: 25 May 2022, Licență : B004256985M, Înmatriculare : CD205113, Cont : FXHZ7170951927104999"

result = deid_pipeline.annotate(text)

print("\nMasked with entity labels")
print("-"*30)
print("\n".join(result['masked']))
print("\nMasked with chars")
print("-"*30)
print("\n".join(result['masked_with_chars']))
print("\nMasked with fixed length chars")
print("-"*30)
print("\n".join(result['masked_fixed_length_chars']))
print("\nObfuscated")
print("-"*30)
print("\n".join(result['obfuscated']))

Results :

Masked with entity labels
------------------------------
Varsta : <AGE>, Nume si Prenume : <PATIENT>, Data setului de analize: <DATE>, Licență : <LICENSE>, Înmatriculare : <PLATE>, Cont : <ACCOUNT>

Masked with chars
------------------------------
Varsta : **, Nume si Prenume : [**********], Data setului de analize: [*********], Licență : [*********], Înmatriculare : [******], Cont : [******************]

Masked with fixed length chars
------------------------------
Varsta : ****, Nume si Prenume : ****, Data setului de analize: ****, Licență : ****, Înmatriculare : ****, Cont : ****

Obfuscated
------------------------------
Varsta : 91, Nume si Prenume : Dragomir Emilia, Data setului de analize: 01-04-2001, Licență : T003485962M, Înmatriculare : AR-65-UPQ, Cont : KHHO5029180812813651

The First Public Health Model: Emotional Stress Classifier

We are releasing a new bert_sequence_classifier_stress model that can classify whether the content of a text expresses emotional stress. It is a PHS-BERT-based model and trained with the Dreaddit dataset.

Example :

...
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_stress", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("class")

sample_text = "No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?"

Results :

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|text                                                                                                                                                                  |   class|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|No place in my city has shelter space for us, and I won't put my baby on the literal street. What cities have good shelter programs for homeless mothers and children?|[stress]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+

`ResolverMerger` Annotator to Merge the Results of `ChunkMapperModel` and `SentenceEntityResolverModel` Annotators

ResolverMerger annotator allows to merge the results of ChunkMapperModel and SentenceEntityResolverModel annotators. You can detect your results that fail by ChunkMapperModel with ChunkMapperFilterer and then merge your resolver and mapper results with ResolverMerger.

Example :

...
chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
      .setInputCols(["chunk"])\
      .setOutputCol("RxNorm_Mapper")\
      .setRel("rxnorm_code")

cfModel = ChunkMapperFilterer() \
    .setInputCols(["chunk", "RxNorm_Mapper"]) \
    .setOutputCol("chunks_fail") \
    .setReturnCriteria("fail")
...
resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("resolver_code") \
    .setDistanceFunction("EUCLIDEAN")

resolverMerger = ResolverMerger()\
    .setInputCols(["resolver_code","RxNorm_Mapper"])\
    .setOutputCol("RxNorm")
...

Results :

+--------------------------------+-----------------------+---------------+-------------+-------------------------+
|chunk                           |RxNorm_Mapper          |chunks_fail    |resolver_code|RxNorm                   |
+--------------------------------+-----------------------+---------------+-------------+-------------------------+
|[Adapin 10 MG, coumadin 5 mg]   |[1000049, NONE]        |[coumadin 5 mg]|[855333]     |[1000049, 855333]        |
|[Avandia 4 mg, Tegretol, zytiga]|[NONE, 203029, 1100076]|[Avandia 4 mg] |[261242]     |[261242, 203029, 1100076]|
+--------------------------------+-----------------------+---------------+-------------+-------------------------+

New Shortest Context Match and Token Index Features in `ContextualParserApproach`

We have new functionalities in ContextualParserApproach to make it more performant.

setShortestContextMatch() parameter will allow stop looking for matches in the text when a token defined as a suffix is found. Also it will keep tracking of the last mathced prefix and subsequent mathches with suffix.
Now the index of the matched token can be found in metadata.

Example :

...
contextual_parser = ContextualParserApproach() \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("entity")\
    .setJsonPath("cities.json")\
    .setCaseSensitive(True)\
    .setDictionary('cities.tsv', options={"orientation":"vertical"})\
    .setShortestContextMatch(True)
...

sample_text = "Peter Parker is a nice guy and lives in Chicago."

Results :

+-------+---------+----------+
|chunk  |ner_label|tokenIndex|
+-------+---------+----------+
|Chicago|City     |9         |
+-------+---------+----------+

Prettified relational categories in `ZeroShotRelationExtractionModel` annotator

Now you can setRelationalCategories() between the entity labels by using a single {} instead of two.

Example :

re_model = ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models")\
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setOutputCol("relations")\
    .setRelationalCategories({"ADE": ["{DRUG} causes {PROBLEM}."]})

Create Graphs for Open Source `NerDLApproach` with the `TFGraphBuilder`

Now you can create graphs for model training with NerDLApproach by using the new setIsMedical() parameter of TFGraphBuilder annotator. If setIsMedical(True), the model can be trained with MedicalNerApproach, but if it is setIsMedical(False) it can be used with NerDLApproach for training non-medical models.

graph_folder_path = "./graphs"

ner_graph_builder = TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setGraphFolder(graph_folder_path)\
    .setIsMedical(False)

ner = NerDLApproach() \
    ...
    .setGraphFolder(graph_folder_path)

ner_pipeline = Pipeline()([
    ...,
    ner_graph_builder,
    ner
    ])

Spark NLP for Healthcare Library Installation with Poetry Documentation (dependency management and packaging tool).

We have a new documentation page for showing Spark NLP for Healthcare installation with Poetry. You can find it here.

Bug fixes

ContextualParserApproach: Fixed the bug using a dictionary and document rule scope in JSON config file.
RENerChunksFilter: Preparing a pretrained pipeline with RENerChunksFilter annotator issue is fixed.

Updated Notebooks

ZeroShot Clinical Relation Extraction Notebook: Added new features, visualization and new examples.
Clinical_Entity_Resolvers Notebook: Added an example of ResolverMerger.
Chunk Mapping Notebook: Added new models into the model list and an example of mapper pretrained pipelines.
Healthcare Code Mapping Notebook: Added all mapper pretrained pipeline examples.

List of Recently Updated and Added Models

ner_pathogene
ner_pathogen_pipeline
ner_clinical_trials_abstracts
bert_token_classifier_ner_clinical_trials_abstracts
ner_clinical_trials_abstracts_pipeline
ner_biomedical_bc2gm_pipeline
bert_sequence_classifier_stress
icd10cm_snomed_mapper
snomed_icd10cm_mapper
snomed_icdo_mapper
icdo_snomed_mapper
rxnorm_umls_mapper
icd10cm_umls_mapper
mesh_umls_mapper
snomed_umls_mapper
icd10cm_snomed_mapping
snomed_icd10cm_mapping
icdo_snomed_mapping
snomed_icdo_mapping
rxnorm_ndc_mapping
icd10cm_umls_mapping
mesh_umls_mapping
rxnorm_umls_mapping
snomed_umls_mapping
drug_action_tretment_mapper
normalized_section_header_mapper
drug_brandname_ndc_mapper
abbreviation_mapper
rxnorm_ndc_mapper
rxnorm_action_treatment_mapper
rxnorm_mapper
ner_deid_subentity -> ro
ner_deid_subentity_bert -> ro
clinical_deidentification -> ro
ner_clinical -> ro
ner_clinical_bert -> ro
bert_token_classifier_ner_living_species -> es
ner_living_species_bert -> es
ner_living_species_roberta -> es
ner_living_species_300 -> es
ner_living_species -> es
bert_token_classifier_ner_living_species -> en
ner_living_species -> en
ner_living_species_biobert -> en
ner_living_species -> fr
ner_living_species_bert -> fr
bert_token_classifier_ner_living_species -> pt
ner_living_species -> pt
ner_living_species_roberta -> pt
ner_living_species_bert -> pt
bert_token_classifier_ner_living_species -> it
ner_living_species_bert -> it
ner_living_species -> pt
ner_living_species_bert -> ro
ner_living_species -> ro
ner_living_species -> gal

For all Spark NLP for healthcare models, please check: Models Hub Page

Versions

Version
Version
Version

PREVIOUSVersion Compatibility

4.0.0

Highlights

8 New Chunk Mapper Models and 9 New Pretrained Chunk Mapper Pipelines to Convert One Medical Terminology to Another (Snomed to ICD10, RxNorm to UMLS etc.)

2 New Medical NER Models (ner_clinical_trials_abstracts and ner_pathogene) and Pretrained NER Pipelines

20 New Biomedical NER Models Based on the [LivingNER corpus] in 8 Languages

2 New Medical NER Models for Romanian Language

Deidentification Support for Romanian Language (ner_deid_subentity, ner_deid_subentity_bert and a Pretrained Deidentification Pipeline)

The First Public Health Model: Emotional Stress Classifier

ResolverMerger Annotator to Merge the Results of ChunkMapperModel and SentenceEntityResolverModel Annotators

New Shortest Context Match and Token Index Features in ContextualParserApproach

Prettified relational categories in ZeroShotRelationExtractionModel annotator

Create Graphs for Open Source NerDLApproach with the TFGraphBuilder

Spark NLP for Healthcare Library Installation with Poetry Documentation (dependency management and packaging tool).

Bug fixes

Updated Notebooks

List of Recently Updated and Added Models

Versions

2 New Medical NER Models (`ner_clinical_trials_abstracts` and `ner_pathogene`) and Pretrained NER Pipelines

Deidentification Support for Romanian Language (`ner_deid_subentity`, `ner_deid_subentity_bert` and a Pretrained Deidentification Pipeline)

`ResolverMerger` Annotator to Merge the Results of `ChunkMapperModel` and `SentenceEntityResolverModel` Annotators

New Shortest Context Match and Token Index Features in `ContextualParserApproach`

Prettified relational categories in `ZeroShotRelationExtractionModel` annotator

Create Graphs for Open Source `NerDLApproach` with the `TFGraphBuilder`