NLP Lab (Annotation Lab) Interface Module
Spark NLP for Healthcare provides easy-to-use functions for interacting with NLP Lab. NLP Lab is a tool for multi-modal data annotation; it allows annotation teams to collaborate efficiently to generate training data for ML models and/or to validate automatic annotations generated by those models.
The NLP Lab interface module provides programmatic interaction with NLP Lab. Detailed usage examples can be found in Complete NLP Lab Module SparkNLP JSL, and the Python documentation in the Python API. The module supports the following functionalities:
- Generating a CoNLL-formatted file from the annotation JSON for training an NER model.
- Generating a CSV/Excel-formatted file from the annotation JSON for training classification, assertion, and relation extraction models.
- Building a pre-annotation JSON file using Spark NLP pipelines, saving it as JSON, and uploading the pre-annotations to a project.
- Interacting with the NLP Lab instance and setting up projects for NLP Lab.
- Getting the list of all projects in the NLP Lab instance.
- Creating new projects.
- Deleting projects.
- Setting and editing the configuration of projects.
- Accessing/getting the configuration of any existing project.
- Uploading tasks to a project.
- Deleting tasks from a project.
Start Module
# import the module
from sparknlp_jsl.alab import AnnotationLab
alab = AnnotationLab()
Generate Data for Training a Classification Model
alab.get_classification_data(
# required: path to NLP Lab JSON export
input_json_path='alab_demo.json',
# optional: set to True to select ground truth completions, False to select latest completions;
# defaults to False
# ground_truth=False
)
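The call above returns the data as a table you can persist for a downstream classification training script. A minimal sketch, assuming the method returns a pandas DataFrame (the variable name classification_df is hypothetical):
# assumption: get_classification_data returns a pandas DataFrame
classification_df = alab.get_classification_data(input_json_path='alab_demo.json')
# save it so a classification training script can read it later
classification_df.to_csv('classification_data.csv', index=False)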
Converting The JSON Export into a CoNLL Format Suitable for Training an NER Model
alab.get_conll_data(
# required: Spark session with spark-nlp-jsl jar
spark=spark,
# required: path to NLP Lab JSON export
input_json_path="alab_demo.json",
# required: name of the CoNLL file to save
output_name="conll_demo",
# optional: path for CoNLL file saving directory, defaults to 'exported_conll'
# save_dir="exported_conll",
# optional: set to True to select ground truth completions, False to select latest completions,
# defaults to False
# ground_truth=False,
# optional: labels to exclude from CoNLL; these are all assertion labels and irrelevant NER labels,
# defaults to empty list
# excluded_labels=['ABSENT'],
# optional: set a pattern to use regex tokenizer, defaults to regular tokenizer if pattern not defined
# regex_pattern="\\s+|(?=[-.:;*+,$&%\\[\\]])|(?<=[-.:;*+,$&%\\[\\]])"
# optional: list of NLP Lab task IDs to exclude from CoNLL, defaults to None
# excluded_task_ids = [2, 3],
# optional: list of NLP Lab task titles to exclude from CoNLL, defaults to None
# excluded_task_titles = ['Note 1']
)
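Once exported, the CoNLL file can be read back with Spark NLP's CoNLL reader to build an NER training dataset. A minimal sketch, assuming the default save_dir ('exported_conll') and the output_name ('conll_demo') used above; the exact file name and extension on disk may differ:
from sparknlp.training import CoNLL
# read the exported CoNLL file into a Spark DataFrame ready for NER training
training_data = CoNLL().readDataset(spark, 'exported_conll/conll_demo.conll')
training_data.selectExpr("text", "label.result").show(3, truncate=60)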
Converting The JSON Export into a Dataframe Suitable for Training an Assertion Model
alab.get_assertion_data(
# required: SparkSession with spark-nlp-jsl jar
spark=spark,
# required: path to NLP Lab JSON export
input_json_path = 'alab_demo.json',
# required: annotated assertion labels to train on
assertion_labels = ['ABSENT'],
# required: relevant NER labels that are assigned assertion labels
relevant_ner_labels = ['PROBLEM', 'TREATMENT'],
# optional: set to True to select ground truth completions, False to select latest completions,
# defaults to False
# ground_truth = False,
# optional: assertion label to assign to entities that have no assertion labels, defaults to None
# unannotated_label = 'PRESENT',
# optional: set a pattern to use regex tokenizer, defaults to regular tokenizer if pattern not defined
# regex_pattern = "\\s+|(?=[-.:;*+,$&%\\[\\]])|(?<=[-.:;*+,$&%\\[\\]])",
# optional: set the strategy to control the number of occurrences of the unannotated assertion label
# in the output dataframe, options are 'weighted' or 'counts', 'weighted' allows to sample using a
# fraction, 'counts' allows to sample using absolute counts, defaults to None
# unannotated_label_strategy = None,
# optional: dictionary in the format {'ENTITY_LABEL': sample_weight_or_counts} to control the number of
# occurrences of the unannotated assertion label in the output dataframe, where 'ENTITY_LABEL' are the
# NER labels that are assigned the unannotated assertion label, and sample_weight_or_counts should be
# between 0 and 1 if `unannotated_label_strategy` is 'weighted' or between 0 and the max number of
# occurrences of that NER label if `unannotated_label_strategy` is 'counts'
# unannotated_label_strategy_dict = {'PROBLEM': 0.5, 'TREATMENT': 0.5},
# optional: list of NLP Lab task IDs to exclude from output dataframe, defaults to None
# excluded_task_ids = [2, 3],
# optional: list of NLP Lab task titles to exclude from output dataframe, defaults to None
# excluded_task_titles = ['Note 1']
)
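The returned data can be split before training an assertion model. A minimal sketch, assuming get_assertion_data returns a pandas DataFrame (assertion_df, train_df, and test_df are hypothetical names):
# assumption: get_assertion_data returns a pandas DataFrame
assertion_df = alab.get_assertion_data(
    spark=spark,
    input_json_path='alab_demo.json',
    assertion_labels=['ABSENT'],
    relevant_ner_labels=['PROBLEM', 'TREATMENT'])
# simple train/test split before feeding an assertion training script
train_df = assertion_df.sample(frac=0.8, random_state=42)
test_df = assertion_df.drop(train_df.index)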
Converting The JSON Export into a Dataframe Suitable for Training a Relation Extraction Model
alab.get_relation_extraction_data(
# required: Spark session with spark-nlp-jsl jar
spark=spark,
# required: path to NLP Lab JSON export
input_json_path='alab_demo.json',
# optional: set to True to select ground truth completions, False to select latest completions,
# defaults to False
ground_truth=True,
# optional: set to True to assign a relation label between entities where no relation was annotated,
# defaults to False
negative_relations=True,
# optional: all assertion labels that were annotated in the NLP Lab, defaults to None
assertion_labels=['ABSENT'],
# optional: plausible pairs of entities for relations, separated by a '-', use the same casing as the
# annotations, include only one relation direction, defaults to all possible pairs of annotated entities
relation_pairs=['DATE-PROBLEM','TREATMENT-PROBLEM','TEST-PROBLEM'],
# optional: set the strategy to control the number of occurrences of the negative relation label
# in the output dataframe, options are 'weighted' or 'counts', 'weighted' allows to sample using a
# fraction, 'counts' allows to sample using absolute counts, defaults to None
negative_relation_strategy='weighted',
# optional: dictionary in the format {'ENTITY1-ENTITY2': sample_weight_or_counts} to control the number of
# occurrences of negative relations in the output dataframe for each entity pair, where 'ENTITY1-ENTITY2'
# represent the pairs of entities for relations separated by a `-` (include only one relation direction),
# and sample_weight_or_counts should be between 0 and 1 if `negative_relation_strategy` is 'weighted' or
# between 0 and the max number of occurrences of negative relations if `negative_relation_strategy` is
# 'counts', defaults to None
negative_relation_strategy_dict = {'DATE-PROBLEM': 0.1, 'TREATMENT-PROBLEM': 0.5, 'TEST-PROBLEM': 0.2},
# optional: list of NLP Lab task IDs to exclude from output dataframe, defaults to None
# excluded_task_ids = [2, 3],
# optional: list of NLP Lab task titles to exclude from output dataframe, defaults to None
# excluded_task_titles = ['Note 1']
)
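If a downstream training pipeline expects a Spark DataFrame rather than a pandas one, the result can be converted. A minimal sketch, assuming get_relation_extraction_data returns a pandas DataFrame (relation_df and relation_sdf are hypothetical names):
# assumption: get_relation_extraction_data returns a pandas DataFrame
relation_df = alab.get_relation_extraction_data(
    spark=spark,
    input_json_path='alab_demo.json')
# convert to a Spark DataFrame for Spark-based training pipelines
relation_sdf = spark.createDataFrame(relation_df)
relation_sdf.show(5, truncate=50)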
Generate JSON Containing Pre-annotations Using a Spark NLP Pipeline
pre_annotations, summary = alab.generate_preannotations(
# required: list of results.
all_results = results,
# required: output column name of the 'DocumentAssembler' stage, used to get the original document string.
document_column = 'document',
# required: column name(s) of NER model(s). Note: multiple NER models can be used, but make sure their results don't overlap,
# or use 'ChunkMergeApproach' to combine results from multiple NER models.
ner_columns = ['ner_chunk'],
# optional: column name(s) of assertion model(s). Note: multiple assertion models can be used, but make sure their results don't overlap.
# assertion_columns = ['assertion_res'],
# optional: column name(s) of relation extraction model(s). Note: multiple relation extraction models can be used, but make sure their results don't overlap.
# relations_columns = ['relations_clinical', 'relations_pos'],
# optional: This can be defined to identify which pipeline/user/model was used to get predictions.
# Default: 'model'
# user_name = 'model',
# optional: Option to assign custom titles to tasks. By default, tasks will be titled as 'task_#'
# titles_list = [],
# optional: If there are already tasks in the project, this ID offset can be used to make sure default titles 'task_#' do not overlap.
# When uploading a batch after the first one, this can be set to the number of tasks currently present in the project;
# this number is added to each task's ID and title.
# id_offset=0
)
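The generated pre-annotations can be saved to disk and reviewed before uploading them to a project. A minimal sketch using the standard library (the file name preannotations.json is arbitrary):
import json
# persist the generated pre-annotations for later upload or manual inspection
with open('preannotations.json', 'w') as f:
    json.dump(pre_annotations, f)
# 'summary' holds summary information about the generated pre-annotations
print(summary)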
Interacting with NLP Lab
alab = AnnotationLab()
username=''
password=''
client_secret=''
annotationlab_url=''
alab.set_credentials(
# required: username
username=username,
# required: password
password=password,
# required: secret for your NLP Lab instance (every installation has a different secret)
client_secret=client_secret,
# required: http(s) URL of your NLP Lab instance
annotationlab_url=annotationlab_url)
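To avoid hard-coding secrets, the same call can read the credentials from environment variables. A minimal sketch, assuming ALAB_USERNAME, ALAB_PASSWORD, ALAB_CLIENT_SECRET, and ALAB_URL are set in the environment (the variable names are arbitrary):
import os
# read credentials from environment variables instead of hard-coding them
alab.set_credentials(
    username=os.environ['ALAB_USERNAME'],
    password=os.environ['ALAB_PASSWORD'],
    client_secret=os.environ['ALAB_CLIENT_SECRET'],
    annotationlab_url=os.environ['ALAB_URL'])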
Get All Visible Projects
alab.get_all_projects()
Create a New Project
alab.create_project(
# required: unique name of project
project_name = 'alab_demo',
# optional: other details about the project. Default: empty string
project_description='',
# optional: sampling option for tasks. Default: random
project_sampling='',
# optional: annotation guidelines for the project
project_instruction='')
Delete a Project
alab.delete_project(
# required: unique name of project
project_name = 'alab_demo',
# optional: confirmation for deletion. Default: False - will ask for confirmation. If set to True, will delete directly.
confirm=False)
Upload Tasks to a Project
alab.upload_tasks(
# required: name of project to upload tasks to
project_name='alab_demo',
# required: list of examples / tasks as strings (one string is one task).
task_list=task_list,
# optional: Option to assign custom titles to tasks. By default, tasks will be titled as 'task_#'
title_list = [],
# optional: If there are already tasks in the project, this ID offset can be used to make sure default titles 'task_#' do not overlap.
# When uploading a batch after the first one, this can be set to the number of tasks currently present in the project;
# this number is added to each task's ID and title.
id_offset=0)
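In practice, task_list is simply a list of raw document strings. A minimal sketch building it from a pandas DataFrame of clinical notes (notes_df and its 'text' column are hypothetical) and capturing the response, which is expected to contain the created task IDs (see the 'task_ids' note in the next step):
# build the task list from a hypothetical DataFrame of clinical notes
task_list = notes_df['text'].astype(str).tolist()
upload_result = alab.upload_tasks(
    project_name='alab_demo',
    task_list=task_list)
# inspect the response; the 'task_ids' key can be used when deleting tasks later
print(upload_result)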
Delete Tasks from a Project
alab.delete_tasks(
# required: name of project to delete tasks from
project_name='alab_demo',
# required: list of IDs of tasks to delete.
# note: you can get task IDs from the upload step above; look for the 'task_ids' key.
task_ids=[1, 2],
# optional: confirmation for deletion. Default: False - will ask for confirmation. If set to True, will delete directly.
confirm=False)
Upload Pre-annotations to NLP Lab
alab.upload_preannotations(
# required: name of project to upload annotations to
project_name = 'alab_demo',
# required: preannotation JSON
preannotations = pre_annotations)
Deidentification Module
Spark NLP for Healthcare provides functionality to apply deidentification through an easy-to-use module named Deid.
The Deid module is a tool for deidentifying Personal Health Information in data read from a file path. It can be used with a custom Spark NLP NER pipeline or without any pipeline specified. It returns the deidentification results as a PySpark dataframe as well as a csv or json file.
The module also includes functionality for applying Structured Deidentification to data from a file path.
The deidentify() function can be used with a custom pipeline or without defining any custom pipeline. The structured_deidentifier() function can be used for the Structured Deidentification task.
Apply Deidentification With a Custom Pipeline
from sparknlp_jsl import Deid
deid_implementor = Deid(
# required: Spark session with spark-nlp-jsl jar
spark,
# required: The path of the input file. Default is None. File type must be 'csv' or 'json'.
input_file_path="data.csv",
#optional: The path of the output file. Default is 'deidentified.csv'. File type must be 'csv' or 'json'.
output_file_path="deidentified.csv",
#optional: The separator of the input csv file. Default is "\t".
separator=",",
#optional: A custom pipeline model to be used for deidentification. If not specified, the default is None.
custom_pipeline=nlpModel,
#optional: Fields to be deidentified and their deidentification modes, by default {"text": "mask"}
fields={"text": "mask", "text_1": "obfuscate"},
#optional: The masking policy. Default is "entity_labels".
masking_policy="fixed_length_chars",
#optional: The fixed mask length. Default is 4.
fixed_mask_length=4,
#optional: The final chunk column name of the custom pipeline that will be deidentified, if specified. Default is "ner_chunk".
ner_chunk="ner_chunk",
#optional: The corresponding document column name of the custom pipeline, if specified. Default is "document"
document="document",
#optional: The corresponding sentence column name of the custom pipeline, if specified. Default is "sentence"
sentence="sentence",
#optional: The corresponding token column name of the custom pipeline, if specified. Default is "token"
token="token",
#optional: The source of the reference file for obfuscation. Default is "faker".
#obfuscate_ref_source="both",
#optional: The path of the reference file for obfuscation. Default is None.
#obfuscate_ref_file_path="obfuscation.txt",
#optional: Obfuscate date. Default is True.
#obfuscate_date=True,
#optional: The document hash coder column name. Default is "documentHash".
#documentHashCoder_col_name= "documentHash"
#optional: ID column name. Default is "id".
#id_column_name= "ID"
#optional: Date shift column name. Default is "date_shift".
#date_shift_column_name= "date_shift"
#optional: JSON file path for multi-mode Deid. Default is None.
#multi_mode_file_path= "multi_mode_file_path.json"
#optional: The date tag. Default is "DATE".
#date_tag="DATE"
#optional: Language. Default is "en"
#language="en"
#optional: Region. Default is "us"
#region="us"
#optional: Age group obfuscation. Default is False.
#age_group_obfuscation=True
#optional: Age ranges for obfuscation. Default is [1, 4, 12, 20, 40, 60, 80].
#age_ranges=[1, 4, 12, 20, 40, 60, 80]
#optional: Shift days. Default is False.
#shift_days=False
#optional: The number of days to shift. Default is None.
#number_of_days=5
#optional: Use unnormalized date. Default is False.
#unnormalized_date=True
#optional: The unnormalized mode. Default is "mask".
#unnormalized_mode="obfuscate"
)
res = deid_implementor.deidentify()
+---+----------------------------------------------------------------------+----------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID| text| text_deidentified| text_1| text_1_deidentified|
+---+----------------------------------------------------------------------+----------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
| 0|Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson ...|Record date : ** , ** , M.D . , Name : ** MR .|Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-...|Date : 10-16-1991 PCP : Alveda Castles , 26 years-old , Record date...|
+---+----------------------------------------------------------------------+----------------------------------------------+----------------------------------------------------------------------+----------------------------------------------------------------------+
Apply Deidentification With No Custom Pipeline
from sparknlp_jsl import Deid
deid_implementor = Deid(
# required: Spark session with spark-nlp-jsl jar
spark,
# required: The path of the input file. Default is None. File type must be 'csv' or 'json'.
input_file_path="data.csv",
#optional: The path of the output file. Default is 'deidentified.csv'. File type must be 'csv' or 'json'.
output_file_path="deidentified.csv",
#optional: The separator of the input csv file. Default is "\t".
separator=",",
#optional: Fields to be deidentified and their deidentification modes, by default {"text": "mask"}
fields={"text": "mask"},
#optional: The masking policy. Default is "entity_labels".
masking_policy="entity_labels",
#optional: JSON file path for multi-mode Deid. Default is None.
#multi_mode_file_path= "multi_mode_file_path.json",
#optional: Age group obfuscation. Default is False.
#age_group_obfuscation=True
#optional: Age ranges for obfuscation. Default is [1, 4, 12, 20, 40, 60, 80].
#age_ranges=[1, 4, 12, 20, 40, 60, 80]
#optional: Shift days. Default is False.
#shift_days=False
#optional: The number of days to shift. Default is None.
#number_of_days=5
#optional: Use unnormalized date. Default is False.
#unnormalized_date=True
#optional: The unnormalized mode. Default is "mask".
#unnormalized_mode="obfuscate"
)
res = deid_implementor.deidentify()
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| ID| text_original| text_deid|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
| 0| "| "|
| 1|Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson ...|Record date : <DATE> , <DOCTOR> , M.D . , Name : <PATIENT> , MR # <...|
| 2| "| "|
+---+----------------------------------------------------------------------+----------------------------------------------------------------------+
Apply Structured Deidentification
from sparknlp_jsl.utils.deidentification_utils import structured_deidentifier
res = structured_deidentifier(
# required: Spark session with spark-nlp-jsl jar
spark,
#required: The path of the input file. Default is None. File type must be 'csv' or 'json'.
input_file_path="data.csv",
#optional: The path of the output file. Default is 'deidentified.csv'. File type must be 'csv' or 'json'.
output_file_path="deidentified.csv",
#optional: The separator of the input csv file. Default is "\t".
separator=",",
#optional: A dictionary that contains the column names and the tags that should be used for deidentification. Default is {"NAME":"PATIENT","AGE":"AGE"}
columns_dict= {"NAME": "ID", "DOB": "DATE"},
#optional: The seed value for the random number generator. Default is {"NAME": 23, "AGE": 23}
columns_seed= {"NAME": 23, "DOB": 23},
#optional: The source of the reference file. Default is faker.
ref_source="faker",
#optional: The number of days to be shifted. Default is None
shift_days=5,
#optional: The path of the reference file for obfuscation. Default is None.
#obfuscateRefFile="obfuscator_unique_ref_test.txt",
#optional: A list of date formats. Default is ["dd/MM/yyyy", "dd-MM-yyyy", "d/M/yyyy", "dd-MM-yyyy", "d-M-yyyy"]
#date_formats=["dd/MM/yyyy", "dd-MM-yyyy"]
)
+----------+------------+--------------------+---+----------------+
| NAME| DOB| ADDRESS|SBP| TEL|
+----------+------------+--------------------+---+----------------+
|[N2649912]|[18/02/1977]| 711 Nulla St.|140| 673 431234|
| [W466004]|[28/02/1977]| 1 Green Avenue.|140|+23 (673) 431234|
| [M403810]|[16/04/1900]|Calle del Liberta...|100| 912 345623|
+----------+------------+--------------------+---+----------------+
Compatibility
This module helps to find appropriate model versions depending on your distribution of John Snow Labs products.
By searching our vast repository of models available at the NLP Model Hub, it can return a JSON-like structure with the models' information (using the .findVersion() method) or print the models that match a given query (using the .showVersion() method).
To use it, simply run the following:
from johnsnowlabs import medical
# Or: from sparknlp_jsl.compatibility import Compatibility
compatibility = medical.Compatibility()
# Returns a list of dict objects
found_models = compatibility.findVersion('ner_clinical')
To tabulate and visualize all retrieved models, you can:
import pandas as pd
# tabulate the returned list of dicts
models_df = pd.DataFrame(found_models)
models_df
| | name | sparkVersion | version | language | date | readyToUse |
|---:|:----------------------------------------------------|---------------:|:----------|:-----------|:------------------------|:-------------|
| 0 | ner_clinical_noncontrib | 2.4 | 2.3.0 | en | 2019-11-14T17:07:35.434 | true |
| 1 | ner_clinical_large | 2.4 | 2.5.0 | en | 2020-05-21T00:35:02.624 | true |
| 2 | ner_clinical | 3 | 3.0.0 | en | 2021-01-27T12:52:59.087 | true |
| 3 | ner_clinical_large_en | 3 | 3.0.0 | en | 2021-03-31T12:32:55.357 | true |
| 4 | ner_clinical | 3 | 3.0.0 | en | 2021-03-31T16:33:39.368 | true |
| 5 | ner_clinical_large | 3 | 3.0.0 | en | 2021-03-31T15:55:14.650 | true |
| 6 | ner_clinical_biobert | 3 | 3.0.0 | en | 2021-04-01T07:06:52.919 | true |
| 7 | ner_clinical | 2.3 | 3.0.0 | en | 2021-03-31T16:33:39.368 | true |
| 8 | ner_clinical_biobert | 2.3 | 3.0.0 | en | 2021-04-01T07:06:52.919 | true |
| 9 | ner_clinical | 2.3 | 3.0.0 | en | 2021-01-27T12:52:59.087 | true |
| 10 | ner_clinical | 2.3 | 3.0.0 | en | 2021-03-31T16:33:39.368 | true |
| 11 | ner_clinical_large | 2.3 | 3.0.0 | en | 2021-03-31T15:55:14.650 | true |
| 12 | bert_token_classifier_ner_clinical | 2.4 | 3.2.0 | en | 2021-08-28T15:51:44.492 | true |
| 13 | bert_token_classifier_ner_clinical | 2.4 | 3.3.4 | en | 2022-01-06T12:42:21.908 | true |
| 14 | bert_token_classifier_ner_clinical_pipeline | 3 | 3.4.1 | en | 2022-03-15T12:08:50.209 | true |
| 15 | bert_token_classifier_ner_clinical_pipeline | 2.4 | 3.4.1 | en | 2022-03-15T12:56:42.874 | true |
| 16 | ner_clinical_biobert_pipeline | 3 | 3.4.1 | en | 2022-03-21T15:06:54.361 | true |
| 17 | ner_clinical_large_pipeline | 3 | 3.4.1 | en | 2022-03-21T14:29:11.545 | true |
| 18 | ner_clinical_pipeline | 3 | 3.4.1 | en | 2022-03-21T14:32:59.531 | true |
| 19 | bert_token_classifier_ner_clinical_pipeline | 3 | 3.4.1 | en | 2022-03-21T18:51:36.583 | true |
| 20 | ner_clinical_trials_abstracts | 3 | 3.5.3 | en | 2022-06-22T15:26:56.789 | true |
| 21 | ner_clinical_trials_abstracts_pipeline | 3 | 3.5.3 | en | 2022-06-27T07:07:17.828 | true |
| 22 | bert_token_classifier_ner_clinical_trials_abstracts | 3 | 3.5.3 | en | 2022-06-29T04:10:29.985 | true |
| 23 | ner_clinical_bert | 3 | 4.0.0 | ro | 2022-06-30T21:36:31.573 | true |
| 24 | ner_clinical | 3 | 4.0.0 | ro | 2022-07-01T14:55:02.322 | true |
| 25 | ner_clinical_bert | 3 | 4.0.2 | ro | 2022-08-12T09:12:00.992 | true |
| 26 | bert_token_classifier_ner_clinical_trials_abstracts | 3 | 4.0.2 | es | 2022-08-11T14:45:17.151 | true |
| 27 | ner_clinical_trials_abstracts | 3 | 4.0.2 | es | 2022-08-12T21:19:27.613 | true |
| 28 | ner_clinical_bert | 3 | 4.2.2 | ro | 2022-11-22T13:33:53.852 | true |
Or simply run the showVersion() method instead:
compatibility.showVersion('ner_clinical')
+-----------------------------------------------------+------+---------+
| Pipeline/Model | lang | version |
+-----------------------------------------------------+------+---------+
| ner_clinical_noncontrib | en | 2.3.0 |
| ner_clinical_large | en | 2.5.0 |
| ner_clinical | en | 3.0.0 |
| ner_clinical_large_en | en | 3.0.0 |
| ner_clinical | en | 3.0.0 |
| ner_clinical_large | en | 3.0.0 |
| ner_clinical_biobert | en | 3.0.0 |
| ner_clinical | en | 3.0.0 |
| ner_clinical_biobert | en | 3.0.0 |
| ner_clinical | en | 3.0.0 |
| ner_clinical | en | 3.0.0 |
| ner_clinical_large | en | 3.0.0 |
| bert_token_classifier_ner_clinical | en | 3.2.0 |
| bert_token_classifier_ner_clinical | en | 3.3.4 |
| bert_token_classifier_ner_clinical_pipeline | en | 3.4.1 |
| bert_token_classifier_ner_clinical_pipeline | en | 3.4.1 |
| ner_clinical_biobert_pipeline | en | 3.4.1 |
| ner_clinical_large_pipeline | en | 3.4.1 |
| ner_clinical_pipeline | en | 3.4.1 |
| bert_token_classifier_ner_clinical_pipeline | en | 3.4.1 |
| ner_clinical_trials_abstracts | en | 3.5.3 |
| ner_clinical_trials_abstracts_pipeline | en | 3.5.3 |
| bert_token_classifier_ner_clinical_trials_abstracts | en | 3.5.3 |
| ner_clinical_bert | ro | 4.0.0 |
| ner_clinical | ro | 4.0.0 |
| ner_clinical_bert | ro | 4.0.2 |
| bert_token_classifier_ner_clinical_trials_abstracts | es | 4.0.2 |
| ner_clinical_trials_abstracts | es | 4.0.2 |
| ner_clinical_bert | ro | 4.2.2 |
+-----------------------------------------------------+------+---------+
InternalResourceDownloader
This module provides extended functionality to list and download models from John Snow Labs repositories. It is an auxiliary module for finding and downloading different models for studies and analysis.
As with the Compatibility module, InternalResourceDownloader is also capable of displaying the available models. The difference is that this module can filter the results based on the Python class name of the annotator, while Compatibility searches by model name.
Displaying available models
To display the pipelines or models, you can use the .showPrivateModels(), .showPrivatePipelines(), .returnPrivateModels(), or .returnPrivatePipelines() methods, which either return the results as a list or print them directly.
For example, to list all models of class MedicalNerModel, just run (some results were omitted for brevity):
medical_ner_models = medical.InternalResourceDownloader.returnPrivateModels("MedicalNerModel")
medical_ner_models[0]
['nerdl_tumour_demo', 'en', '1.7.3']
medical.InternalResourceDownloader.showPrivateModels("MedicalNerModel")
+----------------------------------------+------+---------+
| Model | lang | version |
+----------------------------------------+------+---------+
| ner_deid_subentity_bert | ro | 4.0.0 |
| ner_deid_subentity | ro | 4.0.0 |
| ner_pathogen | en | 4.0.0 |
| ner_clinical_bert | ro | 4.0.0 |
| ner_clinical | ro | 4.0.0 |
| ner_ade_binary | en | 4.0.0 |
| ner_living_species_300 | es | 4.0.0 |
| ner_clinical_bert | ro | 4.0.2 |
| ner_clinical_trials_abstracts | es | 4.0.2 |
| ner_pharmacology | es | 4.0.2 |
| ner_negation_uncertainty | es | 4.0.2 |
| disease_mentions_tweet | es | 4.0.2 |
| ner_deid_generic_bert | ro | 4.0.2 |
| ner_oncology_unspecific_posology_wip | en | 4.0.0 |
| ner_oncology_wip | en | 4.0.0 |
| ner_oncology_therapy_wip | en | 4.0.0 |
| ner_oncology_posology_wip | en | 4.0.0 |
| ner_oncology_anatomy_general_wip | en | 4.0.0 |
| ner_oncology_tnm_wip | en | 4.0.0 |
| ner_oncology_demographics_wip | en | 4.0.0 |
| ner_oncology_biomarker_wip | en | 4.0.0 |
| ner_oncology_anatomy_granular_wip | en | 4.0.0 |
| ner_oncology_test_wip | en | 4.0.0 |
| ner_oncology_diagnosis_wip | en | 4.0.0 |
| ner_oncology_response_to_treatment_wip | en | 4.0.0 |
| ner_jsl | en | 4.2.0 |
| ner_covid_trials | en | 4.2.0 |
| ner_oncology_unspecific_posology | en | 4.0.0 |
| ner_oncology | en | 4.0.0 |
| ner_oncology_tnm | en | 4.0.0 |
| ner_oncology_anatomy_general | en | 4.0.0 |
| ner_oncology_therapy | en | 4.0.0 |
| ner_oncology_test | en | 4.0.0 |
| ner_oncology_diagnosis | en | 4.0.0 |
| ner_oncology_demographics | en | 4.0.0 |
| ner_oncology_anatomy_granular | en | 4.0.0 |
| ner_oncology_response_to_treatment | en | 4.0.0 |
| ner_oncology_posology | en | 4.0.0 |
| ner_oncology_biomarker | en | 4.0.0 |
| ner_sdoh_slim_wip | en | 4.2.1 |
| ner_clinical_bert | ro | 4.2.2 |
| ner_living_species_300 | es | 4.2.2 |
| ner_deid_generic_bert | ro | 4.2.2 |
| ner_oncology_biomarker | en | 4.2.2 |
| ner_oncology_response_to_treatment | en | 4.2.2 |
| ner_oncology_demographics | en | 4.2.2 |
| ner_oncology_therapy | en | 4.2.2 |
| ner_oncology | en | 4.2.2 |
| ner_oncology_anatomy_granular | en | 4.2.2 |
| ner_oncology_anatomy_general | en | 4.2.2 |
| ner_oncology_diagnosis | en | 4.2.2 |
| ner_oncology_tnm | en | 4.2.2 |
| ner_oncology_posology | en | 4.2.2 |
| ner_oncology_unspecific_posology | en | 4.2.2 |
| ner_oncology_test | en | 4.2.2 |
+----------------------------------------+------+---------+
ModelTracer
This module adds columns to the data to help track the UIDs and timestamps of each stage of the pipeline.
Given the following pipeline for Medical NER:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")\
.setLabelCasing("upper") #decide if we want to return the tags in upper or lower case
ner_converter = NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(
stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
To add the UID and timestamp of each pipeline step, simply use:
from sparknlp_jsl.modelTracer import ModelTracer
df = model.transform(empty_data)
tracer_result = ModelTracer().addUidCols(pipeline = nlpPipeline, df = df)
tracer_result.show(truncate=False)
+----+------------------------------------------+--------+-----+----------+---+---------+----------------------------------------------------------------------+----------------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------------------------------+--------------------------------------------------------------------+-----------------------------------------------------------------+
|text|document |sentence|token|embeddings|ner|ner_chunk|documentassembler_model_uid |sentencedetectordlmodel_model_uid |tokenizer_model_uid |word_embeddings_model_model_uid |medicalnermodel_model_uid |nerconverter_model_uid |
+----+------------------------------------------+--------+-----+----------+---+---------+----------------------------------------------------------------------+----------------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------------------------------+--------------------------------------------------------------------+-----------------------------------------------------------------+
| |[{document, 0, -1, , {sentence -> 0}, []}]|[] |[] |[] |[] |[] |{uid -> DocumentAssembler_3e110f5ce3dc, timestamp -> 2022-10-21_22:58}|{uid -> SentenceDetectorDLModel_6bafc4746ea5, timestamp -> 2022-10-21_22:58}|{uid -> Tokenizer_bd74fe5f5860, timestamp -> 2022-10-21_22:58}|{uid -> WORD_EMBEDDINGS_MODEL_9004b1d00302, timestamp -> 2022-10-21_22:58}|{uid -> MedicalNerModel_1a8637089929, timestamp -> 2022-10-21_22:58}|{uid -> NerConverter_643c903e9161, timestamp -> 2022-10-21_22:58}|
+----+------------------------------------------+--------+-----+----------+---+---------+----------------------------------------------------------------------+----------------------------------------------------------------------------+--------------------------------------------------------------+--------------------------------------------------------------------------+--------------------------------------------------------------------+-----------------------------------------------------------------+