sparknlp_jsl.alab#
Functions to manipulate an NLP Lab JSON export into an appropriate layout for training assertion, relation extraction, and NER models.
Module Contents#
Classes#
AnnotationLab – Interface to John Snow Labs’ NLP Lab.
- class AnnotationLab#
Interface to John Snow Labs’ NLP Lab.
NLP Lab is a tool for multi-modal data annotation. It allows annotation teams to efficiently collaborate to generate training data for ML models and/or to validate automatic annotations generated by those.
This module provides programmatic access to NLP Lab through the Healthcare NLP library. It allows users to:
- Generate a CoNLL formatted file from the annotation JSON for training an NER model.
- Generate a CSV/Excel formatted file from the annotation JSON for training classification, assertion, and relation extraction models.
- Build a preannotation JSON file using Spark NLP pipelines, save it as JSON, and upload the preannotations to a project.
- Interact with the ALAB instance and set up ALAB projects.
- Get the list of all projects in the ALAB instance.
- Create new projects.
- Delete projects.
- Set and edit the configuration of projects.
- Access/get the configuration of any existing project.
- Upload tasks to a project.
- Delete tasks of a project.
For usage details and examples, please check the [NLP Lab tutorial](JohnSnowLabs/spark-nlp-workshop).
- base_url = None#
- client_id = None#
- client_secret = None#
- cookies = None#
- http_success_codes = [200, 201, 203]#
- password = None#
- username = None#
- create_project(project_name, project_description='', project_sampling='', project_instruction='')#
Creates a new project.
- Parameters:
project_name (str) – Project Name
project_description (str) – Project Description
project_sampling (str) – Random or Sequential sampling while showing tasks to annotators
project_instruction (str) – Annotation Guidelines
- Returns:
project
- Return type:
dict
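A minimal usage sketch, assuming the class can be instantiated without arguments and that credentials were already set via set_credentials; the project name, description, and sampling value are placeholders:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()
# credentials are assumed to be set beforehand via alab.set_credentials(...)

project = alab.create_project(
    project_name="clinical_ner_demo",                 # placeholder name
    project_description="Demo project for clinical NER annotation",
    project_sampling="random",                        # random or sequential; exact accepted values may differ
    project_instruction="Annotate all clinical entities.",
)
```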
- delete_project(project_name, confirm=False)#
Deletes an existing project.
- Parameters:
project_name (str) – Project Name
confirm (boolean) – If set to True, no input required for confirmation. Default: False.
- Return type:
dict
- delete_tasks(project_name, task_ids, confirm=False)#
Deletes tasks of a project in NLP Lab.
- Parameters:
project_name (str) – Project Name
task_ids (list) – List of task IDs
confirm (boolean) – If set to True, no input required for confirmation. Default: False.
- Returns:
content
- Return type:
dict
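A hedged sketch of cleaning up a project via delete_tasks and delete_project, assuming an authenticated AnnotationLab instance; the project name and task IDs are illustrative:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()
# credentials are assumed to be set beforehand via alab.set_credentials(...)

# remove two tasks without the interactive confirmation prompt
alab.delete_tasks("clinical_ner_demo", task_ids=[1, 2], confirm=True)

# remove the whole project
alab.delete_project("clinical_ner_demo", confirm=True)
```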
- generate_preannotations(all_results: List[dict], document_column: str, ner_columns: List[str], assertion_columns: List[str] = [], relations_columns: List[str] = [], user_name: str = 'model', titles_list: List[str] = [], id_offset: int = 0) → List[dict]#
Generates a JSON that can be imported directly into NLP Lab as pre-annotations.
- Parameters:
all_results (list[objects]) – list of annotations on documents using light_pipeline.fullAnnotate or full_pipeline.transform(df).collect()
document_column (str) – output column name of DocumentAssembler stage
ner_columns (list[str]) – list of column names of ner chunks
assertion_columns (list[str]) – list of column names of assertion models
relations_columns (list[str]) – list of column names of relation models
user_name (str) – name of model(s). default: model
titles_list (list[str]) – custom list of titles of tasks in NLP Lab. Default: task_ID
id_offset (int) – When older tasks are already in NLP Lab, define the ID offset to avoid overwriting existing tasks. Default: 0
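A sketch of building pre-annotations from Spark NLP output; light_pipeline and texts are assumed to exist, and the column names (document, ner_chunk, assertion) are placeholders that should match your pipeline stages:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()

# `light_pipeline` (a Spark NLP LightPipeline) and `texts` (list of strings) are assumed to exist
all_results = light_pipeline.fullAnnotate(texts)

preannotations = alab.generate_preannotations(
    all_results=all_results,
    document_column="document",        # placeholder column names
    ner_columns=["ner_chunk"],
    assertion_columns=["assertion"],
    user_name="my_model",
    id_offset=0,
)
```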
- get_IAA_metrics(spark: pyspark.sql.SparkSession, conll_dir: str, annotator_names: List[str], set_ref_annotator: str = None, return_NerDLMetrics: bool = False, save_dir: str = 'results_token_based')#
Gets IAA metrics for the annotator annotations.
- Parameters:
spark (SparkSession) – Spark session
conll_dir (str) – path to the CoNLL files
annotator_names (list[str]) – list of annotator names
set_ref_annotator (str) – reference annotator name, defaults to None. If present, all comparisons are made with respect to it.
return_NerDLMetrics (bool) – defaults to False. If True, returns the full chunk and partial chunk per token IAA dataframes using NerDLMetrics. If False, uses the evaluate method for chunk-based comparisons and classification reports for token-based comparisons.
save_dir (str) – path to save token based results, defaults to ‘results_token_based’
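A sketch of computing inter-annotator agreement, assuming spark is an active session with the spark-nlp-jsl jar; the CoNLL directory and annotator names are placeholders:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()

# `spark` is assumed to be an active Spark session with the spark-nlp-jsl jar
alab.get_IAA_metrics(
    spark=spark,
    conll_dir="annotator_conlls",                   # directory with per-annotator CoNLL files
    annotator_names=["annotator_1", "annotator_2"],
    set_ref_annotator="annotator_1",                # optional reference annotator
    return_NerDLMetrics=False,
    save_dir="results_token_based",
)
```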
- get_all_projects()#
Gets a JSON object containing details of all visible projects in NLP Lab.
- Return type:
dict
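A one-line sketch, assuming an authenticated instance:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()
# credentials are assumed to be set beforehand via alab.set_credentials(...)

projects = alab.get_all_projects()   # dict describing all visible projects
```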
- get_annotations(project_name, output_name, save_dir='.')#
Gets / Exports annotations of a project in NLP Lab.
- Parameters:
project_name (str) – Project Name
output_name (str) – name of the JSON file to write the result to.
save_dir (str) – directory where the output JSON is saved.
- Returns:
content
- Return type:
dict
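A sketch of exporting annotations to a local JSON file; the project name, output name, and directory are placeholders:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()
# credentials are assumed to be set beforehand via alab.set_credentials(...)

annotations = alab.get_annotations(
    project_name="clinical_ner_demo",
    output_name="annotations_export",   # written as JSON inside save_dir
    save_dir="exports",
)
```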
- get_assertion_data(spark: pyspark.sql.SparkSession, input_json_path: str, assertion_labels: List[str], relevant_ner_labels: List[str], ground_truth: bool = False, unannotated_label: str | None = None, regex_pattern: str | None = None, unannotated_label_strategy: str | None = None, unannotated_label_strategy_dict: dict | None = None, included_task_ids: List[int] | None = None, excluded_task_ids: List[int] | None = None, excluded_task_titles: List[str] | None = None, seed: int = None) → pandas.DataFrame#
Generates a dataframe to train assertion models in Spark NLP from an NLP Lab JSON export.
- Parameters:
spark (SparkSession) – Spark session with spark-nlp-jsl jar
input_json_path (str) – path to NLP Lab JSON export
assertion_labels (list[str]) – annotated assertion labels to train on
relevant_ner_labels (list[str]) – relevant NER labels that are assigned assertion labels
ground_truth (bool) – set to True to select ground truth completions, False to select latest completions, defaults to False
unannotated_label (str) – assertion label to assign to entities that have no assertion, defaults to None
regex_pattern (str) – set a pattern to use regex tokenizer, defaults to regular tokenizer if pattern not defined
unannotated_label_strategy (str) – set the strategy to control the number of occurrences of the unannotated assertion label in the output dataframe, options are ‘weighted’ or ‘counts’, ‘weighted’ allows to sample using a fraction, ‘counts’ allows to sample using absolute counts, defaults to None
unannotated_label_strategy_dict (dict) – dictionary in the format {‘ENTITY_LABEL’: sample_weight_or_counts} to control the number of occurrences of the unannotated assertion label in the output dataframe, where ‘ENTITY_LABEL’ are the NER labels that are assigned the unannotated assertion label, and sample_weight_or_counts should be between 0 and 1 if unannotated_label_strategy is ‘weighted’ or between 0 and the max number of occurrences of that NER label if unannotated_label_strategy is ‘counts’
included_task_ids (list) – list of NLP Lab task IDs to include in the output dataframe, defaults to None
excluded_task_ids (list) – list of NLP Lab task IDs to exclude from the output dataframe, defaults to None
excluded_task_titles (list) – list of NLP Lab task titles to exclude from output dataframe, defaults to None
seed (int) – random seed to make sampling reproducible across executions, defaults to None
- Returns:
dataframe in appropriate layout for training assertion models
- Return type:
pd.DataFrame
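A sketch of turning an NLP Lab export into an assertion training dataframe, assuming spark is an active session with the spark-nlp-jsl jar; the labels and path are placeholders that should match your project configuration:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()

# `spark` is assumed to be an active Spark session with the spark-nlp-jsl jar
assertion_df = alab.get_assertion_data(
    spark=spark,
    input_json_path="exports/annotations_export.json",
    assertion_labels=["present", "absent"],          # placeholder assertion labels
    relevant_ner_labels=["PROBLEM", "TREATMENT"],    # placeholder NER labels
    ground_truth=True,
    unannotated_label="present",
    seed=42,
)
```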
- get_classification_data(input_json_path, ground_truth=False)#
Generates a dataframe to train classification models.
- Parameters:
input_json_path (str) – path of NLP Lab exported JSON
ground_truth (boolean) – set to True to select ground truth completions, False to select latest completions, defaults to False
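A sketch, with a placeholder export path:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()

classification_df = alab.get_classification_data(
    input_json_path="exports/annotations_export.json",  # placeholder path
    ground_truth=True,
)
```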
- get_conll_data(spark: pyspark.sql.SparkSession, input_json_path: str, output_name: str, save_dir: str = 'exported_conll', ground_truth: bool = False, excluded_labels: List[str] = None, excluded_task_ids: List[int] | None = None, excluded_task_titles: List[str] | None = None, regex_pattern: str = None) → IO#
Generates a CoNLL file from an NLP Lab JSON export.
- Parameters:
spark (SparkSession) – Spark session with spark-nlp-jsl jar
input_json_path (str) – path to NLP Lab JSON export
output_name (str) – name of the CoNLL file to save
save_dir (str) – path for CoNLL file saving directory, defaults to ‘exported_conll’
ground_truth (bool) – set to True to select ground truth completions, False to select latest completions, defaults to False
excluded_labels (list) – labels to exclude from CoNLL; these are all assertion labels and irrelevant NER labels, defaults to None
excluded_task_ids (list) – list of NLP Lab task IDs to exclude from CoNLL, defaults to None
excluded_task_titles (list) – list of NLP Lab task titles to exclude from CoNLL, defaults to None
regex_pattern (str) – set a pattern to use regex tokenizer, defaults to regular tokenizer if pattern not defined
- Returns:
CoNLL file
- Return type:
IO
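A sketch of generating a CoNLL training file, assuming spark is an active session with the spark-nlp-jsl jar; paths and label names are placeholders:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()

# `spark` is assumed to be an active Spark session with the spark-nlp-jsl jar
alab.get_conll_data(
    spark=spark,
    input_json_path="exports/annotations_export.json",
    output_name="ner_train",
    save_dir="exported_conll",
    ground_truth=True,
    excluded_labels=["present", "absent"],   # e.g. assertion labels not needed for NER
)
```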
- get_conll_data_from_visualner(input_json_path: str, output_name: str = 'visualner', save_dir: str = 'exported_conll') → IO#
Generates a CoNLL file from an NLP Lab Visual NER JSON export.
- Parameters:
input_json_path (str) – path to NLP Lab JSON export
output_name (str) – name of the CoNLL file to save
save_dir (str) – path for CoNLL file saving directory, defaults to ‘exported_conll’
- Returns:
CoNLL file
- Return type:
IO
- get_project_config(project_name)#
Gets configuration details of a project as JSON.
- Parameters:
project_name (str) – Project Name
- Returns:
Project Configuration
- Return type:
dict
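A one-line sketch, assuming an authenticated instance and a placeholder project name:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()
# credentials are assumed to be set beforehand via alab.set_credentials(...)

config = alab.get_project_config("clinical_ner_demo")
```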
- get_relation_extraction_data(spark: pyspark.sql.SparkSession, input_json_path: str, ground_truth: bool = False, negative_relations: bool = False, assertion_labels: List[str] | None = None, relations: List[str] | None = None, relation_pairs: List[str] | None = None, negative_relation_strategy: str | None = None, negative_relation_strategy_dict: dict | None = None, excluded_task_ids: int | None = None, excluded_task_titles: List[str] | None = None, doc_wise_annot: bool = False) → pandas.DataFrame#
Generates a dataframe to train relation extraction models in Spark NLP from an NLP Lab JSON export.
- Parameters:
spark (SparkSession) – Spark session with spark-nlp-jsl jar
input_json_path (str) – path to NLP Lab JSON export
ground_truth (bool) – set to True to select ground truth completions, False to select latest completions, defaults to False
negative_relations (bool) – set to True to assign a relation label between entities where no relation was annotated, defaults to False
assertion_labels (list) – all assertion labels that were annotated, defaults to None
relations (list) – names of the relation types to include; all other relation types are discarded, defaults to None
relation_pairs (list) – plausible pairs of entities for relations, separated by a -, use the same casing as the annotations, include only one relation direction, defaults to all possible pairs of annotated entities
negative_relation_strategy (str) – set the strategy to control the number of occurrences of the negative relation label in the output dataframe, options are ‘weighted’ or ‘counts’, ‘weighted’ allows to sample using a fraction, ‘counts’ allows to sample using absolute counts, defaults to None
negative_relation_strategy_dict (dict) – dictionary in the format {‘ENTITY1-ENTITY2’: sample_weight_or_counts} to control the number of occurrences of negative relations in the output dataframe for each entity pair, where ‘ENTITY1-ENTITY2’ represent the pairs of entities for relations separated by a - (include only one relation direction), and sample_weight_or_counts should be between 0 and 1 if negative_relation_strategy is ‘weighted’ or between 0 and the max number of occurrences of negative relations if negative_relation_strategy is ‘counts’, defaults to None
excluded_task_ids (list) – list of NLP Lab task IDs to exclude from output dataframe, defaults to None
excluded_task_titles (list) – list of NLP Lab task titles to exclude from output dataframe, defaults to None
doc_wise_annot (bool) – set to True to return document-wise annotations, including relations annotated across sentences, defaults to False
- Returns:
dataframe in appropriate layout for training relation extraction models
- Return type:
pd.DataFrame
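A sketch of building a relation extraction training dataframe, assuming spark is an active session; the relation names, entity pairs, and path are placeholders:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()

# `spark` is assumed to be an active Spark session with the spark-nlp-jsl jar
rel_df = alab.get_relation_extraction_data(
    spark=spark,
    input_json_path="exports/annotations_export.json",
    ground_truth=True,
    negative_relations=True,
    assertion_labels=["present", "absent"],      # placeholder assertion labels
    relations=["is_related_to"],                 # placeholder relation type
    relation_pairs=["PROBLEM-TREATMENT"],        # placeholder entity pair, one direction only
)
```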
- review_annotations(spark: pyspark.sql.SparkSession, conll_path: str, ner_model_path: str)#
Generates review files from a CoNLL file. It predicts the NER tags for the CoNLL file and compares the predicted tags with the original CoNLL tags to produce the review outputs.
- Parameters:
spark (SparkSession) – spark session
conll_path (str) – path to conll file
ner_model_path (str) – path to ner model
- Returns:
tuple containing chunk_based_comparison_df (DataFrame), token_based_comparison_df (DataFrame), chunk_frequency_df (DataFrame), metrics_result (Dict)
- Return type:
tuple
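A sketch, assuming spark is an active session and the CoNLL file and NER model paths are placeholders; the four return values follow the tuple described above:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()

# `spark` is assumed to be an active Spark session with the spark-nlp-jsl jar
chunk_df, token_df, chunk_freq_df, metrics = alab.review_annotations(
    spark=spark,
    conll_path="exported_conll/ner_train.conll",   # placeholder CoNLL path
    ner_model_path="models/my_ner_model",          # placeholder NER model path
)
```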
- set_credentials(username, password, client_secret, annotationlab_url='https://annotationlab.johnsnowlabs.com')#
Sets credentials to connect to your NLP Lab instance.
- Parameters:
username (str) – Username
password (str) – Password
client_secret (str) – Client Secret for your installation (Every installation of NLP Lab has a secret code).
annotationlab_url (str) – URL of NLP Lab. Default: JSL’s NLP Lab (https://annotationlab.johnsnowlabs.com).
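A minimal sketch, with placeholder credentials and a hypothetical on-premise URL:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()
alab.set_credentials(
    username="my_username",                               # placeholder credentials
    password="my_password",
    client_secret="my_client_secret",                     # installation-specific secret
    annotationlab_url="https://my-nlp-lab-instance.com",  # hypothetical instance URL
)
```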
- set_project_config(project_name, classification_labels=[], ner_labels=[], assertion_labels=[], relations_labels=[])#
Sets configuration of a project.
- Parameters:
project_name (str) – Project Name
classification_labels (List) – List of document classification classes. Defaults to multi-class classification.
ner_labels (List) – List of NER classes
assertion_labels (List) – List of assertion classes
relations_labels (List) – List of different types of relations.
- Returns:
content
- Return type:
dict
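A sketch of configuring labels for a project, assuming an authenticated instance; the label sets are placeholders:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()
# credentials are assumed to be set beforehand via alab.set_credentials(...)

alab.set_project_config(
    project_name="clinical_ner_demo",
    ner_labels=["PROBLEM", "TEST", "TREATMENT"],   # placeholder label sets
    assertion_labels=["present", "absent"],
    relations_labels=["is_related_to"],
)
```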
- upload_preannotations(project_name, preannotations)#
Uploads preannotations to a project in NLP Lab.
- Parameters:
project_name (str) – Project Name
preannotations (List) – preannotation JSON generated by ‘generate_preannotations’ function
- Return type:
dict
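A sketch, assuming preannotations is the list returned by generate_preannotations:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()
# credentials are assumed to be set beforehand via alab.set_credentials(...)

# `preannotations` is assumed to be the output of generate_preannotations
alab.upload_preannotations("clinical_ner_demo", preannotations)
```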
- upload_tasks(project_name, task_list, title_list=[], id_offset=0)#
Uploads tasks to a project in NLP Lab.
- Parameters:
project_name (str) – Project Name
task_list (List) – List of documents (text).
title_list (List) – Option for providing custom titles for each task. If defined, its length should equal the number of tasks.
id_offset (int) – Increment offset for document ID. Useful when uploading in batches. Default: 0.
- Returns:
content
- Return type:
dict
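A sketch of uploading raw documents as tasks, with placeholder texts and titles:

```python
from sparknlp_jsl.alab import AnnotationLab

alab = AnnotationLab()
# credentials are assumed to be set beforehand via alab.set_credentials(...)

alab.upload_tasks(
    project_name="clinical_ner_demo",
    task_list=["Patient reports a severe headache.", "No sign of infection."],  # placeholder documents
    title_list=["task_1", "task_2"],   # optional custom titles, one per task
    id_offset=0,
)
```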