sparknlp_jsl.alab#

Functions that convert an NLP Lab JSON export into a layout suitable for training assertion, relation extraction, and NER models.

Module Contents#

Classes#

AnnotationLab

Interface to John Snow Labs's NLP Lab.

class AnnotationLab#

Interface to John Snow Labs’s NLP Lab.

NLP Lab is a tool for multi-modal data annotation. It allows annotation teams to collaborate efficiently to generate training data for ML models and/or to validate automatic annotations generated by those models.

This module provides programmatic access to NLP Lab through the Healthcare NLP library. It allows users to:

  • Generate a CoNLL-formatted file from the annotation JSON for training an NER model.

  • Generate a CSV/Excel-formatted file from the annotation JSON for training classification, assertion, and relation extraction models.

  • Build a preannotation JSON file using Spark NLP pipelines, save it as JSON, and upload the preannotations to a project.

  • Interact with the ALAB instance and set up ALAB projects.

  • Get the list of all projects in the ALAB instance.

  • Create new projects.

  • Delete projects.

  • Set and edit the configuration of projects.

  • Get the configuration of any existing project.

  • Upload tasks to a project.

  • Delete tasks of a project.

For usage details and examples, please check the [NLP Lab tutorial](JohnSnowLabs/spark-nlp-workshop).
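As a rough illustration of how the project-management methods documented on this page fit together, the sketch below chains set_credentials, create_project, and set_project_config. It is not taken from the library: bootstrap_project is a hypothetical helper, alab is assumed to be an already constructed AnnotationLab instance, and the description text and label names are placeholders.

```python
def bootstrap_project(alab, username, password, client_secret, project_name, ner_labels):
    """Hypothetical helper: connect, create a project, and set its NER labels.

    `alab` is assumed to be an AnnotationLab instance, for example one built with
    `from sparknlp_jsl.alab import AnnotationLab; alab = AnnotationLab()`
    (the no-argument constructor is an assumption).
    """
    # Authenticate; annotationlab_url is left at its documented default.
    alab.set_credentials(
        username=username,
        password=password,
        client_secret=client_secret,
    )

    # Create the project, then configure the NER labels annotators can use.
    project = alab.create_project(
        project_name=project_name,
        project_description="Created from a script",  # placeholder text
    )
    config = alab.set_project_config(
        project_name=project_name,
        ner_labels=ner_labels,
    )
    return project, config
```

Both return values are the dicts documented for create_project and set_project_config; the sketch passes them through without inspecting them.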

create_project(project_name, project_description='', project_sampling='', project_instruction='')#

Creates a new project.

Parameters:
  • project_name (str) – Project Name

  • project_description (str) – Project Description

  • project_sampling (str) – Random or Sequential sampling while showing tasks to annotators

  • project_instruction (str) – Annotation Guidelines

Returns:

project

Return type:

dict

delete_project(project_name, confirm=False)#

Deletes an existing project.

Parameters:
  • project_name (str) – Project Name

  • confirm (boolean) – If True, the project is deleted without prompting for confirmation. Default: False.

Return type:

dict

delete_tasks(project_name, task_ids, confirm=False)#

Deletes tasks of a project in NLP Lab.

Parameters:
  • project_name (str) – Project Name

  • task_ids (list) – List of task IDs

  • confirm (boolean) – If True, the tasks are deleted without prompting for confirmation. Default: False.

Returns:

content

Return type:

dict

generate_preannotations(all_results: List[dict], document_column: str, ner_columns: List[str], assertion_columns: List[str] = [], relations_columns: List[str] = [], user_name: str = 'model', titles_list: List[str] = [], id_offset: int = 0) List[dict]#

Generates a JSON that can be imported directly into NLP Lab as pre-annotations.

Parameters:
  • all_results (list[objects]) – list of annotations on documents using light_pipeline.fullAnnotate or full_pipeline.transform(df).collect()

  • document_column (str) – output column name of DocumentAssembler stage

  • ner_columns (list[str]) – list of column names of ner chunks

  • assertion_columns (list[str]) – list of column names of assertion models

  • relations_columns (list[str]) – list of column names of relation models

  • user_name (str) – name of model(s). default: model

  • titles_list (list[str]) – custom list of titles of tasks in NLP Lab. Default: task_ID

  • id_offset (int) – When older tasks are already in NLP Lab, define the ID offset to avoid overwriting existing tasks. Default: 0
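A minimal sketch of how these parameters might be supplied, assuming a fitted Spark NLP LightPipeline is available. build_preannotations is a hypothetical helper, and the column names "document", "ner_chunk", and "assertion" are assumptions that must match the output columns of the actual pipeline.

```python
def build_preannotations(alab, light_pipeline, texts, id_offset=0):
    """Hypothetical helper: annotate raw texts and convert the results
    into NLP Lab pre-annotations."""
    # LightPipeline.fullAnnotate returns one result dict per input text.
    all_results = light_pipeline.fullAnnotate(texts)

    return alab.generate_preannotations(
        all_results=all_results,
        document_column="document",        # assumed DocumentAssembler output column
        ner_columns=["ner_chunk"],         # assumed NER chunk column
        assertion_columns=["assertion"],   # assumed assertion model column
        user_name="model",
        id_offset=id_offset,               # non-zero when NLP Lab already holds tasks
    )
```

The helper returns whatever generate_preannotations returns, so the result can be handed to upload_preannotations once its shape has been checked against the installed library version.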

get_IAA_metrics(spark: pyspark.sql.SparkSession, conll_dir: str, annotator_names: List[str], set_ref_annotator: str = None, return_NerDLMetrics: bool = False, save_dir: str = 'results_token_based')#

Gets IAA metrics for the annotator annotations.

Parameters:

  • spark (SparkSession) – Spark session

  • conll_dir (str) – path to the CoNLL files

  • annotator_names (list[str]) – list of annotator names

  • set_ref_annotator (str) – reference annotator name. If present, all comparisons are made with respect to it. Default: None.

  • return_NerDLMetrics (bool) – If True, returns the full chunk and partial chunk per token IAA dataframes by using NerDLMetrics. If False, returns the evaluate method results for chunk based comparisons and classification reports for token based comparisons. Default: False.

  • save_dir (str) – path to save token based results. Default: ‘results_token_based’.

get_all_projects()#

Gets a JSON object containing details of all visible projects in NLP Lab.

Return type:

dict

get_annotations(project_name, output_name, save_dir='.')#

Exports the annotations of a project in NLP Lab.

Parameters:
  • project_name (str) – Project Name

  • output_name (str) – name of the JSON file to write the result to.

  • save_dir (str) – directory in which to save the output JSON.

Returns:

content

Return type:

dict

get_assertion_data(spark: pyspark.sql.SparkSession, input_json_path: str, assertion_labels: List[str], relevant_ner_labels: List[str], ground_truth: bool = False, unannotated_label: str | None = None, regex_pattern: str | None = None, unannotated_label_strategy: str | None = None, unannotated_label_strategy_dict: dict | None = None, included_task_ids: List[int] | None = None, excluded_task_ids: List[int] | None = None, excluded_task_titles: List[str] | None = None, seed: int = None) pandas.DataFrame#

Generates a dataframe to train assertion models in Spark NLP from an NLP Lab JSON export.

Parameters:
  • spark (SparkSession) – Spark session with spark-nlp-jsl jar

  • input_json_path (str) – path to NLP Lab JSON export

  • assertion_labels (list[str]) – annotated assertion labels to train on

  • relevant_ner_labels (list[str]) – relevant NER labels that are assigned assertion labels

  • ground_truth (bool) – set to True to select ground truth completions, False to select latest completions, defaults to False

  • unannotated_label (str) – assertion label to assign to entities that have no assertion, defaults to None

  • regex_pattern (str) – set a pattern to use regex tokenizer, defaults to regular tokenizer if pattern not defined

  • unannotated_label_strategy (str) – strategy that controls the number of occurrences of the unannotated assertion label in the output dataframe. Options are ‘weighted’ (sample using a fraction) or ‘counts’ (sample using absolute counts). Defaults to None

  • unannotated_label_strategy_dict (dict) – dictionary in the format {‘ENTITY_LABEL’: sample_weight_or_counts} to control the number of occurrences of the unannotated assertion label in the output dataframe, where ‘ENTITY_LABEL’ are the NER labels that are assigned the unannotated assertion label, and sample_weight_or_counts should be between 0 and 1 if unannotated_label_strategy is ‘weighted’ or between 0 and the max number of occurrences of that NER label if unannotated_label_strategy is ‘counts’

  • excluded_task_ids (list) – list of NLP Lab task IDs to exclude from output dataframe, defaults to None

  • excluded_task_titles (list) – list of NLP Lab task titles to exclude from output dataframe, defaults to None

  • seed (int) – random seed that makes the sampled data reproducible across runs, defaults to None

Returns:

dataframe in appropriate layout for training assertion models

Return type:

pd.DataFrame
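The constraints on unannotated_label_strategy_dict can be checked before calling get_assertion_data. The following is a small sketch of such a pre-check; check_strategy_dict is a hypothetical helper, not part of the library, and it cannot verify the upper bound for ‘counts’ because that depends on the data.

```python
def check_strategy_dict(strategy, strategy_dict):
    """Hypothetical pre-check mirroring the documented value ranges."""
    if strategy not in ("weighted", "counts"):
        raise ValueError("strategy must be 'weighted' or 'counts'")

    for label, value in strategy_dict.items():
        if strategy == "weighted" and not 0 <= value <= 1:
            # 'weighted' expects a sampling fraction.
            raise ValueError(f"{label}: weight must be between 0 and 1")
        if strategy == "counts" and (not isinstance(value, int) or value < 0):
            # 'counts' expects an absolute, non-negative number of rows.
            raise ValueError(f"{label}: count must be a non-negative integer")
    return True
```

For example, check_strategy_dict("weighted", {"PROBLEM": 0.5}) passes, while a weight of 1.5 raises a ValueError. The label name PROBLEM is only a placeholder.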

get_classification_data(input_json_path, ground_truth=False)#

Generates a dataframe to train classification models.

Parameters:
  • input_json_path (str) – path of NLP Lab exported JSON

  • ground_truth (boolean) – set to True to select ground truth completions, False to select latest completions, defaults to False

get_conll_data(spark: pyspark.sql.SparkSession, input_json_path: str, output_name: str, save_dir: str = 'exported_conll', ground_truth: bool = False, excluded_labels: List[str] = None, excluded_task_ids: List[int] | None = None, excluded_task_titles: List[str] | None = None, regex_pattern: str = None) IO#

Generates a CoNLL file from an NLP Lab JSON export.

Parameters:
  • spark (SparkSession) – Spark session with spark-nlp-jsl jar

  • input_json_path (str) – path to NLP Lab JSON export

  • output_name (str) – name of the CoNLL file to save

  • save_dir (str) – path for CoNLL file saving directory, defaults to ‘exported_conll’

  • ground_truth (bool) – set to True to select ground truth completions, False to select latest completions, defaults to False

  • excluded_labels (list) – labels to exclude from CoNLL; these are all assertion labels and irrelevant NER labels, defaults to None

  • excluded_task_ids (list) – list of NLP Lab task IDs to exclude from CoNLL, defaults to None

  • excluded_task_titles (list) – list of NLP Lab task titles to exclude from CoNLL, defaults to None

  • regex_pattern (str) – set a pattern to use regex tokenizer, defaults to regular tokenizer if pattern not defined

Returns:

CoNLL file

Return type:

IO
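Since excluded_labels is meant to hold all assertion labels plus any NER labels that are not relevant, one way to assemble it is sketched below. export_ner_conll is a hypothetical wrapper, the output name "ner_train" is a placeholder, and alab and spark are assumed to be an AnnotationLab instance and a Spark session created elsewhere.

```python
def export_ner_conll(alab, spark, input_json_path, assertion_labels, irrelevant_ner_labels):
    """Hypothetical wrapper: build excluded_labels and export a CoNLL file."""
    # Everything that should not appear as an NER tag in the CoNLL output.
    excluded = sorted(set(assertion_labels) | set(irrelevant_ner_labels))

    return alab.get_conll_data(
        spark=spark,
        input_json_path=input_json_path,
        output_name="ner_train",        # placeholder file name
        save_dir="exported_conll",      # documented default directory
        ground_truth=True,              # use ground truth completions
        excluded_labels=excluded,
    )
```

Sorting the merged set only makes the call deterministic; the order of excluded_labels is not assumed to matter.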

get_conll_data_from_visualner(input_json_path: str, output_name: str = 'visualner', save_dir: str = 'exported_conll') IO#

Generates a CoNLL file from an NLP Lab Visual NER JSON export.

Parameters:
  • input_json_path (str) – path to NLP Lab JSON export

  • output_name (str) – name of the CoNLL file to save

  • save_dir (str) – path for CoNLL file saving directory, defaults to ‘exported_conll’

Returns:

CoNLL file

Return type:

IO

get_project_config(project_name)#

Gets configuration details of a project as JSON.

Parameters:

project_name (str) – Project Name

Returns:

Project Configuration

Return type:

dict

get_relation_extraction_data(spark: pyspark.sql.SparkSession, input_json_path: str, ground_truth: bool = False, negative_relations: bool = False, assertion_labels: List[str] | None = None, relations: List[str] | None = None, relation_pairs: List[str] | None = None, negative_relation_strategy: str | None = None, negative_relation_strategy_dict: dict | None = None, excluded_task_ids: int | None = None, excluded_task_titles: List[str] | None = None, doc_wise_annot: bool = False) pandas.DataFrame#

Generates a dataframe to train relation extraction models in Spark NLP from an NLP Lab JSON export.

Parameters:
  • spark (SparkSession) – Spark session with spark-nlp-jsl jar

  • input_json_path (str) – path to NLP Lab JSON export

  • ground_truth (bool) – set to True to select ground truth completions, False to select latest completions, defaults to False

  • negative_relations (bool) – set to True to assign a relation label between entities where no relation was annotated, defaults to False

  • assertion_labels (list) – all assertion labels that were annotated, defaults to None

  • relations (list) – names of the relations to include; all other relations are discarded, defaults to None

  • relation_pairs (list) – plausible pairs of entities for relations, separated by a -, use the same casing as the annotations, include only one relation direction, defaults to all possible pairs of annotated entities

  • negative_relation_strategy (str) – strategy that controls the number of occurrences of the negative relation label in the output dataframe. Options are ‘weighted’ (sample using a fraction) or ‘counts’ (sample using absolute counts). Defaults to None

  • negative_relation_strategy_dict (dict) – dictionary in the format {‘ENTITY1-ENTITY2’: sample_weight_or_counts} to control the number of occurrences of negative relations in the output dataframe for each entity pair, where ‘ENTITY1-ENTITY2’ represent the pairs of entities for relations separated by a - (include only one relation direction), and sample_weight_or_counts should be between 0 and 1 if negative_relation_strategy is ‘weighted’ or between 0 and the max number of occurrences of negative relations if negative_relation_strategy is ‘counts’, defaults to None

  • excluded_task_ids (list) – list of NLP Lab task IDs to exclude from output dataframe, defaults to None

  • excluded_task_titles (list) – list of NLP Lab task titles to exclude from output dataframe, defaults to None

  • doc_wise_annot (bool) – set to True to return the dataframe with document-wise annotations, including those made across sentences, defaults to False

Returns:

dataframe in appropriate layout for training relation extraction models

Return type:

pd.DataFrame
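The relation_pairs format (‘ENTITY1-ENTITY2’, one direction per pair) can be generated from a list of entity labels. The sketch below is one way to do that with the standard library; build_relation_pairs is a hypothetical helper, and the alphabetical ordering within each pair is an arbitrary choice made only to guarantee a single direction.

```python
from itertools import combinations


def build_relation_pairs(entity_labels):
    """Hypothetical helper: one 'A-B' string per unordered pair of distinct labels."""
    # De-duplicate and sort so each pair is emitted in exactly one direction.
    unique_labels = sorted(set(entity_labels))
    return [f"{first}-{second}" for first, second in combinations(unique_labels, 2)]
```

For instance, build_relation_pairs(["DRUG", "DOSAGE", "FREQUENCY"]) yields ["DOSAGE-DRUG", "DOSAGE-FREQUENCY", "DRUG-FREQUENCY"]. Pairs of a label with itself are not produced, and the labels must use the same casing as the annotations.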

review_annotations(spark: pyspark.sql.SparkSession, conll_path: str, ner_model_path: str)#

Generates review files from a CoNLL file. It predicts the NER tags for the CoNLL file, compares the predicted tags with the original CoNLL tags, and generates the review files.

Parameters:
  • spark (SparkSession) – spark session

  • conll_path (str) – path to conll file

  • ner_model_path (str) – path to ner model

Returns:

tuple containing chunk_based_comparison_df (DataFrame), token_based_comparision_df (DataFrame), chunk_frequency_df (DataFrame), metrics_result (Dict)

Return type:

tuple
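Because the result is a four-element tuple, it can be convenient to unpack it into named entries. The sketch below assumes the tuple order given in the Returns description; run_review and the dictionary keys are hypothetical, and the dataframes are passed through untouched because their concrete type is not stated here.

```python
def run_review(alab, spark, conll_path, ner_model_path):
    """Hypothetical wrapper: unpack the review tuple into a labelled dict."""
    chunk_df, token_df, chunk_freq_df, metrics = alab.review_annotations(
        spark=spark,
        conll_path=conll_path,
        ner_model_path=ner_model_path,
    )
    return {
        "chunk_based": chunk_df,           # chunk-level comparison
        "token_based": token_df,           # token-level comparison
        "chunk_frequency": chunk_freq_df,  # chunk frequency table
        "metrics": metrics,                # metrics dict
    }
```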

set_credentials(username, password, client_secret, annotationlab_url='https://annotationlab.johnsnowlabs.com')#

Sets credentials to connect to your NLP Lab instance.

Parameters:
  • username (str) – Username

  • password (str) – Password

  • client_secret (str) – Client Secret for your installation (Every installation of NLP Lab has a secret code).

  • annotationlab_url (str) – URL of NLP Lab. Default: JSL’s NLP Lab (https://annotationlab.johnsnowlabs.com).

set_project_config(project_name, classification_labels=[], ner_labels=[], assertion_labels=[], relations_labels=[])#

Sets configuration of a project.

Parameters:
  • project_name (str) – Project Name

  • classification_labels (List) – List of document classification classes. Classification is multiclass by default.

  • ner_labels (List) – List of NER classes

  • assertion_labels (List) – List of assertion classes

  • relations_labels (List) – List of different types of relations.

Returns:

content

Return type:

dict

upload_preannotations(project_name, preannotations)#

Uploads preannotations to a project in NLP Lab.

Parameters:
  • project_name (str) – Project Name

  • preannotations (List) – preannotation JSON generated by ‘generate_preannotations’ function

Return type:

dict

upload_tasks(project_name, task_list, title_list=[], id_offset=0)#

Uploads tasks to a project in NLP Lab.

Parameters:
  • project_name (str) – Project Name

  • task_list (List) – List of documents (text).

  • title_list (List) – Optional custom titles for each task. If defined, its length should be equal to the number of tasks

  • id_offset (int) – Increment offset for document ID. Useful when uploading in batches. Default: 0.

Returns:

content

Return type:

dict
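The id_offset parameter is described as useful for batched uploads. A possible batching loop is sketched below; upload_in_batches is a hypothetical helper, the "doc_N" titles are placeholders, and the sketch assumes that task IDs within one call are assigned consecutively starting from id_offset, which should be verified against the installed version.

```python
def upload_in_batches(alab, project_name, documents, batch_size=50):
    """Hypothetical helper: upload documents in fixed-size batches."""
    responses = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # One title per task, as required when title_list is defined.
        titles = [f"doc_{start + i}" for i in range(len(batch))]
        responses.append(
            alab.upload_tasks(
                project_name=project_name,
                task_list=batch,
                title_list=titles,
                id_offset=start,  # assumed to shift IDs past earlier batches
            )
        )
    return responses
```

The function returns one response dict per batch, in upload order.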