sparknlp_jsl.utils.alab_utils#

Utility functions for the alab (Annotation Lab) module

Module Contents#

Functions#

get_doc_assembler(→ sparknlp.base.DocumentAssembler)

get_empty_df(→ pyspark.sql.DataFrame)

get_ner_df(input_json_path[, ground_truth, ...])

Generates a dataframe of all annotated entities from an Annotation Lab JSON export

get_ner_sentence_borders(spark, input_json_path[, ...])

Gets sentence borders for each task

get_nlp_pos_pipeline(spark, regex_pattern)

Generates a LightPipeline with a document assembler, sentence detector, regular tokenizer and POS-tagger

get_nlp_token_pipeline(spark, regex_pattern)

Generates a LightPipeline with a document assembler, sentence detector and regular tokenizer

get_pos(→ sparknlp.annotator.PerceptronModel)

get_regex_tokenizer(→ sparknlp.annotator.RegexTokenizer)

get_regular_tokenizer(→ sparknlp.annotator.Tokenizer)

get_rel_df(input_json_path[, ground_truth])

Generates a dataframe of all annotated relations from an Annotation Lab JSON export

get_sent_detector(...)

get_sentence_pipeline(spark[, doc_wise_annot])

Generates a LightPipeline with a document assembler and sentence detector

get_single_task_conll(output, pos_pipeline, token_pipeline)

Generates CoNLL-style outputs from a single task in an Annotation Lab JSON export

get_token_df(text)

Generates a Pandas DataFrame of tokens with their begin and end indexes from input text

read_conll(spark, filepath)

Reads a CoNLL file into a Spark DataFrame

strip_json_entities(exported_json)

Iterates through entities, strips the text, and updates start and end positions

Attributes#

document_assembler

empty_df

nlp_pos_pipeline

nlp_token_pipeline

pos

pos_pipeline_initialized

regex_tokenizer

regular_tokenizer

sentence_detector

token_pipeline_initialized

get_doc_assembler(doc_wise_annot=False) sparknlp.base.DocumentAssembler#
get_empty_df(spark) pyspark.sql.DataFrame#
get_ner_df(input_json_path, ground_truth=False, assertion_labels=None)#

Generates a dataframe of all annotated entities from an Annotation Lab JSON export.

Parameters:

  • input_json_path (str) – path to an Annotation Lab JSON export

  • ground_truth (bool) – set to True to select ground truth completions, or False to select the latest completions; defaults to False

  • assertion_labels (list) – all assertion labels that were annotated; defaults to None

Returns:

dataframe of all annotated entities

Return type:

pd.DataFrame
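
A minimal usage sketch; the export path is a placeholder and the assertion labels are illustrative, not values required by the function:

    from sparknlp_jsl.utils.alab_utils import get_ner_df

    # "alab_export.json" is a hypothetical path to an Annotation Lab JSON export
    ner_df = get_ner_df(
        input_json_path="alab_export.json",
        ground_truth=True,                       # use ground truth completions
        assertion_labels=["Present", "Absent"],  # example assertion labels
    )
    print(ner_df.head())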

get_ner_sentence_borders(spark, input_json_path, doc_wise_annot=False, ground_truth=False, assertion_labels=None)#

Gets sentence borders for each task.

Parameters:

  • spark (SparkSession) – Spark session with the spark-nlp-jsl jar

  • input_json_path (str) – path to an Annotation Lab JSON export

  • doc_wise_annot (bool) – if True, returns the dataframe with document-wise annotations

  • ground_truth (bool) – set to True to select ground truth completions, or False to select the latest completions; defaults to False

  • assertion_labels (list) – all assertion labels that were annotated; defaults to None

Returns:

dataframe with tasks split into sentences, including start and end indexes

Return type:

pd.DataFrame
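
A sketch assuming spark is an active SparkSession carrying the spark-nlp-jsl jar (for example, one returned by sparknlp_jsl.start()), with a placeholder export path and a hypothetical license secret:

    import sparknlp_jsl
    from sparknlp_jsl.utils.alab_utils import get_ner_sentence_borders

    spark = sparknlp_jsl.start("<license secret>")  # hypothetical license secret

    borders_df = get_ner_sentence_borders(
        spark,
        input_json_path="alab_export.json",  # hypothetical export path
        doc_wise_annot=False,                # keep sentence-level splitting
        ground_truth=True,
    )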

get_nlp_pos_pipeline(spark, regex_pattern)#

Generates a LightPipeline with a document assembler, sentence detector, regular tokenizer and POS-tagger.

Parameters:

  • spark (SparkSession) – Spark session with the spark-nlp-jsl jar

  • regex_pattern (str) – pattern for the regex tokenizer; the regular tokenizer is used if no pattern is defined

Returns:

LightPipeline with a document assembler, sentence detector, a regular or regex tokenizer depending on regex_pattern, and a POS-tagger

Return type:

LightPipeline
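
A sketch; it assumes spark is a SparkSession with the spark-nlp-jsl jar, and the whitespace pattern below is only an example:

    from sparknlp_jsl.utils.alab_utils import get_nlp_pos_pipeline

    # With a pattern set, the pipeline uses the regex tokenizer; per the
    # docstring, the regular tokenizer is used when no pattern is defined.
    pos_pipe = get_nlp_pos_pipeline(spark, regex_pattern=r"\s+")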

get_nlp_token_pipeline(spark, regex_pattern)#

Generates a LightPipeline with a document assembler, sentence detector and regular tokenizer.

Parameters:

  • spark (SparkSession) – Spark session with the spark-nlp-jsl jar

  • regex_pattern (str) – pattern for the regex tokenizer; the regular tokenizer is used if no pattern is defined

Returns:

LightPipeline with a document assembler, sentence detector, and a regular or regex tokenizer depending on regex_pattern

Return type:

LightPipeline
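
Same assumptions as above; passing None for regex_pattern to fall back to the regular tokenizer is an inference from the docstring, not a documented contract:

    from sparknlp_jsl.utils.alab_utils import get_nlp_token_pipeline

    token_pipe = get_nlp_token_pipeline(spark, regex_pattern=None)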

get_pos() sparknlp.annotator.PerceptronModel#
get_regex_tokenizer(regex_pattern) sparknlp.annotator.RegexTokenizer#
get_regular_tokenizer() sparknlp.annotator.Tokenizer#
get_rel_df(input_json_path, ground_truth=False)#

Generates a dataframe of all annotated relations from an Annotation Lab JSON export.

Parameters:

  • input_json_path (str) – path to an Annotation Lab JSON export

  • ground_truth (bool) – set to True to select ground truth completions, or False to select the latest completions; defaults to False

Returns:

dataframe of all annotated relations

Return type:

pd.DataFrame
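
A minimal sketch with a placeholder export path:

    from sparknlp_jsl.utils.alab_utils import get_rel_df

    rel_df = get_rel_df("alab_export.json", ground_truth=True)  # hypothetical path
    print(rel_df.head())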

get_sent_detector() sparknlp.annotator.SentenceDetectorDLModel#
get_sentence_pipeline(spark, doc_wise_annot=False)#

Generates a LightPipeline with a document assembler and sentence detector.

Parameters:

  • spark (SparkSession) – Spark session with the spark-nlp-jsl jar

  • doc_wise_annot (bool) – if True, the sentence detector is dropped

Returns:

LightPipeline with a document assembler and sentence detector

Return type:

LightPipeline
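
A sketch assuming an existing spark-nlp-jsl SparkSession:

    from sparknlp_jsl.utils.alab_utils import get_sentence_pipeline

    # doc_wise_annot=True would drop the sentence detector, so each document
    # is treated as a single unit instead of being split into sentences.
    sent_pipe = get_sentence_pipeline(spark, doc_wise_annot=False)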

get_single_task_conll(output, pos_pipeline, token_pipeline, ground_truth=False, excluded_labels=None, excluded_task_ids=None, excluded_task_titles=None)#

Generates CoNLL-style outputs from a single task in an Annotation Lab JSON export.

Parameters:

  • output (dict) – enumerated items from an Annotation Lab JSON export

  • pos_pipeline (LightPipeline) – LightPipeline with a document assembler, sentence detector, tokenizer and POS-tagger

  • token_pipeline (LightPipeline) – LightPipeline with a document assembler, sentence detector and tokenizer

  • ground_truth (bool) – set to True to select ground truth completions, or False to select the latest completions; defaults to False

  • excluded_labels (list) – labels to exclude from the CoNLL output; these are all assertion labels and irrelevant NER labels; defaults to None

  • excluded_task_ids (list) – Annotation Lab task IDs to exclude from the CoNLL output; defaults to None

  • excluded_task_titles (list) – Annotation Lab task titles to exclude from the CoNLL output; defaults to None

Returns:

CoNLL lines

Return type:

list
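
A heavily hedged sketch: the exact shape of output is only described as "enumerated items from an Annotation Lab JSON export", so this assumes each item yielded by enumerate() over the exported task list is passed in turn; the export path is a placeholder:

    import json

    from sparknlp_jsl.utils.alab_utils import (
        get_nlp_pos_pipeline,
        get_nlp_token_pipeline,
        get_single_task_conll,
    )

    pos_pipe = get_nlp_pos_pipeline(spark, regex_pattern=None)    # None is assumed valid
    token_pipe = get_nlp_token_pipeline(spark, regex_pattern=None)

    with open("alab_export.json") as f:  # hypothetical export path
        tasks = json.load(f)

    conll_lines = []
    for item in enumerate(tasks):        # "enumerated items" per the docstring
        conll_lines.extend(
            get_single_task_conll(item, pos_pipe, token_pipe, ground_truth=True)
        )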

get_token_df(text)#
Parameters:

text (str) – input text

Returns:

Pandas DataFrame with columns (‘token’, ‘begin’, ‘end’)

Return type:

pd.DataFrame
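
A minimal sketch; the sentence is arbitrary:

    from sparknlp_jsl.utils.alab_utils import get_token_df

    token_df = get_token_df("The patient was prescribed aspirin.")
    print(token_df)  # Pandas DataFrame with 'token', 'begin' and 'end' columns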

read_conll(spark, filepath)#
Parameters:
  • spark (SparkSession) – Spark session with the spark-nlp-jsl jar

  • filepath (str) – path to the CoNLL file

Returns:

Spark DataFrame

Return type:

pyspark.sql.DataFrame
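
A sketch with a placeholder file path, assuming a session started through sparknlp_jsl.start():

    import sparknlp_jsl
    from sparknlp_jsl.utils.alab_utils import read_conll

    spark = sparknlp_jsl.start("<license secret>")      # hypothetical secret

    conll_sdf = read_conll(spark, "annotations.conll")  # hypothetical CoNLL path
    conll_sdf.show(5)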

strip_json_entities(exported_json)#

Iterates through entities, strips the text, and updates start and end positions in the JSON file exported from Annotation Lab.

Parameters:

exported_json (dict) – JSON export from Annotation Lab

Returns:

updated JSON export with stripped text and adjusted positions

Return type:

dict
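
A sketch with a placeholder export path:

    import json

    from sparknlp_jsl.utils.alab_utils import strip_json_entities

    with open("alab_export.json") as f:  # hypothetical export path
        exported = json.load(f)

    cleaned = strip_json_entities(exported)  # text stripped, offsets updated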

document_assembler :sparknlp.base.DocumentAssembler = None#
empty_df :pyspark.sql.DataFrame = None#
nlp_pos_pipeline :sparknlp.base.LightPipeline = None#
nlp_token_pipeline :sparknlp.base.LightPipeline = None#
pos :sparknlp.annotator.PerceptronModel = None#
pos_pipeline_initialized = False#
regex_tokenizer :sparknlp.annotator.RegexTokenizer = None#
regular_tokenizer :sparknlp.annotator.Tokenizer = None#
sentence_detector :sparknlp.annotator.SentenceDetectorDLModel = None#
token_pipeline_initialized = False#