sparknlp_jsl.utils.alab_utils#

Utility functions for the alab (Annotation Lab) module

Module Contents#

Functions#

get_doc_assembler(→ sparknlp.base.DocumentAssembler)

get_empty_df(→ pyspark.sql.DataFrame)

get_ner_df(input_json_path[, ground_truth, ...])

Generates a dataframe of all annotated entities from an Annotation Lab JSON export

get_ner_sentence_borders(spark, input_json_path[, ...])

Gets sentence borders for each task

get_nlp_pos_pipeline(spark, regex_pattern)

Generates a LightPipeline with a document assembler, sentence detector, regular tokenizer and POS-tagger

get_nlp_token_pipeline(spark, regex_pattern)

Generates a LightPipeline with a document assembler, sentence detector and regular tokenizer

get_pos(→ sparknlp.annotator.PerceptronModel)

get_regex_tokenizer(→ sparknlp.annotator.RegexTokenizer)

get_regular_tokenizer(→ sparknlp.annotator.Tokenizer)

get_rel_df(input_json_path[, ground_truth])

Generates a dataframe of all annotated relations from an Annotation Lab JSON export

get_sent_detector(...)

get_sentence_pipeline(spark[, doc_wise_annot])

Generates a LightPipeline with a document assembler and sentence detector

get_single_task_conll(output, pos_pipeline, token_pipeline)

Generates CoNLL-style outputs from a single task in an Annotation Lab JSON export

get_token_df(text)

Generates a Pandas DataFrame of tokens with their begin and end indexes from input text

read_conll(spark, filepath)

Reads a CoNLL file into a Spark DataFrame

strip_json_entities(exported_json)

Iterates through entities, strips the text, and updates start and end positions

Attributes#

document_assembler

empty_df

nlp_pos_pipeline

nlp_token_pipeline

pos

pos_pipeline_initialized

regex_tokenizer

regular_tokenizer

sentence_detector

token_pipeline_initialized

get_doc_assembler(doc_wise_annot=False) sparknlp.base.DocumentAssembler#
get_empty_df(spark) pyspark.sql.DataFrame#
get_ner_df(input_json_path, ground_truth=False, assertion_labels=None)#

Generates a dataframe of all annotated entities from an Annotation Lab JSON export.

Parameters:

  • input_json_path (str) – path to an Annotation Lab JSON export

  • ground_truth (bool) – set to True to select ground truth completions, or False to select the latest completions; defaults to False

  • assertion_labels (list) – all assertion labels that were annotated; defaults to None

Returns:

dataframe of all annotated entities

Return type:

pd.DataFrame
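
A minimal usage sketch; the export path is a placeholder and the assertion labels are illustrative, not values required by the function:

    from sparknlp_jsl.utils.alab_utils import get_ner_df

    # "alab_export.json" is a hypothetical path to an Annotation Lab JSON export
    ner_df = get_ner_df(
        input_json_path="alab_export.json",
        ground_truth=True,                       # use ground truth completions
        assertion_labels=["Present", "Absent"],  # example assertion labels
    )
    print(ner_df.head())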

get_ner_sentence_borders(spark, input_json_path, doc_wise_annot=False, ground_truth=False, assertion_labels=None)#

Gets sentence borders for each task.

Parameters:

  • spark (SparkSession) – Spark session with the spark-nlp-jsl jar

  • input_json_path (str) – path to an Annotation Lab JSON export

  • doc_wise_annot (bool) – if True, returns the dataframe with document-wise annotations

  • ground_truth (bool) – set to True to select ground truth completions, or False to select the latest completions; defaults to False

  • assertion_labels (list) – all assertion labels that were annotated; defaults to None

Returns:

dataframe with tasks split into sentences, including start and end indexes

Return type:

pd.DataFrame
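
A sketch assuming spark is an active SparkSession carrying the spark-nlp-jsl jar (for example, one returned by sparknlp_jsl.start()), with a placeholder export path and a hypothetical license secret:

    import sparknlp_jsl
    from sparknlp_jsl.utils.alab_utils import get_ner_sentence_borders

    spark = sparknlp_jsl.start("<license secret>")  # hypothetical license secret

    borders_df = get_ner_sentence_borders(
        spark,
        input_json_path="alab_export.json",  # hypothetical export path
        doc_wise_annot=False,                # keep sentence-level splitting
        ground_truth=True,
    )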

get_nlp_pos_pipeline(spark, regex_pattern)#

Generates a LightPipeline with a document assembler, sentence detector, regular tokenizer and POS-tagger.

Parameters:

  • spark (SparkSession) – Spark session with the spark-nlp-jsl jar

  • regex_pattern (str) – pattern for the regex tokenizer; the regular tokenizer is used if no pattern is defined

Returns:

LightPipeline with a document assembler, sentence detector, a regular or regex tokenizer depending on regex_pattern, and a POS-tagger

Return type:

LightPipeline
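
A sketch; it assumes spark is a SparkSession with the spark-nlp-jsl jar, and the whitespace pattern below is only an example:

    from sparknlp_jsl.utils.alab_utils import get_nlp_pos_pipeline

    # With a pattern set, the pipeline uses the regex tokenizer; per the
    # docstring, the regular tokenizer is used when no pattern is defined.
    pos_pipe = get_nlp_pos_pipeline(spark, regex_pattern=r"\s+")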

get_nlp_token_pipeline(spark, regex_pattern)#

Generates a LightPipeline with a document assembler, sentence detector and regular tokenizer.

Parameters:

  • spark (SparkSession) – Spark session with the spark-nlp-jsl jar

  • regex_pattern (str) – pattern for the regex tokenizer; the regular tokenizer is used if no pattern is defined

Returns:

LightPipeline with a document assembler, sentence detector, and a regular or regex tokenizer depending on regex_pattern

Return type:

LightPipeline
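
Same assumptions as above; passing None for regex_pattern to fall back to the regular tokenizer is an inference from the docstring, not a documented contract:

    from sparknlp_jsl.utils.alab_utils import get_nlp_token_pipeline

    token_pipe = get_nlp_token_pipeline(spark, regex_pattern=None)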

get_pos() sparknlp.annotator.PerceptronModel#
get_regex_tokenizer(regex_pattern) sparknlp.annotator.RegexTokenizer#
get_regular_tokenizer() sparknlp.annotator.Tokenizer#
get_rel_df(input_json_path, ground_truth=False)#

Generates a dataframe of all annotated relations from an Annotation Lab JSON export.

Parameters:

  • input_json_path (str) – path to an Annotation Lab JSON export

  • ground_truth (bool) – set to True to select ground truth completions, or False to select the latest completions; defaults to False

Returns:

dataframe of all annotated relations

Return type:

pd.DataFrame
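
A minimal sketch with a placeholder export path:

    from sparknlp_jsl.utils.alab_utils import get_rel_df

    rel_df = get_rel_df("alab_export.json", ground_truth=True)  # hypothetical path
    print(rel_df.head())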

get_sent_detector() sparknlp.annotator.SentenceDetectorDLModel#
get_sentence_pipeline(spark, doc_wise_annot=False)#

Generates a LightPipeline with a document assembler and sentence detector.

Parameters:

  • spark (SparkSession) – Spark session with the spark-nlp-jsl jar

  • doc_wise_annot (bool) – if True, the sentence detector is dropped

Returns:

LightPipeline with a document assembler and sentence detector

Return type:

LightPipeline
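
A sketch assuming an existing spark-nlp-jsl SparkSession:

    from sparknlp_jsl.utils.alab_utils import get_sentence_pipeline

    # doc_wise_annot=True would drop the sentence detector, so each document
    # is treated as a single unit instead of being split into sentences.
    sent_pipe = get_sentence_pipeline(spark, doc_wise_annot=False)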

get_single_task_conll(output, pos_pipeline, token_pipeline, ground_truth=False, excluded_labels=None, excluded_task_ids=None, excluded_task_titles=None)#

Generates CoNLL-style outputs from a single task in an Annotation Lab JSON export.

Parameters:

  • output (dict) – enumerated items from an Annotation Lab JSON export

  • pos_pipeline (LightPipeline) – LightPipeline with a document assembler, sentence detector, tokenizer and POS-tagger

  • token_pipeline (LightPipeline) – LightPipeline with a document assembler, sentence detector and tokenizer

  • ground_truth (bool) – set to True to select ground truth completions, or False to select the latest completions; defaults to False

  • excluded_labels (list) – labels to exclude from the CoNLL output; these are all assertion labels and irrelevant NER labels; defaults to None

  • excluded_task_ids (list) – Annotation Lab task IDs to exclude from the CoNLL output; defaults to None

  • excluded_task_titles (list) – Annotation Lab task titles to exclude from the CoNLL output; defaults to None

Returns:

CoNLL lines

Return type:

list
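
A heavily hedged sketch: the exact shape of output is only described as "enumerated items from an Annotation Lab JSON export", so this assumes each item yielded by enumerate() over the exported task list is passed in turn; the export path is a placeholder:

    import json

    from sparknlp_jsl.utils.alab_utils import (
        get_nlp_pos_pipeline,
        get_nlp_token_pipeline,
        get_single_task_conll,
    )

    pos_pipe = get_nlp_pos_pipeline(spark, regex_pattern=None)    # None is assumed valid
    token_pipe = get_nlp_token_pipeline(spark, regex_pattern=None)

    with open("alab_export.json") as f:  # hypothetical export path
        tasks = json.load(f)

    conll_lines = []
    for item in enumerate(tasks):        # "enumerated items" per the docstring
        conll_lines.extend(
            get_single_task_conll(item, pos_pipe, token_pipe, ground_truth=True)
        )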

get_token_df(text)#
Parameters:

text (str) – input text

Returns:

Pandas DataFrame with columns (‘token’, ‘begin’, ‘end’)

Return type:

pd.DataFrame
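
A minimal sketch; the sentence is arbitrary:

    from sparknlp_jsl.utils.alab_utils import get_token_df

    token_df = get_token_df("The patient was prescribed aspirin.")
    print(token_df)  # Pandas DataFrame with 'token', 'begin' and 'end' columns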

read_conll(spark, filepath)#
Parameters:
  • spark (SparkSession) – Spark session with the spark-nlp-jsl jar

  • filepath (str) – path to the CoNLL file

Returns:

Spark DataFrame

Return type:

pyspark.sql.DataFrame
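
A sketch with a placeholder file path, assuming a session started through sparknlp_jsl.start():

    import sparknlp_jsl
    from sparknlp_jsl.utils.alab_utils import read_conll

    spark = sparknlp_jsl.start("<license secret>")      # hypothetical secret

    conll_sdf = read_conll(spark, "annotations.conll")  # hypothetical CoNLL path
    conll_sdf.show(5)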

strip_json_entities(exported_json)#

Iterates through entities, strips the text, and updates start and end positions in the JSON file exported from Annotation Lab.

Parameters:

exported_json (dict) – JSON export from Annotation Lab

Returns:

updated JSON export with stripped text and adjusted positions

Return type:

dict
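
A sketch with a placeholder export path:

    import json

    from sparknlp_jsl.utils.alab_utils import strip_json_entities

    with open("alab_export.json") as f:  # hypothetical export path
        exported = json.load(f)

    cleaned = strip_json_entities(exported)  # text stripped, offsets updated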

document_assembler :sparknlp.base.DocumentAssembler = None#
empty_df :pyspark.sql.DataFrame = None#
nlp_pos_pipeline :sparknlp.base.LightPipeline = None#
nlp_token_pipeline :sparknlp.base.LightPipeline = None#
pos :sparknlp.annotator.PerceptronModel = None#
pos_pipeline_initialized = False#
regex_tokenizer :sparknlp.annotator.RegexTokenizer = None#
regular_tokenizer :sparknlp.annotator.Tokenizer = None#
sentence_detector :sparknlp.annotator.SentenceDetectorDLModel = None#
token_pipeline_initialized = False#