sparknlp_jsl.utils.alab_utils#
Utility functions for the alab module.
Module Contents#
Functions#
| Function | Description |
| --- | --- |
| `get_doc_assembler` | Returns a DocumentAssembler |
| `get_empty_df` | Returns an empty Spark DataFrame |
| `get_ner_df` | Generates a dataframe for all annotated entities from an Annotation Lab JSON export |
| `get_ner_sentence_borders` | Gets sentence borders for each task |
| `get_nlp_pos_pipeline` | Generates a LightPipeline with a document assembler, sentence detector, regular tokenizer and POS-tagger |
| `get_nlp_token_pipeline` | Generates a LightPipeline with a document assembler, sentence detector and regular tokenizer |
| `get_pos` | Returns a PerceptronModel POS-tagger |
| `get_regex_tokenizer` | Returns a RegexTokenizer |
| `get_regular_tokenizer` | Returns a regular Tokenizer |
| `get_rel_df` | Generates a dataframe for all annotated relations from an Annotation Lab JSON export |
| `get_sent_detector` | Returns a SentenceDetectorDLModel |
| `get_sentence_pipeline` | Generates a LightPipeline with a document assembler and sentence detector |
| `get_single_task_conll` | Generates CoNLL-style outputs from a single task in an Annotation Lab JSON export |
| `get_token_df` | Generates a token dataframe for the input text |
| `read_conll` | Reads a CoNLL file into a Spark DataFrame |
| `strip_json_entities` | Iterates through entities, strips the text, and updates start and end positions in the exported JSON |
Attributes#
| Attribute | Type |
| --- | --- |
| `document_assembler` | sparknlp.base.DocumentAssembler |
| `empty_df` | pyspark.sql.DataFrame |
| `nlp_pos_pipeline` | sparknlp.base.LightPipeline |
| `nlp_token_pipeline` | sparknlp.base.LightPipeline |
| `pos` | sparknlp.annotator.PerceptronModel |
| `pos_pipeline_initialized` | bool |
| `regex_tokenizer` | sparknlp.annotator.RegexTokenizer |
| `regular_tokenizer` | sparknlp.annotator.Tokenizer |
| `sentence_detector` | sparknlp.annotator.SentenceDetectorDLModel |
| `token_pipeline_initialized` | bool |
- get_doc_assembler(doc_wise_annot=False) → sparknlp.base.DocumentAssembler#
- get_empty_df(spark) → pyspark.sql.DataFrame#
- get_ner_df(input_json_path, ground_truth=False, assertion_labels=None)#
Generates a dataframe for all annotated entities from an Annotation Lab JSON export.
- Parameters:
input_json_path (str) – path to the Annotation Lab JSON export
ground_truth (bool) – set to True to select ground truth completions, False to select the latest completions; defaults to False
assertion_labels (list) – all assertion labels that were annotated; defaults to None
- Returns:
dataframe with all annotated entities
- Return type:
pd.DataFrame
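A minimal usage sketch; the export path and assertion labels below are hypothetical placeholders for project-specific values:

```python
from sparknlp_jsl.utils.alab_utils import get_ner_df

ner_df = get_ner_df(
    input_json_path="alab_export.json",      # hypothetical export path
    ground_truth=True,                       # prefer ground-truth completions
    assertion_labels=["Present", "Absent"],  # hypothetical assertion labels
)
print(ner_df.head())  # one row per annotated entity
```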
- get_ner_sentence_borders(spark, input_json_path, doc_wise_annot=False, ground_truth=False, assertion_labels=None)#
Gets sentence borders for each task.
- Parameters:
spark (SparkSession) – Spark session with the spark-nlp-jsl jar
input_json_path (str) – path to the Annotation Lab JSON export
doc_wise_annot (bool) – if True, returns the dataframe with document-wise annotations
ground_truth (bool) – set to True to select ground truth completions, False to select the latest completions; defaults to False
assertion_labels (list) – all assertion labels that were annotated; defaults to None
- Returns:
dataframe with tasks split into sentences, including start and end indexes
- Return type:
pd.DataFrame
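A sketch under the same assumptions; the `sparknlp_jsl.start` call stands in for however the licensed Spark session is created in your environment:

```python
import sparknlp_jsl
from sparknlp_jsl.utils.alab_utils import get_ner_sentence_borders

spark = sparknlp_jsl.start("<license-secret>")  # assumed session setup

borders_df = get_ner_sentence_borders(
    spark=spark,
    input_json_path="alab_export.json",  # hypothetical export path
    ground_truth=False,                  # keep the latest completions
)
print(borders_df.head())  # tasks split into sentences with start/end indexes
```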
- get_nlp_pos_pipeline(spark, regex_pattern)#
Generates a LightPipeline with a document assembler, sentence detector, regular tokenizer and POS-tagger.
- Parameters:
spark (SparkSession) – Spark session with the spark-nlp-jsl jar
regex_pattern (str) – pattern for the regex tokenizer; defaults to the regular tokenizer if no pattern is defined
- Returns:
LightPipeline with a document assembler, sentence detector, a regular or regex tokenizer (depending on regex_pattern) and a POS-tagger
- Return type:
LightPipeline
- get_nlp_token_pipeline(spark, regex_pattern)#
Generates a LightPipeline with a document assembler, sentence detector and regular tokenizer.
- Parameters:
spark (SparkSession) – Spark session with the spark-nlp-jsl jar
regex_pattern (str) – pattern for the regex tokenizer; defaults to the regular tokenizer if no pattern is defined
- Returns:
LightPipeline with a document assembler, sentence detector and a regular or regex tokenizer (depending on regex_pattern)
- Return type:
LightPipeline
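A sketch showing both pipeline builders above. Passing `regex_pattern=None` to select the regular tokenizer is an assumption based on the docstrings; the license secret, example pattern, and output key are placeholders:

```python
import sparknlp_jsl
from sparknlp_jsl.utils.alab_utils import get_nlp_pos_pipeline, get_nlp_token_pipeline

spark = sparknlp_jsl.start("<license-secret>")  # assumed session setup

# Regular tokenization (assumed: no pattern falls back to the regular tokenizer).
token_pipeline = get_nlp_token_pipeline(spark=spark, regex_pattern=None)
pos_pipeline = get_nlp_pos_pipeline(spark=spark, regex_pattern=None)

# Regex tokenization: split tokens on whitespace (example pattern).
regex_pipeline = get_nlp_token_pipeline(spark=spark, regex_pattern="\\s+")

# LightPipelines annotate plain strings directly; the output keys depend on
# the pipeline's output column names ("token" is assumed here).
print(token_pipeline.annotate("The patient was admitted on Monday.")["token"])
```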
- get_pos() → sparknlp.annotator.PerceptronModel#
- get_regex_tokenizer(regex_pattern) → sparknlp.annotator.RegexTokenizer#
- get_regular_tokenizer() → sparknlp.annotator.Tokenizer#
- get_rel_df(input_json_path, ground_truth=False)#
Generates a dataframe for all annotated relations from an Annotation Lab JSON export.
- Parameters:
input_json_path (str) – path to the Annotation Lab JSON export
ground_truth (bool) – set to True to select ground truth completions, False to select the latest completions; defaults to False
- Returns:
dataframe with all annotated relations
- Return type:
pd.DataFrame
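A minimal sketch with a hypothetical export path:

```python
from sparknlp_jsl.utils.alab_utils import get_rel_df

rel_df = get_rel_df(
    input_json_path="alab_export.json",  # hypothetical export path
    ground_truth=True,
)
print(rel_df.head())  # one row per annotated relation
```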
- get_sent_detector() → sparknlp.annotator.SentenceDetectorDLModel#
- get_sentence_pipeline(spark, doc_wise_annot=False)#
Generates a LightPipeline with a document assembler and sentence detector.
- Parameters:
spark (SparkSession) – Spark session with the spark-nlp-jsl jar
doc_wise_annot (bool) – if True, drops the sentence detector
- Returns:
LightPipeline with a document assembler and sentence detector
- Return type:
LightPipeline
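A sketch; the output key `sentence` and the session setup are assumptions:

```python
import sparknlp_jsl
from sparknlp_jsl.utils.alab_utils import get_sentence_pipeline

spark = sparknlp_jsl.start("<license-secret>")  # assumed session setup

# With doc_wise_annot=True the sentence detector is dropped, so downstream
# consumers see whole documents instead of per-sentence spans.
pipeline = get_sentence_pipeline(spark=spark, doc_wise_annot=False)
result = pipeline.annotate("First sentence. Second sentence.")
print(result["sentence"])  # assumed output column name
```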
- get_single_task_conll(output, pos_pipeline, token_pipeline, ground_truth=False, excluded_labels=None, excluded_task_ids=None, excluded_task_titles=None)#
Generates CoNLL-style outputs from a single task in an Annotation Lab JSON export.
- Parameters:
output (dict) – enumerated items from an Annotation Lab JSON export
pos_pipeline (LightPipeline) – LightPipeline with a document assembler, sentence detector, tokenizer and POS-tagger
token_pipeline (LightPipeline) – LightPipeline with a document assembler, sentence detector and tokenizer
ground_truth (bool) – set to True to select ground truth completions, False to select the latest completions; defaults to False
excluded_labels (list) – labels to exclude from the CoNLL output (typically assertion labels and irrelevant NER labels); defaults to None
excluded_task_ids (list) – Annotation Lab task IDs to exclude from the CoNLL output; defaults to None
excluded_task_titles (list) – Annotation Lab task titles to exclude from the CoNLL output; defaults to None
- Returns:
CoNLL lines
- Return type:
list
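A sketch of how this might combine with the pipeline builders above. It assumes the export is a JSON list of task dicts and that `output` takes each item from `enumerate`; both are guesses from the docstring rather than confirmed behavior:

```python
import json

import sparknlp_jsl
from sparknlp_jsl.utils.alab_utils import (
    get_nlp_pos_pipeline,
    get_nlp_token_pipeline,
    get_single_task_conll,
)

spark = sparknlp_jsl.start("<license-secret>")  # assumed session setup

pos_pipeline = get_nlp_pos_pipeline(spark=spark, regex_pattern=None)
token_pipeline = get_nlp_token_pipeline(spark=spark, regex_pattern=None)

with open("alab_export.json") as f:  # hypothetical export path
    tasks = json.load(f)             # assumed: a list of task dicts

conll_lines = []
for item in enumerate(tasks):  # "enumerated items" per the docstring
    conll_lines.extend(
        get_single_task_conll(
            output=item,
            pos_pipeline=pos_pipeline,
            token_pipeline=token_pipeline,
            ground_truth=True,
            excluded_labels=["Present", "Absent"],  # hypothetical labels
        )
    )

with open("annotations.conll", "w") as f:
    f.write("\n".join(conll_lines))
```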
- get_token_df(text)#
Generates a token dataframe for the input text.
- Parameters:
text (str) – input text
- Returns:
Pandas DataFrame with columns ('token', 'begin', 'end')
- Return type:
pd.DataFrame
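A minimal sketch; given the module-level `token_pipeline_initialized` flag below, the token pipeline may need to be built first (e.g. via `get_nlp_token_pipeline`), which is an assumption:

```python
from sparknlp_jsl.utils.alab_utils import get_token_df

token_df = get_token_df("The patient was admitted on Monday.")
# Expected columns per the docstring: 'token', 'begin', 'end'.
print(token_df[["token", "begin", "end"]])
```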
- read_conll(spark, filepath)#
Reads a CoNLL file into a Spark DataFrame.
- Parameters:
spark (SparkSession) – Spark session with the spark-nlp-jsl jar
filepath (str) – path to the CoNLL file
- Returns:
Spark DataFrame
- Return type:
pyspark.sql.DataFrame
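A sketch with a placeholder file path and session setup:

```python
import sparknlp_jsl
from sparknlp_jsl.utils.alab_utils import read_conll

spark = sparknlp_jsl.start("<license-secret>")  # assumed session setup

conll_df = read_conll(spark, "annotations.conll")  # placeholder path
conll_df.show(5)
```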
- strip_json_entities(exported_json)#
Iterates through entities, strips the text, and updates the start and end positions in the JSON file exported from Annotation Lab.
- Parameters:
exported_json (dict) – JSON exported from Annotation Lab
- Returns:
updated JSON export
- Return type:
dict
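A sketch with a hypothetical export path:

```python
import json

from sparknlp_jsl.utils.alab_utils import strip_json_entities

with open("alab_export.json") as f:  # hypothetical export path
    exported_json = json.load(f)

# Returns the export with entity text stripped and start/end offsets realigned.
cleaned = strip_json_entities(exported_json)

with open("alab_export_stripped.json", "w") as f:
    json.dump(cleaned, f)
```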
- document_assembler: sparknlp.base.DocumentAssembler = None#
- empty_df: pyspark.sql.DataFrame = None#
- nlp_pos_pipeline: sparknlp.base.LightPipeline = None#
- nlp_token_pipeline: sparknlp.base.LightPipeline = None#
- pos: sparknlp.annotator.PerceptronModel = None#
- pos_pipeline_initialized = False#
- regex_tokenizer: sparknlp.annotator.RegexTokenizer = None#
- regular_tokenizer: sparknlp.annotator.Tokenizer = None#
- sentence_detector: sparknlp.annotator.SentenceDetectorDLModel = None#
- token_pipeline_initialized = False#