sparknlp_jsl.training#
Module Contents#
Classes#
- Annotation2Training – Converts annotation results from JSON or CSV files to a DataFrame suitable for NER training.
- AnnotationToolJsonReader – Reads and processes the exported JSON file from NLP Lab.
- CantemistReader – Parser for the Cantemist dataset.
- CodiEspReader – Parser for the CodiEsp dataset.
- REDatasetHelper – Class to preprocess a relation extraction (RE) dataset loaded into a Spark DataFrame.
- SynonymAugmentationUMLS – Augments datasets with synonyms from UMLS.
Attributes#
- class Annotation2Training(spark: pyspark.sql.SparkSession)#
Bases:
sparknlp.internal.ExtendedJavaWrapper
Converts annotation results from JSON or CSV files to a DataFrame suitable for NER training. Input files must have a structure similar to the one produced by John Snow Labs’ Generative AI annotation tool.
Example
>>> from sparknlp_jsl.training import Annotation2Training
>>> from sparknlp_jsl.annotator import *
>>> from sparknlp.annotator import *
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline

Create a base pipeline to extract sentences and tokens:

>>> document_assembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> pipeline = Pipeline(stages=[
...     document_assembler,
...     sentence_detector,
...     tokenizer,
... ])
>>> pipeline_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

Load the Annotation2Training utility class:

>>> annotation2training = Annotation2Training(spark)

Convert a JSON file to a training DataFrame:

>>> training_df = annotation2training.convertJson2NerDF(
...     json_path="path/to/annotations.json",  # Path to GenAI json file
...     pipeline_model=pipeline_model,
...     repartition=32,  # Number of partitions to use when creating the DataFrame
...     token_output_col="token",
...     ner_label_col="label")
>>> training_df.show(5)

Returns a Spark DataFrame to train NER models, ready to use with MedicalNerApproach.
- java_obj#
- sc = None#
- apply()#
- convertCsv2NerDF(csv_path: str, pipeline_model: pyspark.ml.PipelineModel, repartition: int = 32, token_output_col: str = 'token', ner_label_col: str = 'label')#
Converts a CSV file with annotation results to a DataFrame suitable for NER training.
- Parameters:
csv_path (str) – Path to the input CSV file. The file will be read with the spark.read.csv method with header, multiLine, quote and escape options set.
pipeline_model (PipelineModel) – A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler, and Tokenizer. PipelineModel can also include SentenceDetector, DocumentSplitter, WordEmbeddings, etc.
repartition (int, optional) – Number of partitions to use when creating the DataFrame (default is 32).
token_output_col (str, optional) – The name of the column containing token annotations (default is “token”).
ner_label_col (str, optional) – The name of the output column for NER labels (default is “label”).
- Returns:
A Spark DataFrame to train NER models.
- Return type:
DataFrame
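A minimal usage sketch for the CSV variant, analogous to the JSON example above; the CSV path is a placeholder and pipeline_model is assumed to be the fitted pipeline from the class-level example (DocumentAssembler, SentenceDetectorDLModel, Tokenizer):

>>> annotation2training = Annotation2Training(spark)
>>> training_df = annotation2training.convertCsv2NerDF(
...     csv_path="path/to/annotations.csv",  # placeholder path to the exported CSV file
...     pipeline_model=pipeline_model,       # fitted pipeline from the class-level example
...     repartition=32,
...     token_output_col="token",
...     ner_label_col="label")
>>> training_df.show(5)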
- convertJson2NerDF(json_path: str, pipeline_model: pyspark.ml.PipelineModel, repartition: int = 32, token_output_col: str = 'token', ner_label_col: str = 'label')#
Converts a JSON file with annotation results to a DataFrame suitable for NER training.
- Parameters:
json_path (str) – Path to the input JSON file. The file will be read with the spark.read.json method with multiLine option set to true.
pipeline_model (PipelineModel) – A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler, and Tokenizer. PipelineModel can also include SentenceDetector, DocumentSplitter, WordEmbeddings, etc.
repartition (int, optional) – Number of partitions to use when creating the DataFrame (default is 32).
token_output_col (str, optional) – The name of the column containing token annotations (default is “token”).
ner_label_col (str, optional) – The name of the output column for NER labels (default is “label”).
- Returns:
A Spark DataFrame to train NER models.
- Return type:
DataFrame
- getDataFrame(spark, jdf)#
- new_java_array(pylist, java_class)#
ToDo: Inspired by Spark 2.0. Review if Spark changes.
- new_java_array_integer(pylist)#
- new_java_array_string(pylist)#
- new_java_obj(java_class, *args)#
- spark_version()#
- class AnnotationToolJsonReader(pipeline_model: str = None, assertion_labels: list = None, excluded_labels: list = None, cleanup_mode: str = 'disabled', split_chars: list = None, context_chars: list = None, scheme: str = 'IOB', min_chars_tol: int = 2, align_chars_tol: int = 1, merge_overlapping: bool = True, SDDLPath: str = '')#
Bases:
sparknlp.internal.ExtendedJavaWrapper
Reads and processes the exported JSON file from NLP Lab.
Reader class that parses relevant information exported from NLP Lab into different formats. The reader can be used to create a training dataset for training assertion status (using the generateAssertionTrainSet method) or NER models (in the CoNLL format using the generateConll method).
To generate the assertion data, the following attributes need to be specified when instantiating the class:
- assertion_labels: The assertion labels to use.
- excluded_labels: The assertion labels that are excluded from the training dataset creation (can be an empty list).
Examples

>>> from sparknlp_jsl.training import AnnotationToolJsonReader
>>> assertion_labels = ["AsPresent", "Absent"]
>>> excluded_labels = ["Treatment"]
>>> rdr = AnnotationToolJsonReader(
...     assertion_labels=assertion_labels,
...     excluded_labels=excluded_labels,
... )
>>> path = "annotations.json"
>>> df = rdr.readDataset(spark, path)
>>> df.show(5)
|title|task_id|            text|completion_id|   completion_date|completion_user|ground_truth|        tool_chunk|assertion_label|relations|          document|          sentence|             token|         ner_label|
| null|     18|229937784 FIH...|        18001|2022-03-04T13:19:…|        aleksei|        true|[{chunk, 136, 142…|             []|       []|[{document, 0, 37…|[{document, 1, 79…|[{token, 1, 9, 22…|[{named_entity, 1…|
| null|     65|305038693 FIH...|        65001|2022-03-03T08:31:…|        aleksei|        true|[{chunk, 1, 9, 30…|             []|       []|[{document, 0, 17…|[{document, 1, 81…|[{token, 1, 9, 30…|[{named_entity, 1…|
| null|     21|844970557 RWH …|        21001|2022-03-04T13:00:…|        aleksei|        true|[{chunk, 1, 13, 8…|             []|       []|[{document, 0, 28…|[{document, 1, 25…|[{token, 1, 9, 84…|[{named_entity, 1…|
| null|     97|914783811 FIH...|        97001|2022-03-02T12:07:…|        aleksei|        true|[{chunk, 1, 9, 91…|             []|       []|[{document, 0, 90…|[{document, 1, 11…|[{token, 1, 9, 91…|[{named_entity, 1…|
| null|     84|733882247 FIH...|        84001|2022-03-02T15:01:…|        aleksei|        true|[{chunk, 1, 9, 73…|             []|       []|[{document, 0, 77…|[{document, 1, 10…|[{token, 1, 9, 73…|[{named_entity, 1…|
only showing top 5 rows

>>> assertion_df = rdr.generateAssertionTrainSet(df)
>>> rdr.generateConll(df, "annotations.conll")
- java_obj#
- sc = None#
- apply()#
- generateAssertionTrainSet(df, sentenceCol='sentence', assertionCol='assertion_label')#
Generates assertion training data at token level.
Using information from the sentence and assertion labels, this method generates a training data set with the following columns:
- text: sentence text
- target: the token text
- label: the assertion label
- start: start position of the token
- end: end position of the token
The tokens are identified internally with the constraints from the min_chars_tol and align_chars_tol parameters.
- Parameters:
df (DataFrame) – Data Frame containing the sentences and assertion labels.
sentenceCol (str, optional) – Column name containing the sentence annotation. Defaults to "sentence".
assertionCol (str, optional) – Column name containing the assertion annotation. Defaults to "assertion_label".
- Returns:
data frame with the training data.
- Return type:
DataFrame
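A minimal sketch with the default column names, assuming df is the DataFrame returned by readDataset in the class-level example:

>>> assertion_df = rdr.generateAssertionTrainSet(
...     df,
...     sentenceCol="sentence",
...     assertionCol="assertion_label")
>>> assertion_df.select("text", "target", "label", "start", "end").show(5)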
- generateConll(df, path: str, taskColumn: str = 'task_id', tokenCol: str = 'token', nerLabel: str = 'ner_label')#
Saves a CoNLL file from the exported annotations.
- Parameters:
df (DataFrame) – DataFrame containing the annotations read using readDataset.
path (str) – Path to where the CoNLL file will be saved.
taskColumn (str, optional) – Column containing the task id information. Defaults to "task_id".
tokenCol (str, optional) – Column containing the token information. Defaults to "token".
nerLabel (str, optional) – Column containing the NER label. Defaults to "ner_label".
- generatePlainAssertionTrainSet(df, taskColumn: str = 'task_id', tokenCol: str = 'token', nerLabel: str = 'ner_label', assertion_label: str = 'assertion_label')#
Generates assertion training data at chunk level.
Using information from the sentence, task id (from NLP Lab), NER label, and assertion labels, this method generates a training data set with the following columns:
- sentence: sentence text
- begin: start position of the token
- end: end position of the token
- ner: the NER chunk
- assertion: the assertion label
Internally uses the NerConverterInternal to identify the NER chunks.
- Parameters:
df (DataFrame) – Data Frame containing the sentences and assertion labels.
taskColumn (str, optional) – Column name containing the task id annotation. Defaults to "task_id".
tokenCol (str, optional) – Column name containing the token annotation. Defaults to "token".
nerLabel (str, optional) – Column name containing the NER label annotation. Defaults to "ner_label".
assertion_label (str, optional) – Column name containing the assertion label annotation. Defaults to "assertion_label".
- Returns:
data frame with the training data.
- Return type:
DataFrame
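A minimal chunk-level sketch using the default column names; df is again assumed to be the DataFrame returned by readDataset:

>>> chunk_assertion_df = rdr.generatePlainAssertionTrainSet(
...     df,
...     taskColumn="task_id",
...     tokenCol="token",
...     nerLabel="ner_label",
...     assertion_label="assertion_label")
>>> chunk_assertion_df.show(5)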
- getDataFrame(spark, jdf)#
- new_java_array(pylist, java_class)#
ToDo: Inspired by Spark 2.0. Review if Spark changes.
- new_java_array_integer(pylist)#
- new_java_array_string(pylist)#
- new_java_obj(java_class, *args)#
- readDataset(spark, path: str)#
Reads the exported JSON file into a Spark DataFrame.
- Parameters:
spark – Spark session.
path (str) – Path to the exported file.
- Returns:
Spark DataFrame containing the exported data.
- Return type:
DataFrame
- spark_version()#
- class CantemistReader(scheme='IOB')#
Bases:
sparknlp.internal.ExtendedJavaWrapper
Parser for the Cantemist dataset.
- java_obj#
- sc = None#
- apply()#
- getDataFrame(spark, jdf)#
- new_java_array(pylist, java_class)#
ToDo: Inspired by Spark 2.0. Review if Spark changes.
- new_java_array_integer(pylist)#
- new_java_array_string(pylist)#
- new_java_obj(java_class, *args)#
- readDatasetTaskNer(spark, textFolder)#
Reads the Cantemist dataset into a Spark DataFrame.
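A minimal usage sketch; the folder path is a placeholder for the directory containing the Cantemist text files:

>>> from sparknlp_jsl.training import CantemistReader
>>> cantemist_reader = CantemistReader(scheme="IOB")
>>> cantemist_df = cantemist_reader.readDatasetTaskNer(spark, "path/to/cantemist/txt/")
>>> cantemist_df.show(5)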
- spark_version()#
- class CodiEspReader(scheme='IOB')#
Bases:
sparknlp.internal.ExtendedJavaWrapper
Parser for the CodiEsp dataset.
- java_obj#
- sc = None#
- apply()#
- getDataFrame(spark, jdf)#
- new_java_array(pylist, java_class)#
ToDo: Inspired by Spark 2.0. Review if Spark changes.
- new_java_array_integer(pylist)#
- new_java_array_string(pylist)#
- new_java_obj(java_class, *args)#
- readDatasetTaskX(spark, path, textFolder, sep='\t')#
Reads the CodiEsp dataset into a Spark DataFrame.
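A minimal usage sketch; the file and folder paths are placeholders for the CodiEsp task annotations and text files:

>>> from sparknlp_jsl.training import CodiEspReader
>>> codiesp_reader = CodiEspReader(scheme="IOB")
>>> codiesp_df = codiesp_reader.readDatasetTaskX(
...     spark,
...     path="path/to/codiesp/task_annotations.tsv",
...     textFolder="path/to/codiesp/txt/",
...     sep="\t")
>>> codiesp_df.show(5)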
- spark_version()#
- class REDatasetHelper(spark_df: pyspark.sql.DataFrame)#
Class to preprocess a relation extraction (RE) dataset loaded into a Spark DataFrame.
Examples
>>> from sparknlp_jsl.training import REDatasetHelper
>>> PATH = "/content/i2b2_clinical_rel_dataset.csv"
>>> data = spark.read.option("header", "true").format("csv").load(PATH)
>>>
>>> column_map = {
...     "begin1": "firstCharEnt1",
...     "end1": "lastCharEnt1",
...     "begin2": "firstCharEnt2",
...     "end2": "lastCharEnt2",
...     "chunk1": "chunk1",
...     "chunk2": "chunk2",
...     "label1": "label1",
...     "label2": "label2"
... }
>>>
>>> # apply preprocess function to dataframe
>>> data = REDatasetHelper(data).create_annotation_column(column_map)
>>> data.show(5)
|dataset|       source|            txt_file|            sentence|sent_id|              chunk1|begin1|end1|  rel|              chunk2|begin2|end2|   label1|   label2|lastCharEnt1|firstCharEnt1|lastCharEnt2|firstCharEnt2|words_in_ent1|words_in_ent2|words_between|
|   test|beth+partners|i2b2 2010 VA/test...|VITAL SIGNS - Tem...|     44|    respiratory rate|    12|  13|    O|          saturation|    17|  17|     test|     test|          64|           49|          84|           75|            2|            1|            3|
|   test|beth+partners|i2b2 2010 VA/test...|No lotions , crea...|    146|             lotions|     1|   1|TrNAP|           incisions|     7|   7|treatment|  problem|           9|            3|          42|           34|            1|            1|            5|
|  train|     partners|i2b2 2010 VA/conc...|Because of expect...|     43|expected long ter...|     2|   6|    O|         a picc line|     8|  10|treatment|treatment|          54|           11|          68|           58|            5|            3|            1|
|  train|     partners|i2b2 2010 VA/conc...|She states this l...|     21|    light-headedness|     3|   3|  PIP|         diaphoresis|    12|  12|  problem|  problem|          31|           16|          92|           82|            1|            1|            8|
|   test|beth+partners|i2b2 2010 VA/test...|Initial electroca...|     61|an inferior and r...|    38|  43|  PIP|1-mm st depressio...|    28|  34|  problem|  problem|         239|          196|         176|          145|            6|            7|            3|
only showing top 5 rows

>>> # if data contains different splits, you can first preprocess then filter by dataset column.
>>> train_data = data.where("dataset='train'")
>>> test_data = data.where("dataset='test'")
- annotation_schema#
- data#
- create_annotation_column(column_map, ner_column_name='train_ner_chunks')#
Creates label column for RelationExtractionApproach.
- Args:
column_map – Required mapping between entity columns and dataset columns. Required columns are: begin1, end1, chunk1, label1, begin2, end2, chunk2, label2.
ner_column_name – A label column name for RelationExtractionApproach. Defaults to "train_ner_chunks".

Examples

For a dataset with the following schema:

>>> data.printSchema()
root
 |-- sentence: string (nullable = true)
 |-- chunk1: string (nullable = true)
 |-- begin1: string (nullable = true)
 |-- end1: string (nullable = true)
 |-- rel: string (nullable = true)
 |-- chunk2: string (nullable = true)
 |-- begin2: string (nullable = true)
 |-- end2: string (nullable = true)
 |-- label1: string (nullable = true)
 |-- label2: string (nullable = true)
 |-- lastCharEnt1: string (nullable = true)
 |-- firstCharEnt1: string (nullable = true)
 |-- lastCharEnt2: string (nullable = true)
 |-- firstCharEnt2: string (nullable = true)

the map should be as follows:

>>> column_map = {
...     "begin1": "firstCharEnt1",
...     "end1": "lastCharEnt1",
...     "begin2": "firstCharEnt2",
...     "end2": "lastCharEnt2",
...     "chunk1": "chunk1",
...     "chunk2": "chunk2",
...     "label1": "label1",
...     "label2": "label2"
... }
- Returns:
A new dataframe extended with ner chunk column.
- Return type:
DataFrame
- class SynonymAugmentationUMLS(spark, umls_path='', code_col='code', description_col='description', case_sensitive=False)#
Bases:
sparknlp.internal.ExtendedJavaWrapper
Augments datasets with synonyms from UMLS.
- java_obj#
- sc = None#
- apply()#
- augmentCsv(corpus_csv_path, ner_pipeline, language='ENG', do_product=False, augmentation_mode='plain_text', synonym_source='umls', regex_parsers=None, euclidean_distance_threshold=10.0, cosine_distance_threshold=0.25, synonym_limit=5, casing_functions=None)#
Augments the dataset based on a CSV file.
- augmentDataFrame(corpus_df, ner_pipeline, language='ENG', do_product=False, augmentation_mode='plain_text', synonym_source='umls', regex_parsers=None, euclidean_distance_threshold=10.0, cosine_distance_threshold=0.25, synonym_limit=5, casing_functions=None)#
Augments the dataset based on a Spark DataFrame.
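A minimal usage sketch, assuming umls_path points to a CSV of UMLS codes and descriptions, corpus_df is a DataFrame with the texts to augment, and ner_pipeline is a fitted PipelineModel that produces NER chunks (all three are placeholders):

>>> from sparknlp_jsl.training import SynonymAugmentationUMLS
>>> augmenter = SynonymAugmentationUMLS(
...     spark,
...     umls_path="path/to/umls_concepts.csv",  # placeholder UMLS code/description CSV
...     code_col="code",
...     description_col="description",
...     case_sensitive=False)
>>> augmented_df = augmenter.augmentDataFrame(
...     corpus_df=corpus_df,        # placeholder corpus DataFrame
...     ner_pipeline=ner_pipeline,  # placeholder fitted NER PipelineModel
...     language="ENG",
...     augmentation_mode="plain_text",
...     synonym_source="umls",
...     synonym_limit=5)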
- getDataFrame(spark, jdf)#
- new_java_array(pylist, java_class)#
ToDo: Inspired by Spark 2.0. Review if Spark changes.
- new_java_array_integer(pylist)#
- new_java_array_string(pylist)#
- new_java_obj(java_class, *args)#
- spark_version()#
- tf_graph#
- tf_graph_1x#