sparknlp_jsl.training#
Module Contents#
Classes#
- Annotation2Training – Converts annotation results from JSON or CSV files to a DataFrame suitable for NER training.
- AnnotationToolJsonReader – Reads and processes the exported JSON file from NLP Lab.
- CantemistReader – Parser for the Cantemist dataset.
- CodiEspReader – Parser for the CodiEsp dataset.
- REDatasetHelper – Class to preprocess a relation extraction (RE) dataset loaded into a Spark DataFrame.
- SynonymAugmentationUMLS – Augments datasets with synonyms from UMLS.
Attributes#
- class Annotation2Training(spark: pyspark.sql.SparkSession)#
Bases:
sparknlp.internal.ExtendedJavaWrapper
Converts annotation results from JSON or CSV files to a DataFrame suitable for NER training. Input files must have a structure similar to the one produced by John Snow Labs’ Generative AI annotation tool.
Example
>>> from sparknlp_jsl.training import Annotation2Training
>>> from sparknlp_jsl.annotator import *
>>> from sparknlp.annotator import *
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline

Create a base pipeline to extract sentences and tokens:

>>> document_assembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> pipeline = Pipeline(stages=[
...     document_assembler,
...     sentence_detector,
...     tokenizer,
... ])
>>> pipeline_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

Load the Annotation2Training utility class:

>>> annotation2training = Annotation2Training(spark)

Convert a JSON file to a training DataFrame:

>>> training_df = annotation2training.convertJson2NerDF(
...     json_path="path/to/annotations.json",  # Path to GenAI json file
...     pipeline_model=pipeline_model,
...     repartition=32,  # Number of partitions to use when creating the DataFrame
...     token_output_col="token",
...     ner_label_col="label")
>>> training_df.show(5)

Returns a Spark DataFrame to train NER models, ready to use with MedicalNerApproach.
- java_obj#
- sc = None#
- apply()#
- convertCsv2NerDF(csv_path: str, pipeline_model: pyspark.ml.PipelineModel, repartition: int = 32, token_output_col: str = 'token', ner_label_col: str = 'label')#
Converts a CSV file with annotation results to a DataFrame suitable for NER training.
- Parameters:
csv_path (str) – Path to the input CSV file. The file will be read with the spark.read.csv method with header, multiLine, quote and escape options set.
pipeline_model (PipelineModel) – A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler, and Tokenizer. PipelineModel can also include SentenceDetector, DocumentSplitter, WordEmbeddings, etc.
repartition (int, optional) – Number of partitions to use when creating the DataFrame (default is 32).
token_output_col (str, optional) – The name of the column containing token annotations (default is “token”).
ner_label_col (str, optional) – The name of the output column for NER labels (default is “label”).
- Returns:
A Spark DataFrame to train NER models.
- Return type:
DataFrame
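A minimal usage sketch for the CSV variant, analogous to the JSON example above; the CSV path is a placeholder and pipeline_model is assumed to be the fitted pipeline from the class-level example (DocumentAssembler, SentenceDetectorDLModel, Tokenizer):

>>> annotation2training = Annotation2Training(spark)
>>> training_df = annotation2training.convertCsv2NerDF(
...     csv_path="path/to/annotations.csv",  # placeholder path to the exported CSV file
...     pipeline_model=pipeline_model,       # fitted pipeline from the class-level example
...     repartition=32,
...     token_output_col="token",
...     ner_label_col="label")
>>> training_df.show(5)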
- convertJson2NerDF(json_path: str, pipeline_model: pyspark.ml.PipelineModel, repartition: int = 32, token_output_col: str = 'token', ner_label_col: str = 'label')#
Converts a JSON file with annotation results to a DataFrame suitable for NER training.
- Parameters:
json_path (str) – Path to the input JSON file. The file will be read with the spark.read.json method with multiLine option set to true.
pipeline_model (PipelineModel) – A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler, and Tokenizer. PipelineModel can also include SentenceDetector, DocumentSplitter, WordEmbeddings, etc.
repartition (int, optional) – Number of partitions to use when creating the DataFrame (default is 32).
token_output_col (str, optional) – The name of the column containing token annotations (default is “token”).
ner_label_col (str, optional) – The name of the output column for NER labels (default is “label”).
- Returns:
A Spark DataFrame to train NER models.
- Return type:
DataFrame
- getDataFrame(spark, jdf)#
- new_java_array(pylist, java_class)#
ToDo: Inspired by Spark 2.0. Review if Spark changes.
- new_java_array_integer(pylist)#
- new_java_array_string(pylist)#
- new_java_obj(java_class, *args)#
- spark_version()#
- class AnnotationToolJsonReader(pipeline_model: str = None, assertion_labels: list = None, excluded_labels: list = None, cleanup_mode: str = 'disabled', split_chars: list = None, context_chars: list = None, scheme: str = 'IOB', min_chars_tol: int = 2, align_chars_tol: int = 1, merge_overlapping: bool = True, SDDLPath: str = '')#
Bases:
sparknlp.internal.ExtendedJavaWrapper
Reads and processes the exported JSON file from NLP Lab.
Reader class that parses relevant information exported from NLP Lab into different formats. The reader can be used to create a training dataset for training assertion status (using the generateAssertionTrainSet method) or NER models (in the CoNLL format using the generateConll method).
To generate the assertion data, the following attributes need to be specified when instantiating the class:
- assertion_labels: The assertion labels to use.
- excluded_labels: The assertion labels that are excluded from the training dataset creation (can be an empty list).
Examples

>>> from sparknlp_jsl.training import AnnotationToolJsonReader
>>> assertion_labels = ["AsPresent", "Absent"]
>>> excluded_labels = ["Treatment"]
>>> rdr = AnnotationToolJsonReader(
...     assertion_labels=assertion_labels,
...     excluded_labels=excluded_labels,
... )
>>> path = "annotations.json"
>>> df = rdr.readDataset(spark, path)
>>> df.show(5)
|title|task_id|            text|completion_id|   completion_date|completion_user|ground_truth|        tool_chunk|assertion_label|relations|          document|          sentence|             token|         ner_label|
| null|     18|229937784 FIH...|        18001|2022-03-04T13:19:…|        aleksei|        true|[{chunk, 136, 142…|             []|       []|[{document, 0, 37…|[{document, 1, 79…|[{token, 1, 9, 22…|[{named_entity, 1…|
| null|     65|305038693 FIH...|        65001|2022-03-03T08:31:…|        aleksei|        true|[{chunk, 1, 9, 30…|             []|       []|[{document, 0, 17…|[{document, 1, 81…|[{token, 1, 9, 30…|[{named_entity, 1…|
| null|     21|844970557 RWH …|        21001|2022-03-04T13:00:…|        aleksei|        true|[{chunk, 1, 13, 8…|             []|       []|[{document, 0, 28…|[{document, 1, 25…|[{token, 1, 9, 84…|[{named_entity, 1…|
| null|     97|914783811 FIH...|        97001|2022-03-02T12:07:…|        aleksei|        true|[{chunk, 1, 9, 91…|             []|       []|[{document, 0, 90…|[{document, 1, 11…|[{token, 1, 9, 91…|[{named_entity, 1…|
| null|     84|733882247 FIH...|        84001|2022-03-02T15:01:…|        aleksei|        true|[{chunk, 1, 9, 73…|             []|       []|[{document, 0, 77…|[{document, 1, 10…|[{token, 1, 9, 73…|[{named_entity, 1…|
only showing top 5 rows

>>> assertion_df = rdr.generateAssertionTrainSet(df)
>>> rdr.generateConll(df, "annotations.conll")
- java_obj#
- sc = None#
- apply()#
- generateAssertionTrainSet(df, sentenceCol='sentence', assertionCol='assertion_label')#
Generates assertion training data at token level.
Using information from the sentence and assertion labels, this method generates a training data set with the following columns:
- text: sentence text
- target: the token text
- label: the assertion label
- start: start position of the token
- end: end position of the token
The tokens are identified internally with the constraints from the min_chars_tol and align_chars_tol parameters.
- Parameters:
df (DataFrame) – Data Frame containing the sentences and assertion labels.
sentenceCol (str, optional) – Column name containing the sentence annotation. Defaults to "sentence".
assertionCol (str, optional) – Column name containing the assertion annotation. Defaults to "assertion_label".
- Returns:
data frame with the training data.
- Return type:
DataFrame
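A minimal sketch with the default column names, assuming df is the DataFrame returned by readDataset in the class-level example:

>>> assertion_df = rdr.generateAssertionTrainSet(
...     df,
...     sentenceCol="sentence",
...     assertionCol="assertion_label")
>>> assertion_df.select("text", "target", "label", "start", "end").show(5)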
- generateConll(df, path: str, taskColumn: str = 'task_id', tokenCol: str = 'token', nerLabel: str = 'ner_label')#
Saves a CoNLL file from the exported annotations.
- Parameters:
df (DataFrame) – DataFrame containing the annotations read using readDataset.
path (str) – Path to where the CoNLL file will be saved.
taskColumn (str, optional) – Column containing the task id information. Defaults to "task_id".
tokenCol (str, optional) – Column containing the token information. Defaults to "token".
nerLabel (str, optional) – Column containing the NER label. Defaults to "ner_label".
- generatePlainAssertionTrainSet(df, taskColumn: str = 'task_id', tokenCol: str = 'token', nerLabel: str = 'ner_label', assertion_label: str = 'assertion_label')#
Generates assertion training data at chunk level.
Using information from the sentence, task id (from NLP Lab), NER label, and assertion labels, this method generates a training data set with the following columns:
- sentence: sentence text
- begin: start position of the token
- end: end position of the token
- ner: the NER chunk
- assertion: the assertion label
Internally uses the NerConverterInternal to identify the NER chunks.
- Parameters:
df (DataFrame) – Data Frame containing the sentences and assertion labels.
taskColumn (str, optional) – Column name containing the task id annotation. Defaults to "task_id".
tokenCol (str, optional) – Column name containing the token annotation. Defaults to "token".
nerLabel (str, optional) – Column name containing the NER label annotation. Defaults to "ner_label".
assertion_label (str, optional) – Column name containing the assertion label annotation. Defaults to "assertion_label".
- Returns:
data frame with the training data.
- Return type:
DataFrame
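A minimal chunk-level sketch using the default column names; df is again assumed to be the DataFrame returned by readDataset:

>>> chunk_assertion_df = rdr.generatePlainAssertionTrainSet(
...     df,
...     taskColumn="task_id",
...     tokenCol="token",
...     nerLabel="ner_label",
...     assertion_label="assertion_label")
>>> chunk_assertion_df.show(5)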
- getDataFrame(spark, jdf)#
- new_java_array(pylist, java_class)#
ToDo: Inspired by Spark 2.0. Review if Spark changes.
- new_java_array_integer(pylist)#
- new_java_array_string(pylist)#
- new_java_obj(java_class, *args)#
- readDataset(spark, path: str)#
Reads the exported JSON file into a Spark DataFrame.
- Parameters:
spark – Spark session.
path (str) – Path to the exported file.
- Returns:
Spark DataFrame containing the exported data.
- Return type:
DataFrame
- spark_version()#
- class CantemistReader(scheme='IOB')#
Bases:
sparknlp.internal.ExtendedJavaWrapper
Parser for the Cantemist dataset.
- java_obj#
- sc = None#
- apply()#
- getDataFrame(spark, jdf)#
- new_java_array(pylist, java_class)#
ToDo: Inspired by Spark 2.0. Review if Spark changes.
- new_java_array_integer(pylist)#
- new_java_array_string(pylist)#
- new_java_obj(java_class, *args)#
- readDatasetTaskNer(spark, textFolder)#
Reads the Cantemist dataset into a Spark DataFrame.
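A minimal usage sketch; the folder path is a placeholder for the directory containing the Cantemist text files:

>>> from sparknlp_jsl.training import CantemistReader
>>> cantemist_reader = CantemistReader(scheme="IOB")
>>> cantemist_df = cantemist_reader.readDatasetTaskNer(spark, "path/to/cantemist/txt/")
>>> cantemist_df.show(5)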
- spark_version()#
- class CodiEspReader(scheme='IOB')#
Bases:
sparknlp.internal.ExtendedJavaWrapper
Parser for the CodiEsp dataset.
- java_obj#
- sc = None#
- apply()#
- getDataFrame(spark, jdf)#
- new_java_array(pylist, java_class)#
ToDo: Inspired by Spark 2.0. Review if Spark changes.
- new_java_array_integer(pylist)#
- new_java_array_string(pylist)#
- new_java_obj(java_class, *args)#
- readDatasetTaskX(spark, path, textFolder, sep='\t')#
Reads the CodiEsp dataset into a Spark DataFrame.
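A minimal usage sketch; the file and folder paths are placeholders for the CodiEsp task annotations and text files:

>>> from sparknlp_jsl.training import CodiEspReader
>>> codiesp_reader = CodiEspReader(scheme="IOB")
>>> codiesp_df = codiesp_reader.readDatasetTaskX(
...     spark,
...     path="path/to/codiesp/task_annotations.tsv",
...     textFolder="path/to/codiesp/txt/",
...     sep="\t")
>>> codiesp_df.show(5)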
- spark_version()#
- class REDatasetHelper(spark_df: pyspark.sql.DataFrame)#
Class to preprocess a relation extraction (RE) dataset loaded into a Spark DataFrame.
Examples
>>> from sparknlp_jsl.training import REDatasetHelper
>>> PATH = "/content/i2b2_clinical_rel_dataset.csv"
>>> data = spark.read.option("header", "true").format("csv").load(PATH)
>>>
>>> column_map = {
...     "begin1": "firstCharEnt1",
...     "end1": "lastCharEnt1",
...     "begin2": "firstCharEnt2",
...     "end2": "lastCharEnt2",
...     "chunk1": "chunk1",
...     "chunk2": "chunk2",
...     "label1": "label1",
...     "label2": "label2"
... }
>>>
>>> # apply preprocess function to dataframe
>>> data = REDatasetHelper(data).create_annotation_column(column_map)
>>> data.show(5)
|dataset|       source|            txt_file|            sentence|sent_id|              chunk1|begin1|end1|  rel|              chunk2|begin2|end2|   label1|   label2|lastCharEnt1|firstCharEnt1|lastCharEnt2|firstCharEnt2|words_in_ent1|words_in_ent2|words_between|
|   test|beth+partners|i2b2 2010 VA/test...|VITAL SIGNS - Tem...|     44|    respiratory rate|    12|  13|    O|          saturation|    17|  17|     test|     test|          64|           49|          84|           75|            2|            1|            3|
|   test|beth+partners|i2b2 2010 VA/test...|No lotions , crea...|    146|             lotions|     1|   1|TrNAP|           incisions|     7|   7|treatment|  problem|           9|            3|          42|           34|            1|            1|            5|
|  train|     partners|i2b2 2010 VA/conc...|Because of expect...|     43|expected long ter...|     2|   6|    O|         a picc line|     8|  10|treatment|treatment|          54|           11|          68|           58|            5|            3|            1|
|  train|     partners|i2b2 2010 VA/conc...|She states this l...|     21|    light-headedness|     3|   3|  PIP|         diaphoresis|    12|  12|  problem|  problem|          31|           16|          92|           82|            1|            1|            8|
|   test|beth+partners|i2b2 2010 VA/test...|Initial electroca...|     61|an inferior and r...|    38|  43|  PIP|1-mm st depressio...|    28|  34|  problem|  problem|         239|          196|         176|          145|            6|            7|            3|
only showing top 5 rows

>>> # if data contains different splits, you can first preprocess then filter by dataset column.
>>> train_data = data.where("dataset='train'")
>>> test_data = data.where("dataset='test'")
- annotation_schema#
- data#
- create_annotation_column(column_map, ner_column_name='train_ner_chunks')#
Creates label column for RelationExtractionApproach.
- Args:
column_map – Required mapping between entity columns and dataset columns. Required columns are: begin1, end1, chunk1, label1, begin2, end2, chunk2, label2.
ner_column_name – A label column name for RelationExtractionApproach. Defaults to "train_ner_chunks".

Examples

For a dataset with the following schema:

>>> data.printSchema()
root
 |-- sentence: string (nullable = true)
 |-- chunk1: string (nullable = true)
 |-- begin1: string (nullable = true)
 |-- end1: string (nullable = true)
 |-- rel: string (nullable = true)
 |-- chunk2: string (nullable = true)
 |-- begin2: string (nullable = true)
 |-- end2: string (nullable = true)
 |-- label1: string (nullable = true)
 |-- label2: string (nullable = true)
 |-- lastCharEnt1: string (nullable = true)
 |-- firstCharEnt1: string (nullable = true)
 |-- lastCharEnt2: string (nullable = true)
 |-- firstCharEnt2: string (nullable = true)

the map should be as follows:

>>> column_map = {
...     "begin1": "firstCharEnt1",
...     "end1": "lastCharEnt1",
...     "begin2": "firstCharEnt2",
...     "end2": "lastCharEnt2",
...     "chunk1": "chunk1",
...     "chunk2": "chunk2",
...     "label1": "label1",
...     "label2": "label2"
... }
- Returns:
A new dataframe extended with ner chunk column.
- Return type:
DataFrame
- class SynonymAugmentationUMLS(spark, umls_path='', code_col='code', description_col='description', case_sensitive=False)#
Bases:
sparknlp.internal.ExtendedJavaWrapper
Augments datasets with synonyms from UMLS.
- java_obj#
- sc = None#
- apply()#
- augmentCsv(corpus_csv_path, ner_pipeline, language='ENG', do_product=False, augmentation_mode='plain_text', synonym_source='umls', regex_parsers=None, euclidean_distance_threshold=10.0, cosine_distance_threshold=0.25, synonym_limit=5, casing_functions=None)#
Augments the dataset based on a CSV file.
- augmentDataFrame(corpus_df, ner_pipeline, language='ENG', do_product=False, augmentation_mode='plain_text', synonym_source='umls', regex_parsers=None, euclidean_distance_threshold=10.0, cosine_distance_threshold=0.25, synonym_limit=5, casing_functions=None)#
Augments the dataset based on a Spark DataFrame.
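A minimal usage sketch, assuming umls_path points to a CSV of UMLS codes and descriptions, corpus_df is a DataFrame with the texts to augment, and ner_pipeline is a fitted PipelineModel that produces NER chunks (all three are placeholders):

>>> from sparknlp_jsl.training import SynonymAugmentationUMLS
>>> augmenter = SynonymAugmentationUMLS(
...     spark,
...     umls_path="path/to/umls_concepts.csv",  # placeholder UMLS code/description CSV
...     code_col="code",
...     description_col="description",
...     case_sensitive=False)
>>> augmented_df = augmenter.augmentDataFrame(
...     corpus_df=corpus_df,        # placeholder corpus DataFrame
...     ner_pipeline=ner_pipeline,  # placeholder fitted NER PipelineModel
...     language="ENG",
...     augmentation_mode="plain_text",
...     synonym_source="umls",
...     synonym_limit=5)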
- getDataFrame(spark, jdf)#
- new_java_array(pylist, java_class)#
ToDo: Inspired by Spark 2.0. Review if Spark changes.
- new_java_array_integer(pylist)#
- new_java_array_string(pylist)#
- new_java_obj(java_class, *args)#
- spark_version()#
- tf_graph#
- tf_graph_1x#