sparknlp_jsl.training#

Module Contents#

Classes#

AnnotationToolJsonReader

Reads and processes the exported JSON file from NLP Lab.

CantemistReader

Parser for the Cantemist dataset.

CodiEspReader

Parser for the CodiEsp dataset.

REDatasetHelper

Class to preprocess a relation extraction (RE) dataset loaded into a Spark DataFrame.

SynonymAugmentationUMLS

Augments datasets with synonyms from UMLS.

Attributes#

tf_graph

tf_graph_1x

class AnnotationToolJsonReader(pipeline_model: str = None, assertion_labels: list = None, excluded_labels: list = None, cleanup_mode: str = 'disabled', split_chars: list = None, context_chars: list = None, scheme: str = 'IOB', min_chars_tol: int = 2, align_chars_tol: int = 1, merge_overlapping: bool = True, SDDLPath: str = '')#

Bases: sparknlp.internal.ExtendedJavaWrapper

Reads and processes the exported JSON file from NLP Lab.

Reader class that parses the relevant information exported from NLP Lab into different formats. The reader can be used to create training datasets for assertion status models (using the generateAssertionTrainSet method) or NER models (in CoNLL format, using the generateConll method).

To generate the assertion data, the following attributes need to be specified when instantiating the class:

  • assertion_labels: The assertion labels to use.

  • excluded_labels: The assertion labels excluded from the training dataset creation (can be an empty list).

Examples:
>>> from sparknlp_jsl.training import AnnotationToolJsonReader
>>> assertion_labels = ["AsPresent","Absent"]
>>> excluded_labels = ["Treatment"]
>>> rdr = AnnotationToolJsonReader(
...     assertion_labels=assertion_labels,
...     excluded_labels=excluded_labels,
...     )
>>> path = "annotations.json"
>>> df = rdr.readDataset(spark, path)
>>> df.show(5)
+-----+-------+--------------------+-------------+--------------------+---------------+------------+--------------------+---------------+---------+--------------------+--------------------+--------------------+--------------------+
|title|task_id|                text|completion_id|     completion_date|completion_user|ground_truth|          tool_chunk|assertion_label|relations|            document|            sentence|               token|           ner_label|
+-----+-------+--------------------+-------------+--------------------+---------------+------------+--------------------+---------------+---------+--------------------+--------------------+--------------------+--------------------+
| null|     18| 229937784 FIH...| 18001|2022-03-04T13:19:…| aleksei| true|[{chunk, 136, 142…| []| []|[{document, 0, 37…|[{document, 1, 79…|[{token, 1, 9, 22…|[{named_entity, 1…|
| null|     65| 305038693 FIH...| 65001|2022-03-03T08:31:…| aleksei| true|[{chunk, 1, 9, 30…| []| []|[{document, 0, 17…|[{document, 1, 81…|[{token, 1, 9, 30…|[{named_entity, 1…|
| null|     21| 844970557 RWH …| 21001|2022-03-04T13:00:…| aleksei| true|[{chunk, 1, 13, 8…| []| []|[{document, 0, 28…|[{document, 1, 25…|[{token, 1, 9, 84…|[{named_entity, 1…|
| null|     97| 914783811 FIH...| 97001|2022-03-02T12:07:…| aleksei| true|[{chunk, 1, 9, 91…| []| []|[{document, 0, 90…|[{document, 1, 11…|[{token, 1, 9, 91…|[{named_entity, 1…|
| null|     84| 733882247 FIH...| 84001|2022-03-02T15:01:…| aleksei| true|[{chunk, 1, 9, 73…| []| []|[{document, 0, 77…|[{document, 1, 10…|[{token, 1, 9, 73…|[{named_entity, 1…|
+-----+-------+--------------------+-------------+--------------------+---------------+------------+--------------------+---------------+---------+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows

>>> assertion_df = rdr.generateAssertionTrainSet(df)
>>> rdr.generateConll(df, "annotations.conll")

apply()#
generateAssertionTrainSet(df, sentenceCol='sentence', assertionCol='assertion_label')#

Generates assertion training data at token level.

Using information from the sentence and assertion labels, this method generates a training dataset with the following columns:

  • text: the sentence text

  • target: the token text

  • label: the assertion label

  • start: start position of the token

  • end: end position of the token

The tokens are identified internally with the constraints from the min_chars_tol and align_chars_tol parameters.

Parameters:
  • df (DataFrame) – DataFrame containing the sentences and assertion labels.

  • sentenceCol (str, optional) – Column name containing the sentence annotation. Defaults to "sentence".

  • assertionCol (str, optional) – Column name containing the assertion annotation. Defaults to "assertion_label".

Returns:

DataFrame with the training data.

Return type:

DataFrame
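
A hedged sketch of the call, continuing the rdr and df objects from the class-level example above; the selected columns follow the description of the generated dataset:

>>> assertion_df = rdr.generateAssertionTrainSet(
...     df, sentenceCol="sentence", assertionCol="assertion_label"
... )
>>> assertion_df.select("text", "target", "label", "start", "end").show(3, truncate=50)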

generateConll(df, path: str, taskColumn: str = 'task_id', tokenCol: str = 'token', nerLabel: str = 'ner_label')#

Saves a CoNLL file from the exported annotations.

Parameters:
  • df (DataFrame) – DataFrame containing the annotations read using readDataset.

  • path (str) – Path to where the CoNLL file will be saved.

  • taskColumn (str, optional) – Column containing the task id information. Defaults to "task_id".

  • tokenCol (str, optional) – Column containing the token information. Defaults to "token".

  • nerLabel (str, optional) – Column containing the NER label. Defaults to "ner_label".

generatePlainAssertionTrainSet(df, taskColumn: str = 'task_id', tokenCol: str = 'token', nerLabel: str = 'ner_label', assertion_label: str = 'assertion_label')#

Generates assertion training data at chunk level.

Using information from the sentence, the task id (from NLP Lab), the NER label, and the assertion labels, this method generates a training dataset with the following columns:

  • sentence: the sentence text

  • begin: start position of the token

  • end: end position of the token

  • ner: the NER chunk

  • assertion: the assertion label

Internally uses the NerConverterInternal to identify the NER chunks.

Parameters:
  • df (DataFrame) – DataFrame containing the sentences and assertion labels.

  • taskColumn (str, optional) – Column name containing the task id annotation. Defaults to "task_id".

  • tokenCol (str, optional) – Column name containing the token annotation. Defaults to "token".

  • nerLabel (str, optional) – Column name containing the NER label annotation. Defaults to "ner_label".

  • assertion_label (str, optional) – Column name containing the assertion label annotation. Defaults to "assertion_label".

Returns:

DataFrame with the training data.

Return type:

DataFrame
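
A hedged sketch of the chunk-level variant, again reusing rdr and df from the class-level example; the selected columns follow the description above:

>>> chunk_df = rdr.generatePlainAssertionTrainSet(
...     df,
...     taskColumn="task_id",
...     tokenCol="token",
...     nerLabel="ner_label",
...     assertion_label="assertion_label",
... )
>>> chunk_df.select("sentence", "begin", "end", "ner", "assertion").show(3, truncate=50)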

getDataFrame(spark, jdf)#
new_java_array(pylist, java_class)#

ToDo: Inspired by Spark 2.0. Review if Spark changes.

new_java_array_integer(pylist)#
new_java_array_string(pylist)#
new_java_obj(java_class, *args)#
readDataset(spark, path: str)#

Reads the exported JSON file into a Spark DataFrame.

Parameters:
  • spark – Spark session.

  • path (str) – Path to the exported file.

Returns:

Spark DataFrame containing the exported data.

Return type:

DataFrame

spark_version()#
class CantemistReader(scheme='IOB')#

Bases: sparknlp.internal.ExtendedJavaWrapper

Parser for the Cantemist dataset.

apply()#
getDataFrame(spark, jdf)#
new_java_array(pylist, java_class)#

ToDo: Inspired by Spark 2.0. Review if Spark changes.

new_java_array_integer(pylist)#
new_java_array_string(pylist)#
new_java_obj(java_class, *args)#
readDatasetTaskNer(spark, textFolder)#

Reads the Cantemist dataset into a Spark DataFrame.
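
A hedged usage sketch; the folder path is hypothetical and assumes the unpacked Cantemist NER task files:

>>> from sparknlp_jsl.training import CantemistReader
>>> reader = CantemistReader(scheme="IOB")
>>> # textFolder: directory with the Cantemist NER annotation and text files (hypothetical path)
>>> cantemist_df = reader.readDatasetTaskNer(spark, "cantemist/train-set/cantemist-ner/")
>>> cantemist_df.show(5, truncate=80)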

spark_version()#
class CodiEspReader(scheme='IOB')#

Bases: sparknlp.internal.ExtendedJavaWrapper

Parser for the CodiEsp dataset.

apply()#
getDataFrame(spark, jdf)#
new_java_array(pylist, java_class)#

ToDo: Inspired by Spark 2.0. Review if Spark changes.

new_java_array_integer(pylist)#
new_java_array_string(pylist)#
new_java_obj(java_class, *args)#
readDatasetTaskX(spark, path, textFolder, sep='\t')#

Reads the CodiEsp dataset into a Spark DataFrame.
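
A hedged usage sketch; the file and folder paths are hypothetical:

>>> from sparknlp_jsl.training import CodiEspReader
>>> reader = CodiEspReader(scheme="IOB")
>>> # path: annotation TSV file; textFolder: directory with the raw clinical texts (hypothetical paths)
>>> codiesp_df = reader.readDatasetTaskX(spark, "codiesp/train/trainX.tsv", "codiesp/train/text_files/", sep="\t")
>>> codiesp_df.show(5, truncate=80)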

spark_version()#
class REDatasetHelper(spark_df: pyspark.sql.DataFrame)#

Class to preprocess a relation extraction (RE) dataset loaded into a Spark DataFrame.

Examples

>>> from sparknlp_jsl.training import REDatasetHelper
>>> PATH = "/content/i2b2_clinical_rel_dataset.csv"
>>> data = spark.read.option("header","true").format("csv").load(PATH)
>>>
>>> column_map = {
...     "begin1": "firstCharEnt1",
...     "end1": "lastCharEnt1",
...     "begin2": "firstCharEnt2",
...     "end2": "lastCharEnt2",
...     "chunk1": "chunk1",
...     "chunk2": "chunk2",
...     "label1": "label1",
...     "label2": "label2",
... }
>>>
>>> # apply preprocess function to dataframe
>>> data = REDatasetHelper(data).create_annotation_column(column_map)
>>> data.show(5)
+-------+-------------+--------------------+--------------------+-------+--------------------+------+----+-----+--------------------+------+----+---------+---------+------------+-------------+------------+-------------+-------------+-------------+-------------+
|dataset|       source|            txt_file|            sentence|sent_id|              chunk1|begin1|end1|  rel|              chunk2|begin2|end2|   label1|   label2|lastCharEnt1|firstCharEnt1|lastCharEnt2|firstCharEnt2|words_in_ent1|words_in_ent2|words_between|
+-------+-------------+--------------------+--------------------+-------+--------------------+------+----+-----+--------------------+------+----+---------+---------+------------+-------------+------------+-------------+-------------+-------------+-------------+
|   test|beth+partners|i2b2 2010 VA/test...|VITAL SIGNS - Tem...|     44|    respiratory rate|    12|  13|    O|          saturation|    17|  17|     test|     test|          64|           49|          84|           75|            2|            1|            3|
|   test|beth+partners|i2b2 2010 VA/test...|No lotions , crea...|    146|             lotions|     1|   1|TrNAP|           incisions|     7|   7|treatment|  problem|           9|            3|          42|           34|            1|            1|            5|
|  train|     partners|i2b2 2010 VA/conc...|Because of expect...|     43|expected long ter...|     2|   6|    O|         a picc line|     8|  10|treatment|treatment|          54|           11|          68|           58|            5|            3|            1|
|  train|     partners|i2b2 2010 VA/conc...|She states this l...|     21|    light-headedness|     3|   3|  PIP|         diaphoresis|    12|  12|  problem|  problem|          31|           16|          92|           82|            1|            1|            8|
|   test|beth+partners|i2b2 2010 VA/test...|Initial electroca...|     61|an inferior and r...|    38|  43|  PIP|1-mm st depressio...|    28|  34|  problem|  problem|         239|          196|         176|          145|            6|            7|            3|
+-------+-------------+--------------------+--------------------+-------+--------------------+------+----+-----+--------------------+------+----+---------+---------+------------+-------------+------------+-------------+-------------+-------------+-------------+
only showing top 5 rows
>>> # if the data contains different splits, you can preprocess first and then filter by the dataset column.
>>> train_data = data.where("dataset='train'")
>>> test_data = data.where("dataset='test'")
create_annotation_column(column_map, ner_column_name='train_ner_chunks')#

Creates label column for RelationExtractionApproach.

Parameters:
  • column_map (dict) – Required mapping between entity columns and dataset columns. Required columns are: begin1, end1, chunk1, label1, begin2, end2, chunk2, label2.

  • ner_column_name (str, optional) – Label column name for RelationExtractionApproach. Defaults to "train_ner_chunks".

Examples

# for a dataset with the following schema:
>>> data.printSchema()
root
 |-- sentence: string (nullable = true)
 |-- chunk1: string (nullable = true)
 |-- begin1: string (nullable = true)
 |-- end1: string (nullable = true)
 |-- rel: string (nullable = true)
 |-- chunk2: string (nullable = true)
 |-- begin2: string (nullable = true)
 |-- end2: string (nullable = true)
 |-- label1: string (nullable = true)
 |-- label2: string (nullable = true)
 |-- lastCharEnt1: string (nullable = true)
 |-- firstCharEnt1: string (nullable = true)
 |-- lastCharEnt2: string (nullable = true)
 |-- firstCharEnt2: string (nullable = true)

# the map should be as follows:
>>> column_map = {
...     "begin1": "firstCharEnt1",
...     "end1": "lastCharEnt1",
...     "begin2": "firstCharEnt2",
...     "end2": "lastCharEnt2",
...     "chunk1": "chunk1",
...     "chunk2": "chunk2",
...     "label1": "label1",
...     "label2": "label2",
... }

Returns:

A new DataFrame extended with the NER chunk column.

Return type:

DataFrame

class SynonymAugmentationUMLS(spark, umls_path='', code_col='code', description_col='description', case_sensitive=False)#

Bases: sparknlp.internal.ExtendedJavaWrapper

Augments datasets with synonyms from UMLS.

apply()#
augmentCsv(corpus_csv_path, ner_pipeline, language='ENG', do_product=False, augmentation_mode='plain_text', synonym_source='umls', regex_parsers=None, euclidean_distance_threshold=10.0, cosine_distance_threshold=0.25, synonym_limit=5, casing_functions=None)#

Augments the dataset based on a CSV file.

augmentDataFrame(corpus_df, ner_pipeline, language='ENG', do_product=False, augmentation_mode='plain_text', synonym_source='umls', regex_parsers=None, euclidean_distance_threshold=10.0, cosine_distance_threshold=0.25, synonym_limit=5, casing_functions=None)#

Augments the dataset based on a Spark DataFrame.
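
A hedged sketch of DataFrame-based augmentation. The file name umls_concepts.csv, corpus_df, and ner_pipeline are assumptions: any UMLS export with code/description columns, a DataFrame with the text to augment, and a fitted PipelineModel producing NER chunks.

>>> from sparknlp_jsl.training import SynonymAugmentationUMLS
>>> augmenter = SynonymAugmentationUMLS(
...     spark,
...     umls_path="umls_concepts.csv",   # hypothetical UMLS export with code/description columns
...     code_col="code",
...     description_col="description",
...     case_sensitive=False,
... )
>>> augmented_df = augmenter.augmentDataFrame(
...     corpus_df,                       # assumed: DataFrame with the text to augment
...     ner_pipeline,                    # assumed: fitted PipelineModel producing NER chunks
...     language="ENG",
...     augmentation_mode="plain_text",
...     synonym_limit=5,
... )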

getDataFrame(spark, jdf)#
new_java_array(pylist, java_class)#

ToDo: Inspired by Spark 2.0. Review if Spark changes.

new_java_array_integer(pylist)#
new_java_array_string(pylist)#
new_java_obj(java_class, *args)#
spark_version()#
tf_graph#
tf_graph_1x#
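
Both attributes expose helpers for building the TensorFlow graphs used when training deep-learning annotators, with tf_graph_1x as the TensorFlow 1.x variant. A hedged sketch of the commonly documented usage; the build parameter values and output folder are illustrative assumptions:

>>> from sparknlp_jsl.training import tf_graph
>>> tf_graph.print_model_params("ner_dl")   # inspect the parameters a graph type accepts
>>> tf_graph.build(
...     "ner_dl",
...     build_params={"embeddings_dim": 200, "nchars": 83, "ntags": 12},  # illustrative values
...     model_location="./medical_ner_graphs",                            # hypothetical output folder
...     model_filename="auto",
... )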