
com.johnsnowlabs.nlp.training

AnnotationToolJsonReader

class AnnotationToolJsonReader extends CheckLicense

Reads and processes the exported JSON file from NLP Lab.

Reader class that parses relevant information exported from NLP Lab into different formats. The reader can be used to create a training dataset for training assertion status models (using the generateAssertionTrainSet method) or NER models (in CoNLL format, using the generateConll method).

To generate the assertion data, the following attributes need to be specified when instantiating the class:

  • assertion_labels: The assertion labels to use.
  • excluded_labels: The assertion labels that are excluded from the training dataset creation (can be an empty list).
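As a minimal sketch, the reader could be instantiated with these two attributes like this (the label names here are hypothetical; use the labels present in your NLP Lab export):

```scala
import com.johnsnowlabs.nlp.training.AnnotationToolJsonReader

// Hypothetical assertion labels for illustration only.
val reader = new AnnotationToolJsonReader(
  Array("Affirmed", "Negated"),   // assertionLabels
  Array.empty[String]             // excludedLabels (may be empty)
)
```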

Linear Supertypes
CheckLicense, AnyRef, Any

Instance Constructors

  1. new AnnotationToolJsonReader(assertionLabels: Array[String])
  2. new AnnotationToolJsonReader(pipelineModel: PipelineModel, assertionLabels: List[String])
  3. new AnnotationToolJsonReader(assertionLabels: Array[String], excludedLabels: Array[String])
  4. new AnnotationToolJsonReader(pipelineModel: PipelineModel, assertionLabels: Array[String], excludedLabels: Array[String], scheme: String, minCharsTol: Int, alignCharsTol: Int, mergeOverlapping: Boolean)
  5. new AnnotationToolJsonReader(pipelineModel: PipelineModel, assertionLabels: List[String], excludedLabels: List[String], scheme: String, minCharsTol: Int, alignCharsTol: Int, mergeOverlapping: Boolean)

    pipelineModel

The pipeline model that is used to create the documents, the sentences, and the tokens. An example of such a pipeline could be the following:

      val documentAssembler = new DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")

      val sentenceDetector = new SentenceDetector()
        .setInputCols(Array("document"))
        .setOutputCol("sentence")

      val tokenizer = new Tokenizer()
        .setInputCols(Array("sentence"))
        .setOutputCol("token")

      val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer))

    assertionLabels

    The labels that will be interpreted as Assertion labels.

    excludedLabels

    The labels that will be excluded from the NER / Assertion Annotations.

    scheme

The scheme that will be used to create the IOB_tagger (IOB or BIOES).

    minCharsTol

The minimum length of a token for it to be considered a label.

    alignCharsTol

The maximum number of characters of difference allowed when aligning the start / end tokens with the Assertion Annotations.

    mergeOverlapping

'true' to merge overlapping annotations, prioritizing the longest and most diverse annotations; 'false' to keep both chunks in the same position.

  6. new AnnotationToolJsonReader(assertionLabels: List[String] = List.empty[String].asJava, excludedLabels: List[String] = List.empty[String].asJava, cleanupMode: String = "disabled", splitChars: List[String] = List().asJava, contextChars: List[String] = List().asJava, scheme: String = "IOB", minCharsTol: Int = 2, alignCharsTol: Int = 1, mergeOverlapping: Boolean = true, SDDLPath: String = "")

The annotation tool JSON reader generates an assertion training set from the JSON exported from NLP Lab.

    assertionLabels

The assertion labels used for the training dataset creation.

    excludedLabels

    The NER or Assertion labels to be excluded from the training dataset creation.

    cleanupMode

The cleanup mode that is used in the DocumentAssembler transformer.

    splitChars

The split chars that are used in the default tokenizer. See com.johnsnowlabs.nlp.annotators.Tokenizer.setSplitChars().

    contextChars

The context chars that are used in the default tokenizer. See com.johnsnowlabs.nlp.annotators.Tokenizer.setContextChars().

    scheme

The scheme that will be used to create the IOB_tagger (IOB or BIOES).

    minCharsTol

The minimum length of a token for the alignment tolerance feature to apply.

    alignCharsTol

The maximum number of characters of difference allowed when aligning the start / end tokens with the assertion annotations.

    mergeOverlapping

'true' to merge overlapping annotations; 'false' to keep both chunks in the same position.

    SDDLPath

The path used to load the sentence detector model; see com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector.load().

    Annotations
    @deprecated
    Deprecated
  7. new AnnotationToolJsonReader(assertionLabels: Array[String], excludedLabels: Array[String], cleanupMode: String, splitChars: Array[String], contextChars: Array[String], scheme: String, minCharsTol: Int, alignCharsTol: Int, mergeOverlapping: Boolean, SDDLPath: String)
    Annotations
    @deprecated
    Deprecated

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. val alignCharsTol: Int
  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. val assertionLabels: List[String]
  7. def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit
    Definition Classes
    CheckLicense
  8. def checkValidScope(scope: String): Unit
    Definition Classes
    CheckLicense
  9. def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit
    Definition Classes
    CheckLicense
  10. def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit
    Definition Classes
    CheckLicense
  11. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  12. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  13. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  14. val excludedLabels: List[String]
  15. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  16. val fs: FileSystem
  17. def generateAssertionTrainSet(df: DataFrame, sentenceCol: String = "sentence", assertionCol: String = "assertion_label"): DataFrame

    Generates assertion training data at token level.

    Using information from the sentence and assertion labels, this method generates a training data set with the following columns:

    • text: sentence text
    • target: the token text
    • label: the assertion label
    • start: start position of the token
    • end: end position of the token

    The tokens are identified internally with the constraints from the min_chars_tol and align_chars_tol parameters.
    df

    DataFrame with sentence and assertion annotations like the one returned by readDataset method

    sentenceCol

    Name of the column having Sentence Annotations

    assertionCol

    Name of the column having Assertion Annotations

    returns

    A dataframe to train an AssertionDL Model
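A minimal usage sketch, assuming `reader` is an instantiated AnnotationToolJsonReader and `df` was produced by the readDataset method with the default column names:

```scala
// df: DataFrame returned by reader.readDataset(...), containing
// "sentence" and "assertion_label" annotation columns (the defaults).
val assertionTrainSet = reader.generateAssertionTrainSet(
  df,
  sentenceCol = "sentence",
  assertionCol = "assertion_label"
)
// The resulting DataFrame (text, target, label, start, end) can then
// be used to train an AssertionDL model.
```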

  18. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  19. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  20. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  21. val mergeOverlapping: Boolean
  22. val minCharsTol: Int
  23. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  24. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  25. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  26. val pipelineModel: PipelineModel
  27. def readDataset(spark: SparkSession, path: String): DataFrame

The readDataset method uses the Spark session to read a JSON export from the Annotation Tool at the given path and provides a DataFrame with proper annotations to train Spark NLP and Spark NLP Enterprise annotators.

    spark

    SparkSession

    path

    Path to a JSON exported from the Annotation Tool

    returns

    A dataframe with NER, Assertion and RE annotations
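Putting it together, an end-to-end sketch (the label names and the file path are illustrative, not part of the API):

```scala
import com.johnsnowlabs.nlp.training.AnnotationToolJsonReader

// Hypothetical labels; use the ones from your NLP Lab project.
val reader = new AnnotationToolJsonReader(
  Array("Affirmed", "Negated"),   // assertionLabels
  Array.empty[String]             // excludedLabels
)

// "annotations.json" is an illustrative path to an NLP Lab export.
val df = reader.readDataset(spark, "annotations.json")
df.printSchema()  // columns with NER, Assertion and RE annotations
```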

  28. val scheme: String
  29. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  30. def toString(): String
    Definition Classes
    AnyRef → Any
  31. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  32. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  33. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()

Deprecated Value Members

  1. def generateConll(df: DataFrame, path: String, taskColumn: String = "task_id", tokenCol: String = "token", nerLabel: String = "ner_label"): Unit

    Saves a CoNLL file from the exported annotations.

    Annotations
    @deprecated
    Deprecated
  2. def generatePlainAssertionTrainSet(df: DataFrame, taskColumn: String = "task_id", tokenCol: String = "token", nerLabel: String = "ner_label", assertion_label: String = "assertion_label"): DataFrame

    Generates assertion training data at chunk level.

Using information from the sentence, task id (from NLP Lab), NER label, and assertion labels, this method generates a training data set with the following columns:

    • sentence: sentence text
    • begin: start position of the token
    • end: end position of the token
    • ner: the NER chunk
    • assertion: the assertion label

    Internally uses the NerConverterInternal to identify the NER chunks.

    Annotations
    @deprecated
    Deprecated
