class Annotation2Training extends CheckLicense

Converts annotation results from JSON or CSV files to a DataFrame suitable for NER training. Input files must have a structure similar to the one produced by John Snow Labs' Generative AI annotation tool.

Linear Supertypes
CheckLicense, AnyRef, Any

Instance Constructors

  1. new Annotation2Training(spark: SparkSession)

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit
    Definition Classes
    CheckLicense
  6. def checkValidScope(scope: String): Unit
    Definition Classes
    CheckLicense
  7. def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit
    Definition Classes
    CheckLicense
  8. def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit
    Definition Classes
    CheckLicense
  9. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  10. def convertCsv2NerDF(csvPath: String, pipelineModel: PipelineModel, repartition: Int = 32, tokenOutputCol: String = "token", nerLabelCol: String = "label", whiteList: List[String] = new util.ArrayList[String](), blackList: List[String] = new util.ArrayList[String](), replaceLabels: Map[String, String] = new util.HashMap[String, String]()): DataFrame

Converts a CSV file with annotation results to a DataFrame suitable for NER training.

    csvPath

    Path to the input CSV file. The file will be read with the spark.read.csv method with header, multiLine, quote and escape options set.

    pipelineModel

    A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler and a Tokenizer. The PipelineModel can also include a SentenceDetector, DocumentSplitter, WordEmbeddings, etc.

    repartition

    Number of partitions to use when reading the CSV file (default is 32).

    tokenOutputCol

    The name of the column containing token annotations (default is "token").

    nerLabelCol

    The name of the output column for NER labels (default is "label").

    returns

    A DataFrame to train NER models.
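
    A minimal usage sketch (hedged: the CSV path is hypothetical, and a licensed Spark NLP for Healthcare session providing Annotation2Training is assumed):

    ```scala
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import org.apache.spark.ml.Pipeline

    // Minimal pipeline satisfying the requirement above:
    // a DocumentAssembler followed by a Tokenizer.
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")

    // Fit on an empty DataFrame; neither stage is trainable.
    import spark.implicits._
    val emptyDF = Seq.empty[String].toDF("text")
    val pipelineModel = new Pipeline()
      .setStages(Array(documentAssembler, tokenizer))
      .fit(emptyDF)

    val converter = new Annotation2Training(spark)
    val trainDF = converter.convertCsv2NerDF(
      csvPath = "dbfs:/annotations/batch1.csv", // hypothetical path
      pipelineModel = pipelineModel)
    ```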

  11. def convertJson2NerDF(inputPath: String, pipelineModel: PipelineModel, repartition: Int = 32, tokenOutputCol: String = "token", nerLabelCol: String = "label", whiteList: List[String] = new util.ArrayList[String](), blackList: List[String] = new util.ArrayList[String](), replaceLabels: Map[String, String] = new util.HashMap[String, String]()): DataFrame

Converts a JSON file with annotation results to a DataFrame suitable for NER training.

    inputPath

    Path to the input JSON file. The file will be read with the spark.read.json method with multiLine option set to true.

    pipelineModel

    A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler and a Tokenizer. The PipelineModel can also include a SentenceDetector, DocumentSplitter, WordEmbeddings, etc.

    repartition

    Number of partitions to use when reading the input file (default is 32).

    tokenOutputCol

    The name of the column containing token annotations (default is "token").

    nerLabelCol

    The name of the output column for NER labels (default is "label").

    returns

    A DataFrame to train NER models.
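
    A hedged sketch of the JSON variant, assuming converter is an Annotation2Training instance and pipelineModel a fitted PipelineModel with a DocumentAssembler and Tokenizer. The label values are hypothetical, and the whiteList/replaceLabels semantics are assumed from the parameter names:

    ```scala
    import scala.collection.JavaConverters._

    val trainDF = converter.convertJson2NerDF(
      inputPath = "dbfs:/annotations/batch1.json", // hypothetical path
      pipelineModel = pipelineModel,
      whiteList = List("PROBLEM", "TREATMENT").asJava,     // assumed: labels to keep
      replaceLabels = Map("Drug" -> "TREATMENT").asJava)   // assumed: rename mapping
    ```

    The parameters take java.util Collections (see the default values in the signature), so Scala callers convert with JavaConverters' asJava.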

  12. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  13. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  14. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  15. def generateConll(df: DataFrame, outputPath: String, labelCol: String = "label", docIdCol: String = "id"): Unit

Generates a CoNLL format file from a DataFrame containing Spark NLP compatible NER (Named Entity Recognition) annotations.

    This method processes a DataFrame with NER labels and writes the output in CoNLL-2003 format, which is commonly used for NER tasks. Each document is separated by a -DOCSTART- header, sentences are separated by blank lines, and each token is written with its corresponding label.

    df

    The input DataFrame containing NER annotations. Must include:

    • A document ID column (specified by docIdCol)
    • A label column (specified by labelCol)
    outputPath

    The file system path where the CoNLL file will be written. Supports various file systems:

    • Local: file:///path/to/output.conll
    • DBFS: dbfs:/mnt/path/to/output.conll

    If the file exists, it will be overwritten.
    labelCol

    The name of the column containing NER label annotations. Default is "label". The column must exist in the DataFrame.

    docIdCol

    The name of the column containing document identifiers. Default is "id". The column must exist in the DataFrame. **IMPORTANT**: All document IDs must be unique. Duplicate IDs will cause an IllegalArgumentException to be thrown.

    Example:

      // Example usage with default parameters
      val df = spark.read.parquet("/path/to/ner_annotations.parquet")
      generateConll(df, "file:///tmp/output.conll")
    Exceptions thrown

    IllegalArgumentException if:

    • The labelCol column does not exist in the DataFrame
    • The docIdCol column does not exist in the DataFrame
    • Duplicate document IDs are found in the docIdCol column

    java.io.IOException if there are file system errors during writing

    Note

    • The output follows CoNLL-2003 format: TOKEN -X- -X- LABEL
    • Documents are separated by -DOCSTART- -X- -DOCID- O headers
    • Sentences within documents are separated by blank lines
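
    Putting the members together, a hedged end-to-end sketch (assuming converter is an Annotation2Training instance, pipelineModel a fitted PipelineModel with a DocumentAssembler and Tokenizer, and hypothetical paths):

    ```scala
    // Convert annotations, inspect label balance, then export CoNLL-2003.
    val nerDF = converter.convertCsv2NerDF(
      csvPath = "dbfs:/annotations/batch1.csv",
      pipelineModel = pipelineModel)

    // Print entity label counts before training.
    converter.showLabelDistributions(nerDF)

    // Document IDs in the "id" column must be unique,
    // otherwise an IllegalArgumentException is thrown.
    converter.generateConll(nerDF, "dbfs:/training/batch1.conll")
    ```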
  16. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  17. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  18. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  19. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  20. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  21. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  22. def showLabelDistributions(df: DataFrame, nerLabelCol: String = "label"): Unit

Shows the distribution of entity labels in the training DataFrame.

    df

    The training DataFrame

    nerLabelCol

    The name of the column containing NER labels (default is "label")

  23. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  24. def toString(): String
    Definition Classes
    AnyRef → Any
  25. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  26. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  27. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
