class Annotation2Training extends CheckLicense

Converts annotation results from JSON or CSV files to a DataFrame suitable for NER training. Input files must have a structure similar to the one produced by John Snow Labs' Generative AI annotation tool.

Linear Supertypes
CheckLicense, AnyRef, Any

Instance Constructors

  1. new Annotation2Training(spark: SparkSession)

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit
    Definition Classes
    CheckLicense
  6. def checkValidScope(scope: String): Unit
    Definition Classes
    CheckLicense
  7. def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit
    Definition Classes
    CheckLicense
  8. def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit
    Definition Classes
    CheckLicense
  9. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  10. def convertCsv2NerDF(csvPath: String, pipelineModel: PipelineModel, repartition: Int = 32, tokenOutputCol: String = "token", nerLabelCol: String = "label", whiteList: List[String] = new util.ArrayList[String](), blackList: List[String] = new util.ArrayList[String](), replaceLabels: Map[String, String] = new util.HashMap[String, String]()): DataFrame

Converts a CSV file with annotation results to a DataFrame suitable for NER training.

    csvPath

    Path to the input CSV file. The file will be read with the spark.read.csv method with header, multiLine, quote and escape options set.

    pipelineModel

    A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler and a Tokenizer. The PipelineModel can also include a SentenceDetector, DocumentSplitter, WordEmbeddings, etc.

    repartition

    Number of partitions to use when reading the CSV file (default is 32).

    tokenOutputCol

    The name of the column containing token annotations (default is "token").

    nerLabelCol

    The name of the output column for NER labels (default is "label").

    returns

    A DataFrame to train NER models.
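
    A minimal usage sketch (hedged: the CSV path is hypothetical, and a licensed Spark NLP for Healthcare session providing Annotation2Training is assumed):

    ```scala
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.Tokenizer
    import org.apache.spark.ml.Pipeline

    // Minimal pipeline satisfying the requirement above:
    // a DocumentAssembler followed by a Tokenizer.
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")

    // Fit on an empty DataFrame; neither stage is trainable.
    import spark.implicits._
    val emptyDF = Seq.empty[String].toDF("text")
    val pipelineModel = new Pipeline()
      .setStages(Array(documentAssembler, tokenizer))
      .fit(emptyDF)

    val converter = new Annotation2Training(spark)
    val trainDF = converter.convertCsv2NerDF(
      csvPath = "dbfs:/annotations/batch1.csv", // hypothetical path
      pipelineModel = pipelineModel)
    ```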

  11. def convertJson2NerDF(inputPath: String, pipelineModel: PipelineModel, repartition: Int = 32, tokenOutputCol: String = "token", nerLabelCol: String = "label", whiteList: List[String] = new util.ArrayList[String](), blackList: List[String] = new util.ArrayList[String](), replaceLabels: Map[String, String] = new util.HashMap[String, String]()): DataFrame

Converts a JSON file with annotation results to a DataFrame suitable for NER training.

    inputPath

    Path to the input JSON file. The file will be read with the spark.read.json method with multiLine option set to true.

    pipelineModel

    A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler and a Tokenizer. The PipelineModel can also include a SentenceDetector, DocumentSplitter, WordEmbeddings, etc.

    repartition

    Number of partitions to use when reading the input file (default is 32).

    tokenOutputCol

    The name of the column containing token annotations (default is "token").

    nerLabelCol

    The name of the output column for NER labels (default is "label").

    returns

    A DataFrame to train NER models.
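
    A hedged sketch of the JSON variant, assuming converter is an Annotation2Training instance and pipelineModel a fitted PipelineModel with a DocumentAssembler and Tokenizer. The label values are hypothetical, and the whiteList/replaceLabels semantics are assumed from the parameter names:

    ```scala
    import scala.collection.JavaConverters._

    val trainDF = converter.convertJson2NerDF(
      inputPath = "dbfs:/annotations/batch1.json", // hypothetical path
      pipelineModel = pipelineModel,
      whiteList = List("PROBLEM", "TREATMENT").asJava,     // assumed: labels to keep
      replaceLabels = Map("Drug" -> "TREATMENT").asJava)   // assumed: rename mapping
    ```

    The parameters take java.util Collections (see the default values in the signature), so Scala callers convert with JavaConverters' asJava.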

  12. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  13. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  14. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  15. def generateConll(df: DataFrame, outputPath: String, labelCol: String = "label", docIdCol: String = "id"): Unit

Generates a CoNLL format file from a DataFrame containing Spark NLP compatible NER (Named Entity Recognition) annotations.

    This method processes a DataFrame with NER labels and writes the output in CoNLL-2003 format, which is commonly used for NER tasks. Each document is separated by a -DOCSTART- header, sentences are separated by blank lines, and each token is written with its corresponding label.

    df

    The input DataFrame containing NER annotations. Must include:

    • A document ID column (specified by docIdCol)
    • A label column (specified by labelCol)
    outputPath

    The file system path where the CoNLL file will be written. Supports various file systems:

    • Local: file:///path/to/output.conll
    • DBFS: dbfs:/mnt/path/to/output.conll

    If the file exists, it will be overwritten.
    labelCol

    The name of the column containing NER label annotations. Default is "label". The column must exist in the DataFrame.

    docIdCol

    The name of the column containing document identifiers. Default is "id". The column must exist in the DataFrame. **IMPORTANT**: All document IDs must be unique. Duplicate IDs will cause an IllegalArgumentException to be thrown.

    Example:

      // Example usage with default parameters
      val df = spark.read.parquet("/path/to/ner_annotations.parquet")
      generateConll(df, "file:///tmp/output.conll")
    Exceptions thrown

    IllegalArgumentException if:

    • The labelCol column does not exist in the DataFrame
    • The docIdCol column does not exist in the DataFrame
    • Duplicate document IDs are found in the docIdCol column

    java.io.IOException if there are file system errors during writing

    Note

    • The output follows CoNLL-2003 format: TOKEN -X- -X- LABEL
    • Documents are separated by -DOCSTART- -X- -DOCID- O headers
    • Sentences within documents are separated by blank lines
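
    Putting the members together, a hedged end-to-end sketch (assuming converter is an Annotation2Training instance, pipelineModel a fitted PipelineModel with a DocumentAssembler and Tokenizer, and hypothetical paths):

    ```scala
    // Convert annotations, inspect label balance, then export CoNLL-2003.
    val nerDF = converter.convertCsv2NerDF(
      csvPath = "dbfs:/annotations/batch1.csv",
      pipelineModel = pipelineModel)

    // Print entity label counts before training.
    converter.showLabelDistributions(nerDF)

    // Document IDs in the "id" column must be unique,
    // otherwise an IllegalArgumentException is thrown.
    converter.generateConll(nerDF, "dbfs:/training/batch1.conll")
    ```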
  16. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  17. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  18. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  19. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  20. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  21. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  22. def showLabelDistributions(df: DataFrame, nerLabelCol: String = "label"): Unit

Shows the distribution of entity labels in the training DataFrame.

    df

    The training DataFrame

    nerLabelCol

    The name of the column containing NER labels (default is "label")

  23. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  24. def toString(): String
    Definition Classes
    AnyRef → Any
  25. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  26. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  27. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
