class Annotation2Training extends CheckLicense
Converts annotation results from JSON or CSV files to a DataFrame suitable for NER training. Input files must have a structure similar to the one produced by John Snow Labs' Generative AI annotation tool.
By Inheritance
- Annotation2Training
- CheckLicense
- AnyRef
- Any
Instance Constructors
- new Annotation2Training(spark: SparkSession)
Value Members
- final def !=(arg0: Any): Boolean
  Definition Classes: AnyRef → Any
- final def ##(): Int
  Definition Classes: AnyRef → Any
- final def ==(arg0: Any): Boolean
  Definition Classes: AnyRef → Any
- final def asInstanceOf[T0]: T0
  Definition Classes: Any
- def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit
  Definition Classes: CheckLicense
- def checkValidScope(scope: String): Unit
  Definition Classes: CheckLicense
- def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit
  Definition Classes: CheckLicense
- def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit
  Definition Classes: CheckLicense
- def clone(): AnyRef
  Attributes: protected[lang]
  Definition Classes: AnyRef
  Annotations: @throws( ... ) @native()
- def convertCsv2NerDF(csvPath: String, pipelineModel: PipelineModel, repartition: Int = 32, tokenOutputCol: String = "token", nerLabelCol: String = "label", whiteList: List[String] = new util.ArrayList[String](), blackList: List[String] = new util.ArrayList[String](), replaceLabels: Map[String, String] = new util.HashMap[String, String]()): DataFrame
  Converts a CSV file with annotation results to a DataFrame suitable for NER training.
  - csvPath
    Path to the input CSV file. The file is read with the spark.read.csv method with the header, multiLine, quote, and escape options set.
  - pipelineModel
    A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler and a Tokenizer. The PipelineModel can also include a SentenceDetector, DocumentSplitter, WordEmbeddings, etc.
  - repartition
    Number of partitions to use when reading the CSV file (default is 32).
  - tokenOutputCol
    The name of the column containing token annotations (default is "token").
  - nerLabelCol
    The name of the output column for NER labels (default is "label").
  - returns
    A DataFrame to train NER models.
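A usage sketch, assuming a live SparkSession named spark and the library on the classpath. The file path, the "PER" → "PERSON" rename, and the minimal pipeline below are illustrative, not part of this API:

```scala
// Hypothetical sketch: paths and label names are illustrative.
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import org.apache.spark.ml.Pipeline
import scala.collection.JavaConverters._   // scala.jdk.CollectionConverters on Scala 2.13+

// Minimal pipeline: DocumentAssembler + Tokenizer, as required by convertCsv2NerDF.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")   // must match tokenOutputCol (default "token")

import spark.implicits._
val pipelineModel = new Pipeline()
  .setStages(Array(documentAssembler, tokenizer))
  .fit(Seq.empty[String].toDF("text"))

val converter = new Annotation2Training(spark)
val trainDF = converter.convertCsv2NerDF(
  csvPath = "dbfs:/tmp/annotations.csv",          // illustrative path
  pipelineModel = pipelineModel,
  replaceLabels = Map("PER" -> "PERSON").asJava   // java.util.Map expected
)
```

Note that whiteList, blackList, and replaceLabels take java.util collections (per the defaults in the signature), so Scala collections need an asJava conversion.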
- def convertJson2NerDF(inputPath: String, pipelineModel: PipelineModel, repartition: Int = 32, tokenOutputCol: String = "token", nerLabelCol: String = "label", whiteList: List[String] = new util.ArrayList[String](), blackList: List[String] = new util.ArrayList[String](), replaceLabels: Map[String, String] = new util.HashMap[String, String]()): DataFrame
  Converts a JSON file with annotation results to a DataFrame suitable for NER training.
  - inputPath
    Path to the input JSON file. The file is read with the spark.read.json method with the multiLine option set to true.
  - pipelineModel
    A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler and a Tokenizer. The PipelineModel can also include a SentenceDetector, DocumentSplitter, WordEmbeddings, etc.
  - repartition
    Number of partitions to use when reading the input file (default is 32).
  - tokenOutputCol
    The name of the column containing token annotations (default is "token").
  - nerLabelCol
    The name of the output column for NER labels (default is "label").
  - returns
    A DataFrame to train NER models.
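A usage sketch, assuming spark and a fitted pipelineModel (at minimum DocumentAssembler + Tokenizer) are already in scope; the path and label names are illustrative:

```scala
// Hypothetical sketch: assumes `spark` and `pipelineModel` are in scope.
import scala.collection.JavaConverters._   // scala.jdk.CollectionConverters on Scala 2.13+

val converter = new Annotation2Training(spark)
val trainDF = converter.convertJson2NerDF(
  inputPath = "file:///tmp/annotations.json",   // illustrative path
  pipelineModel = pipelineModel,
  repartition = 16,                             // override the default of 32
  whiteList = List("PERSON", "DATE").asJava     // keep only these labels
)
trainDF.show(5)
```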
- final def eq(arg0: AnyRef): Boolean
  Definition Classes: AnyRef
- def equals(arg0: Any): Boolean
  Definition Classes: AnyRef → Any
- def finalize(): Unit
  Attributes: protected[lang]
  Definition Classes: AnyRef
  Annotations: @throws( classOf[java.lang.Throwable] )
- def generateConll(df: DataFrame, outputPath: String, labelCol: String = "label", docIdCol: String = "id"): Unit
  Generates a CoNLL-format file from a DataFrame containing Spark NLP compatible NER (Named Entity Recognition) annotations.
  This method processes a DataFrame with NER labels and writes the output in CoNLL-2003 format, which is commonly used for NER tasks. Each document is separated by a -DOCSTART- header, sentences are separated by blank lines, and each token is written with its corresponding label.
  - df
    The input DataFrame containing NER annotations. Must include:
    - A document ID column (specified by docIdCol)
    - A label column (specified by labelCol)
  - outputPath
    The file system path where the CoNLL file will be written. Supported file systems include:
    - Local: file:///path/to/output.conll
    - DBFS: dbfs:/mnt/path/to/output.conll
    If the file exists, it will be overwritten.
  - labelCol
    The name of the column containing NER label annotations. Default is "label". The column must exist in the DataFrame.
  - docIdCol
    The name of the column containing document identifiers. Default is "id". The column must exist in the DataFrame. **IMPORTANT**: All document IDs must be unique; duplicate IDs will cause an IllegalArgumentException to be thrown.
  Example:
    // Example usage with default parameters
    val df = spark.read.parquet("/path/to/ner_annotations.parquet")
    generateConll(df, "file:///tmp/output.conll")
  - Exceptions thrown
    IllegalArgumentException if:
    - The labelCol column does not exist in the DataFrame
    - The docIdCol column does not exist in the DataFrame
    - Duplicate document IDs are found in the docIdCol column
    java.io.IOException if there are file system errors during writing
  - Note
    The output follows CoNLL-2003 format: TOKEN -X- -X- LABEL. Documents are separated by -DOCSTART- -X- -DOCID- O headers, and sentences within documents are separated by blank lines.
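A sketch of the conversion-then-export flow, assuming spark, a fitted pipelineModel, and unique document IDs; the token rows shown in comments are illustrative of the documented format, not real output:

```scala
// Hypothetical sketch: assumes `spark` and `pipelineModel` are in scope.
val converter = new Annotation2Training(spark)
val trainDF = converter.convertJson2NerDF("file:///tmp/annotations.json", pipelineModel)

// Throws IllegalArgumentException on duplicate document IDs or missing columns.
converter.generateConll(trainDF, "file:///tmp/train.conll")

// Resulting file shape (TOKEN -X- -X- LABEL, per the CoNLL-2003 note), e.g.:
// -DOCSTART- -X- -DOCID- O
//
// John  -X- -X- B-PER
// lives -X- -X- O
// in    -X- -X- O
// Paris -X- -X- B-LOC
```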
- final def getClass(): Class[_]
  Definition Classes: AnyRef → Any
  Annotations: @native()
- def hashCode(): Int
  Definition Classes: AnyRef → Any
  Annotations: @native()
- final def isInstanceOf[T0]: Boolean
  Definition Classes: Any
- final def ne(arg0: AnyRef): Boolean
  Definition Classes: AnyRef
- final def notify(): Unit
  Definition Classes: AnyRef
  Annotations: @native()
- final def notifyAll(): Unit
  Definition Classes: AnyRef
  Annotations: @native()
- def showLabelDistributions(df: DataFrame, nerLabelCol: String = "label"): Unit
  Shows the distribution of entity labels in the training DataFrame.
  - df
    The training DataFrame.
  - nerLabelCol
    The name of the column containing NER labels (default is "label").
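A sketch of using this as a sanity check before training, assuming spark and a fitted pipelineModel; the path is illustrative, and the exact output layout is not specified by this page:

```scala
// Hypothetical sketch: inspect label balance before training a NER model.
val converter = new Annotation2Training(spark)
val trainDF = converter.convertCsv2NerDF("dbfs:/tmp/annotations.csv", pipelineModel)

// Shows counts per NER label (e.g. B-PERSON, I-PERSON, O); useful for
// spotting under-represented entities before training.
converter.showLabelDistributions(trainDF)
```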
- final def synchronized[T0](arg0: ⇒ T0): T0
  Definition Classes: AnyRef
- def toString(): String
  Definition Classes: AnyRef → Any
- final def wait(): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... )
- final def wait(arg0: Long, arg1: Int): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... )
- final def wait(arg0: Long): Unit
  Definition Classes: AnyRef
  Annotations: @throws( ... ) @native()