class AnnotationToolJsonReader extends CheckLicense
Reads and processes the exported JSON file from NLP Lab.
Reader class that parses relevant information exported from NLP Lab into different
formats. The reader can be used to create a training dataset for training assertion
status models (using the generateAssertionTrainSet method) or NER models (in the
CoNLL format, using the generateConll method).
To generate the assertion data, the following attributes need to be specified when
instantiating the class:
- assertion_labels
: The assertion labels to use.
- excluded_labels
: The assertion labels that are excluded from the training dataset creation (can be an empty list).
Linear Supertypes
- CheckLicense
- AnyRef
- Any
Instance Constructors
- new AnnotationToolJsonReader(assertionLabels: Array[String])
- new AnnotationToolJsonReader(pipelineModel: PipelineModel, assertionLabels: List[String])
- new AnnotationToolJsonReader(assertionLabels: Array[String], excludedLabels: Array[String])
- new AnnotationToolJsonReader(pipelineModel: PipelineModel, assertionLabels: Array[String], excludedLabels: Array[String], scheme: String, minCharsTol: Int, alignCharsTol: Int, mergeOverlapping: Boolean)
- new AnnotationToolJsonReader(pipelineModel: PipelineModel, assertionLabels: List[String], excludedLabels: List[String], scheme: String, minCharsTol: Int, alignCharsTol: Int, mergeOverlapping: Boolean)
- pipelineModel
The pipeline model that is used to create the documents, the sentences, and the tokens. An example of such a pipeline could be the following:

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")

    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")

    val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer))
- assertionLabels
The labels that will be interpreted as Assertion labels.
- excludedLabels
The labels that will be excluded from the NER / Assertion Annotations.
- scheme
The scheme used to create the IOB tags (IOB or BIOES).
- minCharsTol
The minimum length of a token for it to be considered a label.
- alignCharsTol
The maximum number of characters of difference tolerated when aligning start / end tokens with the Assertion Annotations.
- mergeOverlapping
'true' to merge overlapping annotations, prioritizing the longest and most diverse annotations; 'false' to keep both chunks in the same position.
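The mergeOverlapping behavior can be pictured with a short pure-Scala sketch of the "keep the longest among overlapping annotations" idea. The names and logic here are hypothetical illustrations, not the reader's internal code:

```scala
// Hypothetical sketch: among overlapping annotations, keep only the longest one.
case class Ann(begin: Int, end: Int, label: String)

object MergeOverlapping {
  def merge(anns: Seq[Ann]): Seq[Ann] =
    anns
      .sortBy(a => -(a.end - a.begin))          // longest first
      .foldLeft(List.empty[Ann]) { (kept, a) =>
        // Drop `a` if it overlaps an already-kept (longer) annotation
        val overlaps = kept.exists(k => a.begin <= k.end && k.begin <= a.end)
        if (overlaps) kept else a :: kept
      }
      .sortBy(_.begin)
}
```

For example, `MergeOverlapping.merge(Seq(Ann(0, 10, "Problem"), Ann(5, 8, "Test")))` keeps only the longer `Ann(0, 10, "Problem")`, while non-overlapping annotations survive untouched.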
- new AnnotationToolJsonReader(assertionLabels: List[String] = List.empty[String].asJava, excludedLabels: List[String] = List.empty[String].asJava, cleanupMode: String = "disabled", splitChars: List[String] = List().asJava, contextChars: List[String] = List().asJava, scheme: String = "IOB", minCharsTol: Int = 2, alignCharsTol: Int = 1, mergeOverlapping: Boolean = true, SDDLPath: String = "")
The annotation tool JSON reader generates an assertion training set from the JSON exported from NLP Lab (Annotation Lab).
- assertionLabels
The assertion labels used for the training dataset creation.
- excludedLabels
The NER or Assertion labels to be excluded from the training dataset creation.
- cleanupMode
The clean mode that is used in the DocumentAssembler transformer.
- splitChars
The split chars that are used in the default tokenizer. See com.johnsnowlabs.nlp.annotators.Tokenizer.setSplitChars().
- contextChars
The context chars that are used in the default tokenizer. See com.johnsnowlabs.nlp.annotators.Tokenizer.setContextChars().
- scheme
The scheme used to create the IOB tags (IOB or BIOES).
- minCharsTol
The minimum token length for the alignment tolerance feature to apply.
- alignCharsTol
The maximum number of characters of difference tolerated when aligning start / end tokens with the assertion annotations.
- mergeOverlapping
'true' to merge the overlapping annotations, 'false' to keep both chunks in the same position.
- SDDLPath
The path used to load a pretrained sentence detector model via com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector.load().
- Annotations
- @deprecated
- Deprecated
- new AnnotationToolJsonReader(assertionLabels: Array[String], excludedLabels: Array[String], cleanupMode: String, splitChars: Array[String], contextChars: Array[String], scheme: String, minCharsTol: Int, alignCharsTol: Int, mergeOverlapping: Boolean, SDDLPath: String)
- Annotations
- @deprecated
- Deprecated
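The scheme parameter selects between the two tagging formats. A pure-Scala sketch of how IOB vs BIOES tags a single labeled chunk (an illustration of the formats only, not the reader's internal code; names are hypothetical):

```scala
// Illustrative sketch: tag the tokens of one labeled chunk in IOB or BIOES format.
object TaggingSchemes {
  def tag(tokens: Seq[String], label: String, scheme: String): Seq[String] =
    scheme match {
      case "IOB" =>
        // First token gets B- (begin), the rest I- (inside)
        tokens.zipWithIndex.map { case (_, i) =>
          if (i == 0) s"B-$label" else s"I-$label"
        }
      case "BIOES" =>
        // Single-token chunks get S-; multi-token chunks get B- ... E- with I- between
        tokens.zipWithIndex.map { case (_, i) =>
          if (tokens.size == 1) s"S-$label"
          else if (i == 0) s"B-$label"
          else if (i == tokens.size - 1) s"E-$label"
          else s"I-$label"
        }
    }
}
```

For example, `TaggingSchemes.tag(Seq("left", "atrial", "dilatation"), "Problem", "BIOES")` yields `Seq("B-Problem", "I-Problem", "E-Problem")`, while the same chunk under "IOB" yields `Seq("B-Problem", "I-Problem", "I-Problem")`.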
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##(): Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- val alignCharsTol: Int
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- val assertionLabels: List[String]
- def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit
- Definition Classes
- CheckLicense
- def checkValidScope(scope: String): Unit
- Definition Classes
- CheckLicense
- def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit
- Definition Classes
- CheckLicense
- def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit
- Definition Classes
- CheckLicense
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- val excludedLabels: List[String]
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
- val fs: FileSystem
- def generateAssertionTrainSet(df: DataFrame, sentenceCol: String = "sentence", assertionCol: String = "assertion_label"): DataFrame
Generates assertion training data at token level.
Using information from the sentence and assertion labels, this method generates a training dataset with the following columns:
- text: sentence text
- target: the token text
- label: the assertion label
- start: start position of the token
- end: end position of the token
The tokens are identified internally with the constraints from the minCharsTol and alignCharsTol parameters.
- df
DataFrame with sentence and assertion annotations like the one returned by readDataset method
- sentenceCol
Name of the column having Sentence Annotations
- assertionCol
Name of the column having Assertion Annotations
- returns
A dataframe to train an AssertionDL Model
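The minCharsTol and alignCharsTol constraints can be pictured with a small sketch. This is hypothetical logic inferred from the parameter descriptions above, not the actual internals: a token is matched to an assertion annotation when its boundaries fall within alignCharsTol characters of the annotation's and the token is at least minCharsTol characters long.

```scala
// Hypothetical illustration of the minCharsTol / alignCharsTol constraints.
object TokenAlignment {
  def aligns(tokBegin: Int, tokEnd: Int, annBegin: Int, annEnd: Int,
             minCharsTol: Int = 2, alignCharsTol: Int = 1): Boolean = {
    // Token must be long enough to participate in alignment at all
    val longEnough = (tokEnd - tokBegin + 1) >= minCharsTol
    // Both boundaries must be within the character tolerance of the annotation's
    longEnough &&
      math.abs(tokBegin - annBegin) <= alignCharsTol &&
      math.abs(tokEnd - annEnd) <= alignCharsTol
  }
}
```

With the defaults, a token at characters 10-15 aligns with an annotation at 11-16 (both boundaries are off by one), but not with one at 13-18.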
- final def getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- val mergeOverlapping: Boolean
- val minCharsTol: Int
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- val pipelineModel: PipelineModel
- def readDataset(spark: SparkSession, path: String): DataFrame
The readDataset method uses the Spark session to access a path to a JSON exported from the Annotation Tool and provides a DataFrame with proper annotations to train Spark NLP and Spark NLP Enterprise annotators.
- spark
SparkSession
- path
Path to a JSON exported from the Annotation Tool
- returns
A dataframe with NER, Assertion and RE annotations
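Putting readDataset and generateAssertionTrainSet together, a typical usage sketch might look like the following. It requires a licensed Spark NLP for Healthcare environment; the import path, the labels, and the file path are placeholders to adjust for your setup:

```scala
import scala.jdk.CollectionConverters._
// Package path assumed; adjust to your Spark NLP for Healthcare version.
import com.johnsnowlabs.nlp.training.AnnotationToolJsonReader

val reader = new AnnotationToolJsonReader(
  assertionLabels = List("present", "absent").asJava,
  excludedLabels = List.empty[String].asJava
)

// spark: an active SparkSession; the path points at an NLP Lab JSON export
val df = reader.readDataset(spark, "annotations/export.json")

// Token-level assertion training set, ready to train an AssertionDL model
val assertionDf = reader.generateAssertionTrainSet(df)
```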
- val scheme: String
- final def synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
Deprecated Value Members
- def generateConll(df: DataFrame, path: String, taskColumn: String = "task_id", tokenCol: String = "token", nerLabel: String = "ner_label"): Unit
Saves a CoNLL file from the exported annotations.
- Annotations
- @deprecated
- Deprecated
- def generatePlainAssertionTrainSet(df: DataFrame, taskColumn: String = "task_id", tokenCol: String = "token", nerLabel: String = "ner_label", assertion_label: String = "assertion_label"): DataFrame
Generates assertion training data at chunk level.
Using information from the sentence, task id (from NLP Lab), NER label, and assertion labels, this method generates a training dataset with the following columns:
- sentence: sentence text
- begin: start position of the token
- end: end position of the token
- ner: the NER chunk
- assertion: the assertion label
Internally uses the NerConverterInternal to identify the NER chunks.
- Annotations
- @deprecated
- Deprecated
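The chunk-level view can be illustrated with a small sketch of how B-/I- tagged tokens collapse into labeled chunks. This is a simplified stand-in for what NerConverterInternal does, not its actual implementation:

```scala
// Simplified sketch: collapse IOB-tagged (token, tag) pairs into (chunkText, label) pairs.
object ChunkFromIob {
  def chunks(tokens: Seq[(String, String)]): Seq[(String, String)] = {
    val out = scala.collection.mutable.ListBuffer.empty[(String, String)]
    var current: Option[(StringBuilder, String)] = None
    def flush(): Unit = current.foreach { case (sb, lab) => out += ((sb.toString, lab)) }
    for ((tok, tag) <- tokens) {
      if (tag.startsWith("B-")) {
        // B- always starts a new chunk
        flush(); current = Some((new StringBuilder(tok), tag.drop(2)))
      } else if (tag.startsWith("I-") && current.exists(_._2 == tag.drop(2))) {
        // I- with a matching label continues the current chunk
        current.foreach(_._1.append(" ").append(tok))
      } else {
        // O (or a mismatched I-) closes any open chunk
        flush(); current = None
      }
    }
    flush()
    out.toList
  }
}
```

For example, tokens tagged `("chest", "B-Problem"), ("pain", "I-Problem"), ("denies", "O")` collapse into the single chunk `("chest pain", "Problem")`.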