package ner
- Alphabetic
- Public
- All
Type Members
-
case class
ClassificationConfig(task: String, labels: Seq[String], multiLabel: Boolean = false, clsThreshold: Double = 0.5, trueLabel: Seq[String] = Seq("N/A")) extends Product with Serializable
Configuration for classification task in schema output.
- task
Task name
- labels
Sequence of label names
- multiLabel
Whether this is a multi-label classification
- clsThreshold
Confidence threshold
- trueLabel
Labels that represent "true" in binary classification
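As a rough illustration of how multiLabel and clsThreshold could interact when turning per-label scores into predictions (a hypothetical predict helper, not the model's actual decision logic, which lives in the ONNX post-processing):

```scala
// Hypothetical sketch only: multi-label keeps every label scoring at or
// above clsThreshold; single-label takes the argmax (threshold assumed unused).
def predict(scores: Map[String, Double], multiLabel: Boolean, clsThreshold: Double): Seq[String] =
  if (multiLabel) scores.collect { case (label, s) if s >= clsThreshold => label }.toSeq.sorted
  else Seq(scores.maxBy(_._2)._1)
```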
-
case class
ClassificationTaskSpec(name: String, multiLabel: Boolean = false) extends Product with Serializable
Specification for classification task from DSL string.
Format: "task_name" or "task_name::multi" for multi-label
Examples:
- "sentiment" → ClassificationTaskSpec("sentiment", false)
- "intent::multi" → ClassificationTaskSpec("intent", true)
- name
Task name
- multiLabel
Whether this is a multi-label classification task
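The "task_name" / "task_name::multi" format above can be parsed with a few lines of plain Scala; this is a minimal sketch reproducing the documented examples, not the library's internal parser:

```scala
case class ClassificationTaskSpec(name: String, multiLabel: Boolean = false)

// Split on the first "::"; a trailing "multi" flags multi-label classification.
def parseTaskSpec(spec: String): ClassificationTaskSpec =
  spec.split("::", 2) match {
    case Array(name, "multi") => ClassificationTaskSpec(name, multiLabel = true)
    case Array(name)          => ClassificationTaskSpec(name)
    case Array(name, _)       => ClassificationTaskSpec(name) // unknown suffix: ignored
  }
```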
-
case class
EntityMetadata(dtype: String, threshold: Option[Double]) extends ValidateFields with Product with Serializable
Metadata for entity extraction configuration.
- dtype
Data type: "str" (single instance) or "list" (multiple instances)
- threshold
Optional confidence threshold for extraction
-
case class
EntitySpec(name: String, dtype: String = "list", description: Option[String] = None) extends Product with Serializable
Specification for entity extraction from DSL string.
Format: "entity_name" or "entity_name::dtype::description"
Examples:
- "person" → EntitySpec("person", "list", None)
- "person::Names of people" → EntitySpec("person", "list", Some("Names of people"))
- "company::str::Single company name" → EntitySpec("company", "str", Some("Single company name"))
- name
Entity name
- dtype
Data type: "str" (single instance) or "list" (multiple instances)
- description
Optional description
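A minimal sketch of how the entity DSL string above could be interpreted (hypothetical helper, not the library's internal parser): a second segment that is exactly "str" or "list" is a dtype, anything else is a description.

```scala
case class EntitySpec(name: String, dtype: String = "list", description: Option[String] = None)

def parseEntitySpec(spec: String): EntitySpec = spec.split("::", 3) match {
  case Array(name)                                    => EntitySpec(name)
  case Array(name, dt) if dt == "str" || dt == "list" => EntitySpec(name, dt)
  case Array(name, desc)                              => EntitySpec(name, description = Some(desc))
  case Array(name, dt, desc)                          => EntitySpec(name, dt, Some(desc))
}
```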
-
case class
FieldMetadata(dtype: String, threshold: Option[Double], choices: Option[Seq[String]], validators: Seq[RegexValidator]) extends ValidateFields with Product with Serializable
Metadata for structure field configuration.
- dtype
Data type: "str" (single value) or "list" (multiple values)
- threshold
Optional confidence threshold for extraction
- choices
Optional sequence of valid choices (constrains output)
- validators
Sequence of regex validators for post-processing
-
case class
FieldSpec(name: String, dtype: String = "list", choices: Option[Seq[String]] = None, description: Option[String] = None) extends Product with Serializable
Specification for structure field from DSL string.
Format: "field_name::type::description" or "field_name::[choice1|choice2]::type::description"
Examples:
- "name" → FieldSpec("name", "list", None, None)
- "price::str" → FieldSpec("price", "str", None, None)
- "category::[electronics|software]" → FieldSpec("category", "str", Some(Seq("electronics", "software")), None)
- "features::list::Product features" → FieldSpec("features", "list", None, Some("Product features"))
- name
Field name
- dtype
Data type: "str" (single value) or "list" (multiple values)
- choices
Optional sequence of valid choices (forces dtype to "str")
- description
Optional field description
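The field DSL above adds one wrinkle over the entity DSL: a bracketed "[choice1|choice2]" segment supplies choices and forces dtype to "str". A hypothetical parser sketch reproducing the documented examples:

```scala
case class FieldSpec(name: String, dtype: String = "list",
                     choices: Option[Seq[String]] = None, description: Option[String] = None)

// Fold each "::"-separated segment into the spec: brackets → choices (dtype
// forced to "str"), "str"/"list" → dtype, anything else → description.
def parseFieldSpec(spec: String): FieldSpec = {
  val name :: rest = spec.split("::").toList
  rest.foldLeft(FieldSpec(name)) { (fs, part) =>
    if (part.startsWith("[") && part.endsWith("]"))
      fs.copy(dtype = "str", choices = Some(part.drop(1).dropRight(1).split('|').toSeq))
    else if (part == "str" || part == "list") fs.copy(dtype = part)
    else fs.copy(description = Some(part))
  }
}
```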
-
sealed
trait
FieldValue extends AnyRef
Field value in a JSON structure - either a simple field or one with choices.
- case class FieldWithChoices(value: String, choices: Seq[String]) extends FieldValue with Product with Serializable
-
case class
Gliner2Classification(taskName: String, label: String, score: Float) extends Product with Serializable
Classification result for one task.
- taskName
Classification task name
- label
Predicted label
- score
Confidence score (0.0 to 1.0)
-
case class
Gliner2ClassifierInput(schemaEmbeddings: Array[Array[Float]], labelNames: Array[String], taskName: String, isMultiLabel: Boolean) extends Product with Serializable
Input for classification ONNX.
Used to classify text into one or more labels.
- schemaEmbeddings
Schema embeddings (num_classes, 768)
- labelNames
Label names for each class
- taskName
Classification task name
- isMultiLabel
Whether this is multi-label classification
-
case class
Gliner2Config(maxWidth: Int = 8, tokenPooling: String = "first") extends Serializable with Product
Configuration for span generation and representation.
- maxWidth
Maximum span width (number of whole tokens in a span)
- tokenPooling
Pooling method for subword embeddings: "first", "mean", or "max"
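The three pooling methods can be sketched in plain Scala (a hypothetical pool helper over subword vectors; the model's real aggregation happens inside Gliner2EmbeddingExtractor):

```scala
// "first" keeps the first subword vector; "mean"/"max" reduce element-wise.
def pool(subwords: Seq[Array[Float]], method: String): Array[Float] = method match {
  case "first" => subwords.head
  case "mean" =>
    val dim = subwords.head.length
    Array.tabulate(dim)(i => subwords.map(_(i)).sum / subwords.size)
  case "max" =>
    val dim = subwords.head.length
    Array.tabulate(dim)(i => subwords.map(_(i)).max)
  case other => sys.error(s"unknown pooling: $other")
}
```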
- class Gliner2DataProcessor extends Serializable
-
class
Gliner2EmbeddingExtractor extends Serializable
Extracts token and schema embeddings from encoder output. Mirrors Python's extract_embeddings_from_batch().
This component:
- Uses mapped_indices to separate text tokens from schema tokens
- Aggregates subword embeddings into word-level embeddings
- Extracts special token embeddings for schema tasks
-
case class
Gliner2EncoderOutput(lastHiddenState: Array[Array[Array[Float]]], batch: Gliner2PreprocessedBatch) extends Product with Serializable
Output from encoder ONNX model. Contains raw embeddings plus metadata needed for downstream processing.
- lastHiddenState
Encoder embeddings (batch, seq_len, 768)
- batch
Original preprocessed batch for metadata
-
class
Gliner2ExtractionResult extends AnyRef
Complete extraction result for one sample.
Contains all extracted information across all task types.
-
case class
Gliner2PreprocessedBatch(inputIds: Array[Array[Long]], attentionMask: Array[Array[Long]], mappedIndices: Array[Array[(String, Int, Int)]], schemaCounts: Array[Int], originalLengths: Array[Int], taskTypes: Array[Array[String]], wordTokens: Array[Array[String]], schemaTokensList: Array[Array[Array[String]]], startMappings: Array[Array[Int]], endMappings: Array[Array[Int]], originalTexts: Array[String], originalSchemas: Array[Gliner2Schema], structureLabels: Array[Any] = Array.empty) extends Product with Serializable
Batch of preprocessed inputs ready for ONNX encoder. Maps 1:1 to Python's PreprocessedBatch.
This is the output of Gliner2DataProcessor.prepareInputs() and the input to the ONNX encoder model.
- inputIds
Token IDs for encoder input (batch, max_seq_len)
- attentionMask
Attention mask for encoder (batch, max_seq_len)
- mappedIndices
Token mappings: (seg_type, orig_idx, schema_idx)
- seg_type: "schema" or "text"
- orig_idx: Original token index in text or schema
- schema_idx: Which schema this token belongs to (for schema tokens)
- schemaCounts
Number of schemas per sample
- originalLengths
Original sequence lengths per sample
- taskTypes
Task types per schema per sample
- wordTokens
Original text tokens per sample
- schemaTokensList
Schema tokens per sample
- startMappings
Token char start positions per sample
- endMappings
Token char end positions per sample
- originalTexts
Original text strings
- originalSchemas
Original schema dictionaries
- structureLabels
Ground truth labels (training only, can be empty for inference)
-
case class
Gliner2ProcessedEmbeddings(tokenEmbeddings: Array[Array[Float]], schemaEmbeddings: Array[Array[Array[Float]]], textTokens: Array[String], schemaTokensList: Array[Array[String]], taskTypes: Array[String], startMapping: Array[Int], endMapping: Array[Int], originalText: String, originalSchema: Gliner2Schema, sampleIndex: Int) extends Product with Serializable
Per-sample embeddings extracted from encoder output. Splits aggregated token and schema embeddings and returns them separately.
Python Reference: _extract_embeddings_from_batch().
- tokenEmbeddings
Word-level text embeddings (text_len, 768)
- schemaEmbeddings
Schema embeddings per task (num_schemas, num_tokens, 768)
- textTokens
Original text tokens
- schemaTokensList
Schema tokens per task
- taskTypes
Task types (e.g., "entities", "classifications")
- startMapping
Character start positions for tokens
- endMapping
Character end positions for tokens
- originalText
Original text string
- originalSchema
Original schema dictionary
- sampleIndex
Index of this sample in the batch
- case class Gliner2RelationResult(label: String, head: Gliner2SpanResult, tail: Gliner2SpanResult) extends Product with Serializable
-
case class
Gliner2Schema(structures: List[StructureConfig], classifications: List[ClassificationConfig], entities: ListMap[String, String], relations: List[RelationConfig], structureDescriptions: Map[String, ListMap[String, String]], entityDescriptions: ListMap[String, String], entityMetadata: Map[String, EntityMetadata] = Map.empty, entityOrder: Seq[String] = Seq.empty, relationMetadata: Map[String, RelationMetadata] = Map.empty, fieldOrders: Map[String, Seq[String]] = Map.empty, fieldMetadata: Map[String, FieldMetadata] = Map.empty) extends Product with Serializable
Completed schema from the builder.
- structures
List of JSON structures for structured data extraction
- classifications
List of classification tasks
- entities
Map of entity names to descriptions
- relations
List of relation configurations
- structureDescriptions
Map of structure names to field descriptions
- entityDescriptions
Map of entity names to descriptions
- entityMetadata
Per-entity dtype and threshold overrides
- entityOrder
Ordered sequence of entity names for extraction
- relationMetadata
Per-relation threshold overrides
- fieldOrders
Per-structure/relation ordered field sequences
- fieldMetadata
Per-field dtype, threshold, and validator overrides (keyed as "parent.field")
-
class
Gliner2SchemaBuilder extends AnyRef
Schema builder for extraction tasks. Formats the schema for inference with the ONNX model.
Provides a fluent API for building extraction schemas that include:
- Entity extraction configurations
- Classification tasks
- Relation extraction
- Structured data extraction with fields
- Validation rules
Example:
val schema = new Schema()
  .entities(Map(
    "person" -> "Names of people",
    "company" -> "Organization names"
  ))
  .classification("sentiment", List("positive", "negative", "neutral"))
  .relations(List("works_for", "founded"))
  .structure("product_info")
    .field("name", dtype = "str")
    .field("price", dtype = "str")
    .field("availability", choices = Some(List("in_stock", "out_of_stock")))
  .build()
-
class
Gliner2SpanGenerator extends Serializable
Generates span indices and masks for span-based tasks.
Creates all possible consecutive token sequences up to maxWidth. Invalid spans (extending beyond text length) are masked.
-
case class
Gliner2SpanInfo(spanIdx: Array[Array[Array[Long]]], spanMask: Array[Array[Boolean]], spanRep: Option[Array[Array[Array[Float]]]], numWords: Int, maxWidth: Int) extends Product with Serializable
Span indices and representations for a single sample. Used for entity, relation, and structure extraction tasks.
Spans are generated for all possible consecutive token sequences up to maxWidth. Invalid spans (extending beyond text length) are masked.
- spanIdx
Span indices (num_words, max_width, 2) - start and end positions
- spanMask
Validity mask (num_words, max_width) - true for valid spans
- spanRep
Span representations from ONNX (num_words, max_width, 768) - optional
- numWords
Number of words in the text
- maxWidth
Maximum span width
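The span enumeration described above can be sketched in plain Scala (a hypothetical generateSpans helper, not the actual Gliner2SpanGenerator API; inclusive end indices are assumed): for every start word and every width up to maxWidth, emit (start, end) and a validity flag.

```scala
// Spans extending beyond numWords are kept in the grid but masked invalid,
// so the shapes stay rectangular: (numWords, maxWidth).
def generateSpans(numWords: Int, maxWidth: Int): (Array[Array[(Int, Int)]], Array[Array[Boolean]]) = {
  val idx  = Array.tabulate(numWords, maxWidth)((i, w) => (i, i + w)) // width w+1 covers words i..i+w
  val mask = Array.tabulate(numWords, maxWidth)((i, w) => i + w < numWords)
  (idx, mask)
}
```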
-
case class
Gliner2SpanResult(label: String, startIdx: Int, endIdx: Int, startChar: Int, endChar: Int, score: Float, text: String, tokens: Array[String]) extends Product with Serializable
Extracted span result (entity, relation, or structure field).
Represents a single extracted span with position and score information.
- label
Entity/relation/field type
- startIdx
Start token index
- endIdx
End token index (exclusive)
- startChar
Start character position
- endChar
End character position (exclusive)
- score
Confidence score (0.0 to 1.0)
- text
Extracted text
- tokens
Tokens in span
- case class GlinerConfig(maxWidth: Int = 12, entToken: String = "<<ENT>>", entTokenId: Long = 128002L, sepToken: String = "<<SEP>>", sepTokenId: Long = 128003L, maxLen: Int = 384) extends Serializable with Product
- case class GlinerData(tokens: Array[String], tokenStarts: Array[Int], tokenEnds: Array[Int], tokenIds: Array[Long], tokenTypeIds: Array[Long], attentionMask: Array[Long], wordsMask: Array[Long], spanIdx: Array[Array[Long]], spanMask: Array[Boolean], textLength: Array[Long], idToClasses: Map[Long, String], classesToId: Map[String, Long]) extends Serializable with Product
- class GlinerDataProcessor extends Serializable
- trait GlinerModel extends Serializable
- case class GlinerResult(entity: String, start: Int, end: Int, score: Float, tokensInSpan: List[String]) extends Product with Serializable
-
case class
GraphInfo(path: String, fileTags: Int, fileEmbeddingsNDims: Int, fileNChars: Int) extends Product with Serializable
- Attributes
- protected
-
class
IOBTagger extends AnnotatorModel[IOBTagger] with CheckLicense with HasSimpleAnnotate[IOBTagger]
Merges token tags and NER labels from chunks in the specified format. For example, the output columns of NerConverter and Tokenizer can be used as inputs to merge.
Example
Pipeline stages are defined where NER is done. NER is converted to chunks.
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val docAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new Tokenizer().setInputCols("sentence").setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setOutputCol("embs")
val nerModel = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models").setInputCols("sentence", "token", "embs").setOutputCol("ner")
val nerConverter = new NerConverter().setInputCols("sentence", "token", "ner").setOutputCol("ner_chunk")
Define the IOB tagger, which needs tokens and chunks as input. Show results.
val iobTagger = new IOBTagger().setInputCols("token", "ner_chunk").setOutputCol("ner_label")
val pipeline = new Pipeline().setStages(Array(docAssembler, sentenceDetector, tokenizer, embeddings, nerModel, nerConverter, iobTagger))

result.selectExpr("explode(ner_label) as a")
  .selectExpr("a.begin", "a.end", "a.result as chunk", "a.metadata.word as word")
  .where("chunk!='O'").show(5, false)

+-----+---+-----------+-----------+
|begin|end|chunk      |word       |
+-----+---+-----------+-----------+
|5    |15 |B-Age      |63-year-old|
|17   |19 |B-Gender   |man        |
|64   |72 |B-Modifier |recurrent  |
|98   |107|B-Diagnosis|cellulitis |
|110  |119|B-Diagnosis|pneumonias |
+-----+---+-----------+-----------+
-
case class
LabelSpec(name: String, description: Option[String] = None) extends Product with Serializable
Specification for classification label from DSL string.
Specification for classification label from DSL string.
Format: "label" or "label::Description"
Examples:
- "positive" → LabelSpec("positive", None)
- "positive::Positive sentiment" → LabelSpec("positive", Some("Positive sentiment"))
- name
Label name
- description
Optional description
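The label DSL is the simplest of the spec formats; a hypothetical parser sketch matching the documented examples (not the library's internal one):

```scala
case class LabelSpec(name: String, description: Option[String] = None)

// Everything after the first "::" is the optional description.
def parseLabelSpec(spec: String): LabelSpec = spec.split("::", 2) match {
  case Array(name)       => LabelSpec(name)
  case Array(name, desc) => LabelSpec(name, Some(desc))
}
```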
-
class
MedicalNerApproach extends AnnotatorApproach[MedicalNerModel] with MedicalNerParams with NerApproach[MedicalNerApproach] with Logging with ParamsAndFeaturesWritable with EvaluationDLParams with CheckLicense
Trains generic NER models based on Neural Networks.
The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art in most datasets. For instantiated/pretrained models, see MedicalNerModel
The training data should be a labeled Spark Dataset in the CoNLL 2003 IOB format with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY. Excluding the label, this can be done with, for example, the annotators SentenceDetector, Tokenizer, and WordEmbeddingsModel (any embeddings can be chosen, e.g. BertEmbeddings for BERT-based embeddings).
For extended examples of usage, see the Spark NLP Workshop.
Notes
Both DocumentAssembler and SentenceDetector are annotators that output the DOCUMENT annotation type. Thus, either of them can be used as the first annotator in a pipeline.
Example
First extract the prerequisites for the MedicalNerApproach
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
Then define the NER annotator
val nerTagger = new MedicalNerApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setMaxEpochs(10)
  .setLr(0.005f)
  .setPo(0.005f)
  .setBatchSize(32)
  .setValidationSplit(0.1f)
Then the training can start
val pipeline = new Pipeline().setStages(Array(
  document,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerTagger
))

val trainingData = conll.readDataset(spark, "path/to/train_data.conll")
val pipelineModel = pipeline.fit(trainingData)
-
class
MedicalNerDLGraphChecker extends NerDLGraphChecker
Checks whether a suitable MedicalNerApproach graph is available for the given training dataset, before any computations/training is done. This annotator is useful for custom training cases, where specialized graphs might not be available and we want to check before embeddings are evaluated.
Important: This annotator should be used or positioned before any embedding or MedicalNerApproach annotators in the pipeline and will process the whole dataset to extract the required graph parameters.
This annotator requires a dataset with at least two columns: one with tokens and one with the labels. In addition, it requires the used embedding annotator in the pipeline to extract the suitable embedding dimension.
Example
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.jsl.annotator._
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline

// This CoNLL dataset already includes a sentence, token and label column
// with their respective annotator types.
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "PATH/TO/YOUR/TRAINING/DATA")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

// Requires the data for MedicalNerApproach graphs: text, tokens, labels and the embedding model
val nerDLGraphChecker = new MedicalNerDLGraphChecker()
  .setInputCols("sentence", "token")
  .setLabelColumn("label")
  .setEmbeddingsModel(embeddings)

val nerTagger = new MedicalNerApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setMaxEpochs(1)
  .setRandomSeed(42)
  .setVerbose(0)
  .setEarlyStoppingCriterion(0.50f)
  .setEnableOutputLogs(false)
  .setUseBestModel(true)

val pipeline = new Pipeline().setStages(
  Array(nerDLGraphChecker, embeddings, nerTagger))

// Will throw an exception if no suitable graph is found
val pipelineModel = pipeline.fit(trainingData)
- class MedicalNerModel extends AnnotatorModel[MedicalNerModel] with MedicalNerParams with HasBatchedAnnotate[MedicalNerModel] with WriteTensorflowModel with HasStorageRef with ParamsAndFeaturesWritable with CheckLicense
- trait MedicalNerParams extends Params with HasFeatures
- case class NamedEntityConfidence(start: Int, end: Int, entity: String, text: String, sentenceId: String, confidence: Option[Float]) extends Product with Serializable
-
class
NerChunker extends AnnotatorModel[NerChunker] with HasSimpleAnnotate[NerChunker]
Extracts phrases that fit into a known pattern using the NER tags. Useful for entity groups with neighboring tokens when there is no pretrained NER model to address certain issues. A regex needs to be provided to extract the tokens between entities.
Example
Defining pipeline stages for NER
val data = Seq("She has cystic cyst on her kidney.").toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setUseAbbreviations(false)

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models")
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")
  .setIncludeConfidence(true)
Define the NerChunker to combine to chunks
val chunker = new NerChunker()
  .setInputCols(Array("sentence", "ner"))
  .setOutputCol("ner_chunk")
  .setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))

val pipeline = new Pipeline().setStages(Array(
  documentAssembler, sentenceDetector, tokenizer, embeddings, ner, chunker
))

val result = pipeline.fit(data).transform(data)
Show results:
result.selectExpr("explode(arrays_zip(ner.metadata, ner.result))")
  .selectExpr("col['0'].word as word", "col['1'] as ner").show(truncate = false)

+------+-----------------+
|word  |ner              |
+------+-----------------+
|She   |O                |
|has   |O                |
|cystic|B-ImagingFindings|
|cyst  |I-ImagingFindings|
|on    |O                |
|her   |O                |
|kidney|B-BodyPart       |
|.     |O                |
+------+-----------------+

result.select("ner_chunk.result").show(truncate = false)

+---------------------------+
|result                     |
+---------------------------+
|[cystic cyst on her kidney]|
+---------------------------+
-
class
NerConverterInternal extends AnnotatorApproach[NerConverterInternalModel] with SourceTrackingMetadataParams with HasFeatures with FilteringParams with CheckLicense
Converts IOB or IOB2 representations of entities to a user-friendly one.
This is the AnnotatorApproach version of the NerConverterInternalModel annotator.
Chunks with no associated entity (tagged "O") are filtered.
This licensed annotator adds extra functionality to the open-source version by adding the following parameters:
blackList, greedyMode, threshold, and ignoreStopWords, which are not available in the open-source annotator.
See also Inside–outside–beginning (tagging) for more information.
Example
The output of a MedicalNerModel follows the Annotator schema and looks like this after the transformation.
result.selectExpr("explode(ner_result)").show(5, false)

+--------------------------------------------------------------------------+
|col                                                                       |
+--------------------------------------------------------------------------+
|{named_entity, 3, 3, O, {word -> A, confidence -> 0.994}, []}             |
|{named_entity, 5, 15, B-Age, {word -> 63-year-old, confidence -> 1.0}, []}|
|{named_entity, 17, 19, B-Gender, {word -> man, confidence -> 0.9858}, []} |
|{named_entity, 21, 28, O, {word -> presents, confidence -> 0.9952}, []}   |
|{named_entity, 30, 31, O, {word -> to, confidence -> 0.7063}, []}         |
+--------------------------------------------------------------------------+
After the converter is used:
result.selectExpr("explode(ner_converter_result)").show(5, false)

+-----------------------------------------------------------------------------------+
|col                                                                                |
+-----------------------------------------------------------------------------------+
|{chunk, 5, 15, 63-year-old, {entity -> Age, sentence -> 0, chunk -> 0}, []}        |
|{chunk, 17, 19, man, {entity -> Gender, sentence -> 0, chunk -> 1}, []}            |
|{chunk, 64, 72, recurrent, {entity -> Modifier, sentence -> 0, chunk -> 2}, []}    |
|{chunk, 98, 107, cellulitis, {entity -> Diagnosis, sentence -> 0, chunk -> 3}, []} |
|{chunk, 110, 119, pneumonias, {entity -> Diagnosis, sentence -> 0, chunk -> 4}, []}|
+-----------------------------------------------------------------------------------+
- See also
-
class
NerConverterInternalModel extends AnnotatorModel[NerConverterInternalModel] with HasSimpleAnnotate[NerConverterInternalModel] with SourceTrackingMetadataParams with FilteringParams with CheckLicense
Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities with their label. Chunks with no associated entity (tagged "O") are filtered. See also Inside–outside–beginning (tagging) for more information.
Example
The output of a MedicalNerModel follows the Annotator schema and looks like this after the transformation.
result.selectExpr("explode(ner_result)").show(5, false)

+--------------------------------------------------------------------------+
|col                                                                       |
+--------------------------------------------------------------------------+
|{named_entity, 3, 3, O, {word -> A, confidence -> 0.994}, []}             |
|{named_entity, 5, 15, B-Age, {word -> 63-year-old, confidence -> 1.0}, []}|
|{named_entity, 17, 19, B-Gender, {word -> man, confidence -> 0.9858}, []} |
|{named_entity, 21, 28, O, {word -> presents, confidence -> 0.9952}, []}   |
|{named_entity, 30, 31, O, {word -> to, confidence -> 0.7063}, []}         |
+--------------------------------------------------------------------------+
After the converter is used:
result.selectExpr("explode(ner_converter_result)").show(5, false)

+-----------------------------------------------------------------------------------+
|col                                                                                |
+-----------------------------------------------------------------------------------+
|{chunk, 5, 15, 63-year-old, {entity -> Age, sentence -> 0, chunk -> 0}, []}        |
|{chunk, 17, 19, man, {entity -> Gender, sentence -> 0, chunk -> 1}, []}            |
|{chunk, 64, 72, recurrent, {entity -> Modifier, sentence -> 0, chunk -> 2}, []}    |
|{chunk, 98, 107, cellulitis, {entity -> Diagnosis, sentence -> 0, chunk -> 3}, []} |
|{chunk, 110, 119, pneumonias, {entity -> Diagnosis, sentence -> 0, chunk -> 4}, []}|
+-----------------------------------------------------------------------------------+
- See also
-
class
NerTemplateRenderModel extends AnnotatorModel[NerTemplateRenderModel] with HasSimpleAnnotate[NerTemplateRenderModel] with CheckLicense
Renders a list of Spark NLP for Healthcare templates provided as a StringArrayParam.
The output of a NerTemplateRenderModel is Documents based on the provided templates.
- See also
-
class
PretrainedZeroShotMultiTask extends AnnotatorModel[PretrainedZeroShotMultiTask] with ParamsAndFeaturesWritable with HasBatchedAnnotate[PretrainedZeroShotMultiTask] with InternalWriteOnnxModel with WriteSentencePieceModel with HasEngine with CheckLicense
Zero-shot multi-task information extraction.
Performs four extraction tasks simultaneously from a single document in a single forward pass:
- **Named entity extraction** — spans of text matching a given type
- **Relation extraction** — (head, tail) span pairs for a given relation type
- **Classification** — document-level or sentence-level label assignment
- **Structured extraction** — structured records with typed fields extracted from text
All tasks are defined via a compact :: DSL, described below, and can be combined freely. Tasks are zero-shot: no fine-tuning is needed.
DSL syntax
Specifications use :: as a separator. Order of optional parts is flexible.
Entities
Each entry is a string: "name", "name::dtype", "name::description", or "name::dtype::description", where dtype is "list" (default, multiple spans) or "str" (single best span).
.setEntities(Array(
  "person",                                       // list of persons
  "company::str",                                 // single company
  "product::Names of products or services",       // list with description
  "price::str::Monetary value including currency" // single with dtype and description
))
Relations
Each entry is a string: "relation_name" or "relation_name::description". The model extracts (head, tail) span pairs for each relation type.
.setRelations(Array(
  "works_for",
  "located_in::The organization is physically located in the place"
))
Classifications
Each entry is a (taskSpec, Array[labelSpec]) tuple.
- Task spec: "task_name" (single-label) or "task_name::multi" (multi-label)
- Label spec: "label" or "label::description"
.setClassifications(Array(
  ("sentiment", Array("positive", "negative", "neutral")),
  ("topics::multi", Array("finance::Financial content", "technology", "politics"))
))
Structures
Each entry is a (structureName, Array[fieldSpec]) tuple. Fields use: "field_name", "field_name::dtype", "field_name::description", "field_name::dtype::description", or "field_name::[choice1|choice2]" (forces dtype = "str").
.setStructures(Array(
  ("product_info", Array(
    "name::str",
    "price::str::Price including currency symbol",
    "features::list",
    "availability::[in_stock|pre_order|sold_out]"
  ))
))
Output
All task results are returned in a single output column as Array[Annotation]:
- Entities → annotatorType = "chunk", result = span text, metadata contains entity, confidence, sentence
- Classifications → annotatorType = "category", result = label, metadata contains confidence, task, sentence
- Relations → annotatorType = "category", result = relation name, metadata contains chunk1, chunk2, entity1, entity2, entity1_begin, entity1_end, entity2_begin, entity2_end, chunk1_confidence, chunk2_confidence, sentence (compatible with com.johnsnowlabs.nlp.annotators.re.RelationExtractionDLModel output)
- Structures → annotatorType = "struct", result = structure name, metadata contains one key per field (value is JSON-encoded: object for str fields, array for list fields) plus instance_idx and sentence
Example
val zeroShot = PretrainedZeroShotMultiTask.pretrained()
  .setInputCols("document")
  .setOutputCol("extractions")
  .setEntities(Array("person", "company::str", "product::List of products"))
  .setClassifications(Array(("sentiment", Array("positive", "negative", "neutral"))))
  .setRelations(Array("works_for", "founded"))
  .setStructures(Array(
    ("invoice", Array("vendor::str", "amount::str", "items::list"))))
  .setEntityThreshold(0.5f)
  .setRelationThreshold(0.6f)
- class PretrainedZeroShotNER extends AnnotatorModel[PretrainedZeroShotNER] with ParamsAndFeaturesWritable with HasBatchedAnnotate[PretrainedZeroShotNER] with InternalWriteOnnxModel with WriteSentencePieceModel with WriteOpenvinoModel with HasEngine with CheckLicense
-
class
PretrainedZeroShotNERChunker extends PretrainedZeroShotNER
A fine-tuned zero-shot named-entity recognition (NER) model. Performs NER on arbitrary text without task-specific labeled training.
In contrast to PretrainedZeroShotNER this annotator directly outputs NER chunks instead of aligning them to provided tokens.
Example
val text =
  """
    |Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February
    |1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro
    |League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of
    |all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year
    |Awards, and four European Golden Shoes, the most by a European player.
    |""".stripMargin

val testData = Seq(text).toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val ner = PretrainedZeroShotNERChunker
  .pretrained()
  .setInputCols(Array("sentence"))
  .setOutputCol("ner_chunk")
  .setLabels(Array("person", "award", "date", "competitions", "teams"))

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, ner))
val results = pipeline.fit(testData).transform(testData)

results.selectExpr("explode(entity)").show(1000, truncate = false)
Results:
+--------------------------------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
|{chunk, 2, 37, Cristiano Ronaldo dos Santos Aveiro, {sentence -> 0, entity -> person, confidence -> 0.9144007, ner_source -> ner_chunk}, []}       |
|{chunk, 93, 109, 5 February\r\n1985, {sentence -> 1, entity -> date, confidence -> 0.99999976, ner_source -> ner_chunk}, []}                       |
|{chunk, 196, 213, Saudi Pro\r\nLeague, {sentence -> 1, entity -> competitions, confidence -> 0.9926515, ner_source -> ner_chunk}, []}              |
|{chunk, 219, 227, Al Nassr, {sentence -> 1, entity -> teams, confidence -> 0.99384415, ner_source -> ner_chunk}, []}                               |
|{chunk, 321, 328, Ronaldo, {sentence -> 2, entity -> person, confidence -> 0.999997, ner_source -> ner_chunk}, []}                                 |
|{chunk, 342, 353, Ballon d'Or, {sentence -> 2, entity -> award, confidence -> 0.95896983, ner_source -> ner_chunk}, []}                            |
|{chunk, 385, 422, UEFA Men's Player of the Year\r\nAwards, {sentence -> 2, entity -> award, confidence -> 0.9687164, ner_source -> ner_chunk}, []} |
|{chunk, 433, 454, European Golden Shoes, {sentence -> 2, entity -> award, confidence -> 0.999326, ner_source -> ner_chunk}, []}                    |
+--------------------------------------------------------------------------------------------------------------------------------------------------+
- See also
- trait ReadPretrainedZeroShotMultiTask extends ReadSentencePieceModel with InternalReadOnnxModel
- trait ReadPretrainedZeroShotNER extends ReadSentencePieceModel with InternalReadOnnxModel with ReadOpenvinoModel
- trait ReadPretrainedZeroShotNERChunker extends ReadSentencePieceModel with InternalReadOnnxModel with ReadOpenvinoModel
- trait ReadZeroShotNerTensorflowModel extends ReadTensorflowModel with InternalReadOnnxModel with ReadOpenvinoModel
- trait ReadablePretrainedMedicalNer extends ParamsAndFeaturesReadable[MedicalNerModel] with HasPretrained[MedicalNerModel]
- trait ReadablePretrainedPretrainedZeroShotMultiTask extends ParamsAndFeaturesReadable[PretrainedZeroShotMultiTask] with HasPretrained[PretrainedZeroShotMultiTask]
- trait ReadablePretrainedPretrainedZeroShotNER extends ParamsAndFeaturesReadable[PretrainedZeroShotNER] with HasPretrained[PretrainedZeroShotNER]
- trait ReadablePretrainedPretrainedZeroShotNERChunker extends ParamsAndFeaturesReadable[PretrainedZeroShotNERChunker] with HasPretrained[PretrainedZeroShotNERChunker]
- trait ReadablePretrainedZeroShotNer extends ParamsAndFeaturesReadable[ZeroShotNerModel] with HasPretrained[ZeroShotNerModel]
- trait ReadsMedicalNerGraph extends ParamsAndFeaturesReadable[MedicalNerModel] with ReadTensorflowModel
-
case class
RegexValidator(pattern: String, mode: String = "full", exclude: Boolean = false, caseInsensitive: Boolean = true) extends Product with Serializable
Regex-based span filter for post-processing entity extraction.
Regex-based span filter for post-processing entity extraction.
- pattern
The regex pattern string
- mode
Match mode: "full" matches the entire span, "partial" matches anywhere within it
- exclude
If true, inverts the match result
- caseInsensitive
If true, performs case-insensitive matching
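The documented mode, exclude, and caseInsensitive semantics can be illustrated with a plain-Scala sketch (this is not the library implementation; spanMatches is a hypothetical helper name):

```scala
import scala.util.matching.Regex

// Illustration of RegexValidator's documented behavior: "full" requires the
// entire span to match, "partial" accepts a match anywhere inside the span,
// and exclude = true inverts the result.
def spanMatches(
    pattern: String,
    span: String,
    mode: String = "full",
    exclude: Boolean = false,
    caseInsensitive: Boolean = true): Boolean = {
  val flags = if (caseInsensitive) "(?i)" else ""
  val regex = new Regex(flags + pattern)
  val hit = mode match {
    case "full"    => regex.pattern.matcher(span).matches()
    case "partial" => regex.findFirstIn(span).isDefined
    case other     => throw new IllegalArgumentException(s"unknown mode: $other")
  }
  if (exclude) !hit else hit
}

val fullHit    = spanMatches("\\d{4}", "2024")                      // whole span matches
val partialHit = spanMatches("\\d{4}", "year 2024", mode = "partial") // match inside span
val fullMiss   = spanMatches("\\d{4}", "year 2024")                 // "full" rejects extra text
```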
-
case class
RelationConfig(name: String, head: String = "", tail: String = "") extends Product with Serializable
Relation configuration in schema output.
Relation configuration in schema output.
- name
Relation name
- head
Head entity placeholder (defaults to the empty string)
- tail
Tail entity placeholder (defaults to the empty string)
-
case class
RelationMetadata(threshold: Option[Double]) extends ValidateFields with Product with Serializable
Metadata for relation extraction configuration.
Metadata for relation extraction configuration.
- threshold
Optional confidence threshold for extraction
-
case class
RelationSpec(name: String, description: Option[String] = None) extends Product with Serializable
Specification for relation extraction from DSL string.
Specification for relation extraction from DSL string.
Format: "relation_name" or "relation_name::description"
Examples:
- "works_for" → RelationSpec("works_for", None)
- "works_for::Employment relationship" → RelationSpec("works_for", Some("Employment relationship"))
- name
Relation name
- description
Optional description
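The "relation_name::description" format above can be sketched with a single split on the :: separator (a minimal illustration, not the library parser; parseRelationSpec is a hypothetical name):

```scala
// Split once on "::"; anything after the separator is the optional description.
def parseRelationSpec(spec: String): (String, Option[String]) =
  spec.split("::", 2) match {
    case Array(name, description) => (name.trim, Some(description.trim))
    case Array(name)              => (name.trim, None)
  }

val bare      = parseRelationSpec("works_for")
val described = parseRelationSpec("works_for::Employment relationship")
```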
-
class
StructureBuilder extends AnyRef
Builder for structured data schemas (JSON structures).
-
case class
StructureConfig(name: String, fields: ListMap[String, FieldValue]) extends Product with Serializable
JSON structure for structured data extraction.
JSON structure for structured data extraction.
- name
Structure name
- fields
Map of field names to field values
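A likely reason the fields are a ListMap rather than a plain Map is that ListMap preserves insertion order, so schema fields come out in declaration order. A small illustration (string values stand in for FieldValue):

```scala
import scala.collection.immutable.ListMap

// ListMap iterates entries in insertion order (Scala 2.13+), unlike the
// default immutable Map, whose iteration order is unspecified.
val invoiceFields = ListMap(
  "vendor" -> "str",
  "amount" -> "str",
  "items"  -> "list")

val orderedKeys = invoiceFields.keys.toList // declaration order preserved
```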
-
trait
ValidateFields extends AnyRef
Trait for types that contain an optional confidence threshold.
Trait for types that contain an optional confidence threshold.
Automatically validates threshold on construction.
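The validate-on-construction pattern this trait describes can be sketched as follows (an assumed shape, not the library source; ThresholdValidation and DemoMetadata are illustrative names):

```scala
import scala.util.Try

// The trait body runs a require when the mixing case class is constructed,
// so an out-of-range threshold is rejected immediately.
trait ThresholdValidation {
  def threshold: Option[Double]
  require(
    threshold.forall(t => t >= 0.0 && t <= 1.0),
    s"threshold must be within [0.0, 1.0], got $threshold")
}

case class DemoMetadata(threshold: Option[Double]) extends ThresholdValidation

val ok  = DemoMetadata(Some(0.5))      // constructs fine
val bad = Try(DemoMetadata(Some(1.5))) // fails the require at construction
```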
- trait WithMedicalNerGraphResolver extends AnyRef
-
class
ZeroShotNerModel extends RoBertaForQuestionAnswering with CheckLicense
ZeroShotNerModel implements zero-shot named entity recognition by utilizing RoBERTa transformer models fine-tuned on a question answering task.
ZeroShotNerModel implements zero-shot named entity recognition by utilizing RoBERTa transformer models fine-tuned on a question answering task. Its input is a list of document annotations, and it automatically generates the questions used to recognize entities. The definitions of the entities are given by a dictionary structure that specifies a set of questions for each entity. The model is based on RoBertaForQuestionAnswering from the open-source Spark NLP project.
Pretrained models can be loaded with the pretrained method of the companion object:
val zeroShotNer = ZeroShotNerModel.pretrained()
  .setInputCols("document")
  .setOutputCol("zero_shot_ner")
For available pretrained models please see the Models Hub.
Example
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentences")

val zeroShotNer = ZeroShotNerModel
  .pretrained()
  .setEntityDefinitions(
    Map(
      "NAME" -> Array("What is his name?", "What is her name?"),
      "CITY" -> Array("Which city?")))
  .setPredictionThreshold(0.01f)
  .setInputCols("sentences")
  .setOutputCol("zero_shot_ner")

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, sentenceDetector, zeroShotNer))

val model = pipeline.fit(Seq("").toDS.toDF("text"))
val results = model.transform(
  Seq("Clara often travels between New York and Paris.").toDS.toDF("text"))

results
  .selectExpr("document", "explode(zero_shot_ner) AS entity")
  .select(
    col("entity.result"),
    col("entity.metadata.word"),
    col("entity.metadata.sentence"),
    col("entity.begin"),
    col("entity.end"),
    col("entity.metadata.confidence"),
    col("entity.metadata.question"))
  .show(truncate = false)

+------+-----+--------+-----+---+----------+------------------+
|result|word |sentence|begin|end|confidence|question          |
+------+-----+--------+-----+---+----------+------------------+
|B-CITY|Paris|0       |41   |45 |0.78655756|Which is the city?|
|B-CITY|New  |0       |28   |30 |0.29346612|Which city?       |
|I-CITY|York |0       |32   |35 |0.29346612|Which city?       |
+------+-----+--------+-----+---+----------+------------------+
- See also
https://arxiv.org/abs/1907.11692 for details about the RoBERTa transformer
RoBertaForQuestionAnswering for the SparkNLP implementation of RoBERTa question answering
Value Members
- object EmptyField extends FieldValue with Product with Serializable
-
object
Gliner2DslParser
Parser for GLiNER2 DSL (Domain-Specific Language) specifications.
Parser for GLiNER2 DSL (Domain-Specific Language) specifications.
Provides methods to parse string specifications into structured objects for entities, classifications, relations, and structure fields.
All parsing methods follow the :: separator pattern for consistency.
Example usage:
val entitySpec = Gliner2DslParser.parseEntitySpec("person::Names of people")
val fieldSpec = Gliner2DslParser.parseFieldSpec("category::[electronics|software]::str")
- object Gliner2SpecialTokens
-
object
GlinerOptions
- Attributes
- protected[johnsnowlabs]
- object IOBTagger extends ParamsAndFeaturesReadable[IOBTagger] with Serializable
- object MedicalNerApproach extends DefaultParamsReadable[MedicalNerApproach] with WithMedicalNerGraphResolver with Serializable
- object MedicalNerDLGraphChecker extends ParamsAndFeaturesReadable[MedicalNerDLGraphChecker] with Serializable
- object MedicalNerModel extends ReadablePretrainedMedicalNer with ReadsMedicalNerGraph with Serializable
- object NerChunker extends DefaultParamsReadable[Chunker] with Serializable
- object NerConverterInternalModel extends ParamsAndFeaturesReadable[NerConverterInternalModel] with Serializable
- object NerTaggedInternal
-
object
NerTagsEncodingInternal
Works with different NER representations as tags. Supports IOB and IOB2: https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
- object NerTemplateRenderModel extends ParamsAndFeaturesReadable[NerTemplateRenderModel] with Serializable
- object PretrainedZeroShotMultiTask extends ReadablePretrainedPretrainedZeroShotMultiTask with ReadPretrainedZeroShotMultiTask with Serializable
- object PretrainedZeroShotNER extends ReadablePretrainedPretrainedZeroShotNER with ReadPretrainedZeroShotNER with Serializable
- object PretrainedZeroShotNERChunker extends ReadablePretrainedPretrainedZeroShotNERChunker with ReadPretrainedZeroShotNERChunker with Serializable
- object ZeroShotNerModel extends ReadablePretrainedZeroShotNer with ReadZeroShotNerTensorflowModel with Serializable