Packages

package ner

Type Members

  1. case class ClassificationConfig(task: String, labels: Seq[String], multiLabel: Boolean = false, clsThreshold: Double = 0.5, trueLabel: Seq[String] = Seq("N/A")) extends Product with Serializable

    Configuration for classification task in schema output.

    task

    Task name

    labels

    Sequence of label names

    multiLabel

    Whether this is a multi-label classification

    clsThreshold

    Confidence threshold

    trueLabel

    Labels that represent "true" in binary classification
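
    A minimal construction sketch using the signature above (the task names and values are illustrative, not taken from the library's examples):

    // Single-label sentiment task with three labels
    val sentiment = ClassificationConfig(
      task = "sentiment",
      labels = Seq("positive", "negative", "neutral")
    )
    // Multi-label variant with a stricter confidence threshold
    val topics = ClassificationConfig(
      task = "topics",
      labels = Seq("finance", "technology", "politics"),
      multiLabel = true,
      clsThreshold = 0.7
    )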

  2. case class ClassificationTaskSpec(name: String, multiLabel: Boolean = false) extends Product with Serializable

    Specification for classification task from DSL string.

    Format: "task_name" or "task_name::multi" for multi-label

    Examples:

    • "sentiment" → ClassificationTaskSpec("sentiment", false)
    • "intent::multi" → ClassificationTaskSpec("intent", true)
    name

    Task name

    multiLabel

    Whether this is a multi-label classification task

  3. case class EntityMetadata(dtype: String, threshold: Option[Double]) extends ValidateFields with Product with Serializable

    Metadata for entity extraction configuration.

    dtype

    Data type: "str" (single instance) or "list" (multiple instances)

    threshold

    Optional confidence threshold for extraction

  4. case class EntitySpec(name: String, dtype: String = "list", description: Option[String] = None) extends Product with Serializable

    Specification for entity extraction from DSL string.

    Format: "entity_name" or "entity_name::dtype::description"

    Examples:

    • "person" → EntitySpec("person", "list", None)
    • "person::Names of people" → EntitySpec("person", "list", Some("Names of people"))
    • "company::str::Single company name" → EntitySpec("company", "str", Some("Single company name"))
    name

    Entity name

    dtype

    Data type: "str" (single instance) or "list" (multiple instances)

    description

    Optional description

  5. case class FieldMetadata(dtype: String, threshold: Option[Double], choices: Option[Seq[String]], validators: Seq[RegexValidator]) extends ValidateFields with Product with Serializable

    Metadata for structure field configuration.

    dtype

    Data type: "str" (single value) or "list" (multiple values)

    threshold

    Optional confidence threshold for extraction

    choices

    Optional sequence of valid choices (constrains output)

    validators

    Sequence of regex validators for post-processing

  6. case class FieldSpec(name: String, dtype: String = "list", choices: Option[Seq[String]] = None, description: Option[String] = None) extends Product with Serializable

    Specification for structure field from DSL string.

    Format: "field_name::type::description" or "field_name::[choice1|choice2]::type::description"

    Examples:

    • "name" → FieldSpec("name", "list", None, None)
    • "price::str" → FieldSpec("price", "str", None, None)
    • "category::[electronics|software]" → FieldSpec("category", "str", Some(Seq("electronics", "software")), None)
    • "features::list::Product features" → FieldSpec("features", "list", None, Some("Product features"))
    name

    Field name

    dtype

    Data type: "str" (single value) or "list" (multiple values)

    choices

    Optional sequence of valid choices (forces dtype to "str")

    description

    Optional field description

  7. sealed trait FieldValue extends AnyRef

    Field value in a JSON structure - either a simple field or one with choices.

  8. case class FieldWithChoices(value: String, choices: Seq[String]) extends FieldValue with Product with Serializable
  9. case class Gliner2Classification(taskName: String, label: String, score: Float) extends Product with Serializable

    Classification result for one task.

    taskName

    Classification task name

    label

    Predicted label

    score

    Confidence score (0.0 to 1.0)

  10. case class Gliner2ClassifierInput(schemaEmbeddings: Array[Array[Float]], labelNames: Array[String], taskName: String, isMultiLabel: Boolean) extends Product with Serializable

    Input for the classification ONNX model.

    Used to classify text into one or more labels.

    schemaEmbeddings

    Schema embeddings (num_classes, 768)

    labelNames

    Label names for each class

    taskName

    Classification task name

    isMultiLabel

    Whether this is multi-label classification

  11. case class Gliner2Config(maxWidth: Int = 8, tokenPooling: String = "first") extends Serializable with Product

    Configuration for span generation and representation.

    maxWidth

    Maximum span width (number of whole tokens in a span)

    tokenPooling

    Pooling method for subword embeddings: "first", "mean", or "max"
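
    The pooling modes can be illustrated with a small sketch (a simplified stand-in for the annotator's internal logic, assuming one word split into several 768-dim subword embeddings):

    // Reduce the subword embeddings of one word to a single word-level vector.
    def pool(subwords: Array[Array[Float]], mode: String): Array[Float] = mode match {
      case "first" => subwords.head                                            // embedding of the first subword
      case "mean"  => subwords.transpose.map(dim => dim.sum / subwords.length) // element-wise average
      case "max"   => subwords.transpose.map(_.max)                            // element-wise maximum
      case other   => throw new IllegalArgumentException(s"Unknown pooling mode: $other")
    }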

  12. class Gliner2DataProcessor extends Serializable
  13. class Gliner2EmbeddingExtractor extends Serializable

    Extracts token and schema embeddings from encoder output. Mirrors Python's extract_embeddings_from_batch().

    This component:

    1. Uses mapped_indices to separate text tokens from schema tokens
    2. Aggregates subword embeddings into word-level embeddings
    3. Extracts special token embeddings for schema tasks
  14. case class Gliner2EncoderOutput(lastHiddenState: Array[Array[Array[Float]]], batch: Gliner2PreprocessedBatch) extends Product with Serializable

    Output from the encoder ONNX model. Contains raw embeddings plus metadata needed for downstream processing.

    lastHiddenState

    Encoder embeddings (batch, seq_len, 768)

    batch

    Original preprocessed batch for metadata

  15. class Gliner2ExtractionResult extends AnyRef

    Complete extraction result for one sample.

    Contains all extracted information across all task types.

  16. case class Gliner2PreprocessedBatch(inputIds: Array[Array[Long]], attentionMask: Array[Array[Long]], mappedIndices: Array[Array[(String, Int, Int)]], schemaCounts: Array[Int], originalLengths: Array[Int], taskTypes: Array[Array[String]], wordTokens: Array[Array[String]], schemaTokensList: Array[Array[Array[String]]], startMappings: Array[Array[Int]], endMappings: Array[Array[Int]], originalTexts: Array[String], originalSchemas: Array[Gliner2Schema], structureLabels: Array[Any] = Array.empty) extends Product with Serializable

    Batch of preprocessed inputs ready for the ONNX encoder. Maps 1:1 to Python's PreprocessedBatch.

    This is the output of Gliner2DataProcessor.prepareInputs() and the input to the ONNX encoder model.

    inputIds

    Token IDs for encoder input (batch, max_seq_len)

    attentionMask

    Attention mask for encoder (batch, max_seq_len)

    mappedIndices

    Token mappings: (seg_type, orig_idx, schema_idx)

    • seg_type: "schema" or "text"
    • orig_idx: Original token index in text or schema
    • schema_idx: Which schema this token belongs to (for schema tokens)
    schemaCounts

    Number of schemas per sample

    originalLengths

    Original sequence lengths per sample

    taskTypes

    Task types per schema per sample

    wordTokens

    Original text tokens per sample

    schemaTokensList

    Schema tokens per sample

    startMappings

    Token char start positions per sample

    endMappings

    Token char end positions per sample

    originalTexts

    Original text strings

    originalSchemas

    Original schema dictionaries

    structureLabels

    Ground truth labels (training only, can be empty for inference)

  17. case class Gliner2ProcessedEmbeddings(tokenEmbeddings: Array[Array[Float]], schemaEmbeddings: Array[Array[Array[Float]]], textTokens: Array[String], schemaTokensList: Array[Array[String]], taskTypes: Array[String], startMapping: Array[Int], endMapping: Array[Int], originalText: String, originalSchema: Gliner2Schema, sampleIndex: Int) extends Product with Serializable

    Per-sample embeddings extracted from encoder output. Splits aggregated token and schema embeddings and returns them separately.

    Python Reference: _extract_embeddings_from_batch().

    tokenEmbeddings

    Word-level text embeddings (text_len, 768)

    schemaEmbeddings

    Schema embeddings per task (num_schemas, num_tokens, 768)

    textTokens

    Original text tokens

    schemaTokensList

    Schema tokens per task

    taskTypes

    Task types (e.g., "entities", "classifications")

    startMapping

    Character start positions for tokens

    endMapping

    Character end positions for tokens

    originalText

    Original text string

    originalSchema

    Original schema dictionary

    sampleIndex

    Index of this sample in the batch

  18. case class Gliner2RelationResult(label: String, head: Gliner2SpanResult, tail: Gliner2SpanResult) extends Product with Serializable
  19. case class Gliner2Schema(structures: List[StructureConfig], classifications: List[ClassificationConfig], entities: ListMap[String, String], relations: List[RelationConfig], structureDescriptions: Map[String, ListMap[String, String]], entityDescriptions: ListMap[String, String], entityMetadata: Map[String, EntityMetadata] = Map.empty, entityOrder: Seq[String] = Seq.empty, relationMetadata: Map[String, RelationMetadata] = Map.empty, fieldOrders: Map[String, Seq[String]] = Map.empty, fieldMetadata: Map[String, FieldMetadata] = Map.empty) extends Product with Serializable

    Completed schema from the builder.

    structures

    List of JSON structures for structured data extraction

    classifications

    List of classification tasks

    entities

    Map of entity names to descriptions

    relations

    List of relation configurations

    structureDescriptions

    Map of structure names to field descriptions

    entityDescriptions

    Map of entity names to descriptions

    entityMetadata

    Per-entity dtype and threshold overrides

    entityOrder

    Ordered sequence of entity names for extraction

    relationMetadata

    Per-relation threshold overrides

    fieldOrders

    Per-structure/relation ordered field sequences

    fieldMetadata

    Per-field dtype, threshold, and validator overrides (keyed as "parent.field")
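
    For example, a per-field override for the "price" field of a "product_info" structure would be keyed as follows (illustrative values, using the FieldMetadata signature documented above):

    val fieldMetadata = Map(
      "product_info.price" -> FieldMetadata(
        dtype = "str",
        threshold = Some(0.7),
        choices = None,
        validators = Seq.empty
      )
    )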

  20. class Gliner2SchemaBuilder extends AnyRef

    Schema builder for extraction tasks. Formats the schema for inference with the ONNX model.

    Provides a fluent API for building extraction schemas that include:

    • Entity extraction configurations
    • Classification tasks
    • Relation extraction
    • Structured data extraction with fields
    • Validation rules

    Example:

    val schema = new Gliner2SchemaBuilder()
      .entities(Map(
        "person" -> "Names of people",
        "company" -> "Organization names"
      ))
      .classification("sentiment", List("positive", "negative", "neutral"))
      .relations(List("works_for", "founded"))
      .structure("product_info")
        .field("name", dtype = "str")
        .field("price", dtype = "str")
        .field("availability", choices = Some(List("in_stock", "out_of_stock")))
      .build()
  21. class Gliner2SpanGenerator extends Serializable

    Generates span indices and masks for span-based tasks.

    Creates all possible consecutive token sequences up to maxWidth. Invalid spans (extending beyond text length) are masked.
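
    A minimal sketch of that enumeration, assuming inclusive word indices (a simplified illustration, not the annotator's internal code):

    // All candidate spans of width 1..maxWidth, paired with a validity flag.
    def enumerateSpans(numWords: Int, maxWidth: Int): Seq[((Int, Int), Boolean)] =
      for {
        start <- 0 until numWords
        width <- 1 to maxWidth
        end    = start + width - 1
      } yield ((start, end), end < numWords) // spans past the text length are masked out
    // e.g. enumerateSpans(3, 2) yields 6 candidates; only (2, 3) is flagged invalid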

  22. case class Gliner2SpanInfo(spanIdx: Array[Array[Array[Long]]], spanMask: Array[Array[Boolean]], spanRep: Option[Array[Array[Array[Float]]]], numWords: Int, maxWidth: Int) extends Product with Serializable

    Span indices and representations for a single sample. Used for entity, relation, and structure extraction tasks.

    Spans are generated for all possible consecutive token sequences up to maxWidth. Invalid spans (extending beyond text length) are masked.

    spanIdx

    Span indices (num_words, max_width, 2) - start and end positions

    spanMask

    Validity mask (num_words, max_width) - true for valid spans

    spanRep

    Span representations from ONNX (num_words, max_width, 768) - optional

    numWords

    Number of words in the text

    maxWidth

    Maximum span width

  23. case class Gliner2SpanResult(label: String, startIdx: Int, endIdx: Int, startChar: Int, endChar: Int, score: Float, text: String, tokens: Array[String]) extends Product with Serializable

    Extracted span result (entity, relation, or structure field).

    Represents a single extracted span with position and score information.

    label

    Entity/relation/field type

    startIdx

    Start token index

    endIdx

    End token index (exclusive)

    startChar

    Start character position

    endChar

    End character position (exclusive)

    score

    Confidence score (0.0 to 1.0)

    text

    Extracted text

    tokens

    Tokens in span

  24. case class GlinerConfig(maxWidth: Int = 12, entToken: String = "<<ENT>>", entTokenId: Long = 128002L, sepToken: String = "<<SEP>>", sepTokenId: Long = 128003L, maxLen: Int = 384) extends Serializable with Product
  25. case class GlinerData(tokens: Array[String], tokenStarts: Array[Int], tokenEnds: Array[Int], tokenIds: Array[Long], tokenTypeIds: Array[Long], attentionMask: Array[Long], wordsMask: Array[Long], spanIdx: Array[Array[Long]], spanMask: Array[Boolean], textLength: Array[Long], idToClasses: Map[Long, String], classesToId: Map[String, Long]) extends Serializable with Product
  26. class GlinerDataProcessor extends Serializable
  27. trait GlinerModel extends Serializable
  28. case class GlinerResult(entity: String, start: Int, end: Int, score: Float, tokensInSpan: List[String]) extends Product with Serializable
  29. case class GraphInfo(path: String, fileTags: Int, fileEmbeddingsNDims: Int, fileNChars: Int) extends Product with Serializable
    Attributes
    protected
  30. class IOBTagger extends AnnotatorModel[IOBTagger] with CheckLicense with HasSimpleAnnotate[IOBTagger]

    Merges token tags and NER labels from chunks in the specified format. For example, the output columns of NerConverter and Tokenizer can be used as inputs to merge.

    Example

    Pipeline stages are defined up to the NER step, and the NER output is converted to chunks.

    val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
    val docAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
    val tokenizer = new Tokenizer().setInputCols("sentence").setOutputCol("token")
    val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setOutputCol("embs")
    val nerModel = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models").setInputCols("sentence", "token", "embs").setOutputCol("ner")
    val nerConverter = new NerConverter().setInputCols("sentence", "token", "ner").setOutputCol("ner_chunk")

    Define the IOB tagger, which needs tokens and chunks as input. Show results.

    val iobTagger = new IOBTagger().setInputCols("token", "ner_chunk").setOutputCol("ner_label")
    val pipeline = new Pipeline().setStages(Array(docAssembler, sentenceDetector, tokenizer, embeddings, nerModel, nerConverter, iobTagger))
    val result = pipeline.fit(data).transform(data)

    result.selectExpr("explode(ner_label) as a")
      .selectExpr("a.begin","a.end","a.result as chunk","a.metadata.word as word")
      .where("chunk!='O'").show(5, false)
    
    +-----+---+-----------+-----------+
    |begin|end|chunk      |word       |
    +-----+---+-----------+-----------+
    |5    |15 |B-Age      |63-year-old|
    |17   |19 |B-Gender   |man        |
    |64   |72 |B-Modifier |recurrent  |
    |98   |107|B-Diagnosis|cellulitis |
    |110  |119|B-Diagnosis|pneumonias |
    +-----+---+-----------+-----------+
    See also

    Tokenizer

    MedicalNerModel

    NerConverter

  31. case class LabelSpec(name: String, description: Option[String] = None) extends Product with Serializable

    Specification for classification label from DSL string.

    Format: "label" or "label::Description"

    Examples:

    • "positive" → LabelSpec("positive", None)
    • "positive::Positive sentiment" → LabelSpec("positive", Some("Positive sentiment"))
    name

    Label name

    description

    Optional description

  32. class MedicalNerApproach extends AnnotatorApproach[MedicalNerModel] with MedicalNerParams with NerApproach[MedicalNerApproach] with Logging with ParamsAndFeaturesWritable with EvaluationDLParams with CheckLicense

    Trains generic NER models based on neural networks.

    The architecture of the neural network is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets. For instantiated/pretrained models, see MedicalNerModel.

    The training data should be a labeled Spark Dataset, in the CoNLL 2003 IOB format with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.

    Excluding the label, this can be done with, for example, the annotators SentenceDetector, Tokenizer, and WordEmbeddingsModel (any embeddings can be chosen, e.g. BertEmbeddings for BERT based embeddings).

    For extended examples of usage, see the Spark NLP Workshop.

    Notes

    Both DocumentAssembler and SentenceDetector output the DOCUMENT annotation type. Thus, either of them can be used as the first annotator in a pipeline.

    Example

    First define the prerequisite stages for the MedicalNerApproach

    val document = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    val sentenceDetector = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("token")
    val embeddings = BertEmbeddings.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")

    Then define the NER annotator

    val nerTagger = new MedicalNerApproach()
      .setInputCols("sentence", "token", "embeddings")
      .setLabelColumn("label")
      .setOutputCol("ner")
      .setMaxEpochs(10)
      .setLr(0.005f)
      .setPo(0.005f)
      .setBatchSize(32)
      .setValidationSplit(0.1f)

    Then the training can start

    val pipeline = new Pipeline().setStages(Array(
      document,
      sentenceDetector,
      tokenizer,
      embeddings,
      nerTagger
    ))
    
    val conll = CoNLL()  // from com.johnsnowlabs.nlp.training
    val trainingData = conll.readDataset(spark, "path/to/train_data.conll")
    val pipelineModel = pipeline.fit(trainingData)
  33. class MedicalNerDLGraphChecker extends NerDLGraphChecker

    Checks whether a suitable MedicalNerApproach graph is available for the given training dataset, before any computations/training is done. This annotator is useful for custom training cases, where specialized graphs might not be available and we want to check before embeddings are evaluated.

    Important: This annotator should be positioned before any embedding or MedicalNerApproach annotators in the pipeline, and it will process the whole dataset to extract the required graph parameters.

    This annotator requires a dataset with at least two columns: one with tokens and one with the labels. In addition, it requires the embedding annotator used in the pipeline, in order to extract the suitable embedding dimension.

    Example

    import com.johnsnowlabs.nlp.annotator._
    import com.johnsnowlabs.nlp.jsl.annotator._
    import com.johnsnowlabs.nlp.training.CoNLL
    import org.apache.spark.ml.Pipeline
    
    // This CoNLL dataset already includes a sentence, token and label column with their respective annotator types.
    val conll = CoNLL()
    val trainingData = conll.readDataset(spark, "PATH/TO/YOUR/TRAINING/DATA")
    
    val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
    
    // Requires the data for MedicalNerApproach graphs: text, tokens, labels and the embedding model
    val nerDLGraphChecker = new MedicalNerDLGraphChecker()
      .setInputCols("sentence", "token")
      .setLabelColumn("label")
      .setEmbeddingsModel(embeddings)
    
    val nerTagger = new MedicalNerApproach()
      .setInputCols("sentence", "token", "embeddings")
      .setLabelColumn("label")
      .setOutputCol("ner")
      .setMaxEpochs(1)
      .setRandomSeed(42)
      .setVerbose(0)
      .setEarlyStoppingCriterion(0.50f)
      .setEnableOutputLogs(false)
      .setUseBestModel(true)
    
    val pipeline = new Pipeline().setStages(
      Array(nerDLGraphChecker, embeddings, nerTagger))
    
    // Will throw an exception if no suitable graph is found
    val pipelineModel = pipeline.fit(trainingData)
  34. class MedicalNerModel extends AnnotatorModel[MedicalNerModel] with MedicalNerParams with HasBatchedAnnotate[MedicalNerModel] with WriteTensorflowModel with HasStorageRef with ParamsAndFeaturesWritable with CheckLicense
  35. trait MedicalNerParams extends Params with HasFeatures
  36. case class NamedEntityConfidence(start: Int, end: Int, entity: String, text: String, sentenceId: String, confidence: Option[Float]) extends Product with Serializable
  37. class NerChunker extends AnnotatorModel[NerChunker] with HasSimpleAnnotate[NerChunker]

    Extracts phrases that fit a known pattern using the NER tags. Useful for entity groups with neighboring tokens when there is no pretrained NER model to address certain issues. A regex needs to be provided to extract the tokens between entities.

    Example

    Defining pipeline stages for NER

    val data = Seq("She has cystic cyst on her kidney.").toDF("text")
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
      .setUseAbbreviations(false)
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
      .setInputCols("sentence","token")
      .setOutputCol("embeddings")
      .setCaseSensitive(false)
    
    val ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models")
      .setInputCols("sentence","token","embeddings")
      .setOutputCol("ner")
      .setIncludeConfidence(true)

    Define the NerChunker to combine the entities into chunks

    val chunker = new NerChunker()
      .setInputCols(Array("sentence","ner"))
      .setOutputCol("ner_chunk")
      .setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      embeddings,
      ner,
      chunker
    ))
    
    val result = pipeline.fit(data).transform(data)

    Show results:

    result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")
      .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=false)
    +------+-----------------+
    |word  |ner              |
    +------+-----------------+
    |She   |O                |
    |has   |O                |
    |cystic|B-ImagingFindings|
    |cyst  |I-ImagingFindings|
    |on    |O                |
    |her   |O                |
    |kidney|B-BodyPart       |
    |.     |O                |
    +------+-----------------+
    
    result.select("ner_chunk.result").show(truncate=false)
    +---------------------------+
    |result                     |
    +---------------------------+
    |[cystic cyst on her kidney]|
    +---------------------------+
  38. class NerConverterInternal extends AnnotatorApproach[NerConverterInternalModel] with SourceTrackingMetadataParams with HasFeatures with FilteringParams with CheckLicense

    Converts IOB or IOB2 representations of entities to a user-friendly one.

    This is the AnnotatorApproach version of the NerConverterInternalModel annotator.

    Chunks with no associated entity (tagged "O") are filtered.

    This licensed annotator adds extra functionality to the open-source version by adding the following parameters: blackList, greedyMode, threshold, and ignoreStopWords that are not available in the open-source annotator.

    See also Inside–outside–beginning (tagging) for more information.

    Example

    The output of a MedicalNerModel follows the Annotator schema and looks like this after the transformation.

    result.selectExpr("explode(ner_result)").show(5, false)
    +--------------------------------------------------------------------------+
    |col                                                                       |
    +--------------------------------------------------------------------------+
    |{named_entity, 3, 3, O, {word -> A, confidence -> 0.994}, []}             |
    |{named_entity, 5, 15, B-Age, {word -> 63-year-old, confidence -> 1.0}, []}|
    |{named_entity, 17, 19, B-Gender, {word -> man, confidence -> 0.9858}, []} |
    |{named_entity, 21, 28, O, {word -> presents, confidence -> 0.9952}, []}   |
    |{named_entity, 30, 31, O, {word -> to, confidence -> 0.7063}, []}         |
    +--------------------------------------------------------------------------+

    After the converter is used:

    result.selectExpr("explode(ner_converter_result)").show(5, false)
    +-----------------------------------------------------------------------------------+
    |col                                                                                |
    +-----------------------------------------------------------------------------------+
    |{chunk, 5, 15, 63-year-old, {entity -> Age, sentence -> 0, chunk -> 0}, []}        |
    |{chunk, 17, 19, man, {entity -> Gender, sentence -> 0, chunk -> 1}, []}            |
    |{chunk, 64, 72, recurrent, {entity -> Modifier, sentence -> 0, chunk -> 2}, []}    |
    |{chunk, 98, 107, cellulitis, {entity -> Diagnosis, sentence -> 0, chunk -> 3}, []} |
    |{chunk, 110, 119, pneumonias, {entity -> Diagnosis, sentence -> 0, chunk -> 4}, []}|
    +-----------------------------------------------------------------------------------+
    See also

    MedicalNerModel

  39. class NerConverterInternalModel extends AnnotatorModel[NerConverterInternalModel] with HasSimpleAnnotate[NerConverterInternalModel] with SourceTrackingMetadataParams with FilteringParams with CheckLicense

    Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label. Chunks with no associated entity (tagged "O") are filtered. See also Inside–outside–beginning (tagging) for more information.

    Example

    The output of a MedicalNerModel follows the Annotator schema and looks like this after the transformation.

    result.selectExpr("explode(ner_result)").show(5, false)
    +--------------------------------------------------------------------------+
    |col                                                                       |
    +--------------------------------------------------------------------------+
    |{named_entity, 3, 3, O, {word -> A, confidence -> 0.994}, []}             |
    |{named_entity, 5, 15, B-Age, {word -> 63-year-old, confidence -> 1.0}, []}|
    |{named_entity, 17, 19, B-Gender, {word -> man, confidence -> 0.9858}, []} |
    |{named_entity, 21, 28, O, {word -> presents, confidence -> 0.9952}, []}   |
    |{named_entity, 30, 31, O, {word -> to, confidence -> 0.7063}, []}         |
    +--------------------------------------------------------------------------+

    After the converter is used:

    result.selectExpr("explode(ner_converter_result)").show(5, false)
    +-----------------------------------------------------------------------------------+
    |col                                                                                |
    +-----------------------------------------------------------------------------------+
    |{chunk, 5, 15, 63-year-old, {entity -> Age, sentence -> 0, chunk -> 0}, []}        |
    |{chunk, 17, 19, man, {entity -> Gender, sentence -> 0, chunk -> 1}, []}            |
    |{chunk, 64, 72, recurrent, {entity -> Modifier, sentence -> 0, chunk -> 2}, []}    |
    |{chunk, 98, 107, cellulitis, {entity -> Diagnosis, sentence -> 0, chunk -> 3}, []} |
    |{chunk, 110, 119, pneumonias, {entity -> Diagnosis, sentence -> 0, chunk -> 4}, []}|
    +-----------------------------------------------------------------------------------+
    See also

    MedicalNerModel

  40. class NerTemplateRenderModel extends AnnotatorModel[NerTemplateRenderModel] with HasSimpleAnnotate[NerTemplateRenderModel] with CheckLicense

    Renders a list of SparkNLP for Healthcare templates provided as a StringArrayParam.

    NerTemplateRenderModel outputs documents based on the provided templates.

  41. class PretrainedZeroShotMultiTask extends AnnotatorModel[PretrainedZeroShotMultiTask] with ParamsAndFeaturesWritable with HasBatchedAnnotate[PretrainedZeroShotMultiTask] with InternalWriteOnnxModel with WriteSentencePieceModel with HasEngine with CheckLicense

    Zero-shot multi-task information extraction.

    Performs four extraction tasks simultaneously from a single document in a single forward pass:

    • **Named entity extraction** — spans of text matching a given type
    • **Relation extraction** — (head, tail) span pairs for a given relation type
    • **Classification** — document-level or sentence-level label assignment
    • **Structured extraction** — structured records with typed fields extracted from text

    All tasks are defined via a compact :: DSL, described below, and can be combined freely. Tasks are zero-shot: no fine-tuning is needed.

    DSL syntax

    Specifications use :: as a separator. Order of optional parts is flexible.

    Entities

    Each entry is a string: "name", "name::dtype", "name::description", or "name::dtype::description" where dtype is "list" (default, multiple spans) or "str" (single best span).

    .setEntities(Array(
      "person",                                      // list of persons
      "company::str",                                // single company
      "product::Names of products or services",      // list with description
      "price::str::Monetary value including currency" // single with dtype and description
    ))
    Relations

    Each entry is a string: "relation_name" or "relation_name::description". The model extracts (head, tail) span pairs for each relation type.

    .setRelations(Array(
      "works_for",
      "located_in::The organization is physically located in the place"
    ))
    Classifications

    Each entry is a (taskSpec, Array[labelSpec]) tuple.

    • Task spec: "task_name" (single-label) or "task_name::multi" (multi-label)
    • Label spec: "label" or "label::description"
    .setClassifications(Array(
      ("sentiment", Array("positive", "negative", "neutral")),
      ("topics::multi", Array("finance::Financial content", "technology", "politics"))
    ))
    Structures

    Each entry is a (structureName, Array[fieldSpec]) tuple. Fields use: "field_name", "field_name::dtype", "field_name::description", "field_name::dtype::description", or "field_name::[choice1|choice2]" (forces dtype=str).

    .setStructures(Array(
      ("product_info", Array(
        "name::str",
        "price::str::Price including currency symbol",
        "features::list",
        "availability::[in_stock|pre_order|sold_out]"
      ))
    ))

    Output

    All task results are returned in a single output column as Array[Annotation]:

    • Entities → annotatorType = "chunk", result = span text, metadata contains entity, confidence, sentence
    • Classifications → annotatorType = "category", result = label, metadata contains confidence, task, sentence
    • Relations → annotatorType = "category", result = relation name, metadata contains chunk1, chunk2, entity1, entity2, entity1_begin, entity1_end, entity2_begin, entity2_end, chunk1_confidence, chunk2_confidence, sentence (compatible with com.johnsnowlabs.nlp.annotators.re.RelationExtractionDLModel output)
    • Structures → annotatorType = "struct", result = structure name, metadata contains one key per field (value is JSON-encoded: object for str fields, array for list fields) plus instance_idx and sentence

    Example

    val zeroShot = PretrainedZeroShotMultiTask.pretrained()
      .setInputCols("document")
      .setOutputCol("extractions")
      .setEntities(Array("person", "company::str", "product::List of products"))
      .setClassifications(Array(("sentiment", Array("positive", "negative", "neutral"))))
      .setRelations(Array("works_for", "founded"))
      .setStructures(Array(
        ("invoice", Array("vendor::str", "amount::str", "items::list"))))
      .setEntityThreshold(0.5f)
      .setRelationThreshold(0.6f)
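
    Assuming the annotator above is the last stage of a fitted pipeline (pipeline and data below are hypothetical placeholders for the surrounding stages and an input DataFrame with a text column), the combined results can be inspected per task type:

    val result = pipeline.fit(data).transform(data)
    result.selectExpr("explode(extractions) as e")
      .selectExpr("e.annotatorType as task_type", "e.result", "e.metadata")
      .show(truncate = false)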
  42. class PretrainedZeroShotNER extends AnnotatorModel[PretrainedZeroShotNER] with ParamsAndFeaturesWritable with HasBatchedAnnotate[PretrainedZeroShotNER] with InternalWriteOnnxModel with WriteSentencePieceModel with WriteOpenvinoModel with HasEngine with CheckLicense
  43. class PretrainedZeroShotNERChunker extends PretrainedZeroShotNER

    A fine-tuned zero-shot named-entity recognition (NER) model. Performs NER on arbitrary text without task-specific labeled training data.

    In contrast to PretrainedZeroShotNER, this annotator directly outputs NER chunks instead of aligning them to provided tokens.

    Example

    val text = """
            |Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February
            |1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro
            |League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of
            |all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year
            |Awards, and four European Golden Shoes, the most by a European player.
            """.stripMargin
    
    val testData = Seq(text).toDF("text")
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
    val ner = PretrainedZeroShotNERChunker
      .pretrained()
      .setInputCols(Array("sentence"))
      .setOutputCol("ner_chunk")
      .setLabels(Array("person", "award", "date", "competitions", "teams"))
    val pipeline =
      new Pipeline().setStages(Array(documentAssembler, sentenceDetector, ner))
    val results = pipeline.fit(testData).transform(testData)
    results.selectExpr("explode(entity)").show(1000, truncate = false)

    Results:

    +--------------------------------------------------------------------------------------------------------------------------------------------------+
    |col                                                                                                                                               |
    +--------------------------------------------------------------------------------------------------------------------------------------------------+
    |{chunk, 2, 37, Cristiano Ronaldo dos Santos Aveiro, {sentence -> 0, entity -> person, confidence -> 0.9144007, ner_source -> ner_chunk}, []}      |
    |{chunk, 93, 109, 5 February\r\n1985, {sentence -> 1, entity -> date, confidence -> 0.99999976, ner_source -> ner_chunk}, []}                      |
    |{chunk, 196, 213, Saudi Pro\r\nLeague, {sentence -> 1, entity -> competitions, confidence -> 0.9926515, ner_source -> ner_chunk}, []}             |
    |{chunk, 219, 227, Al Nassr, {sentence -> 1, entity -> teams, confidence -> 0.99384415, ner_source -> ner_chunk}, []}                              |
    |{chunk, 321, 328, Ronaldo, {sentence -> 2, entity -> person, confidence -> 0.999997, ner_source -> ner_chunk}, []}                                |
    |{chunk, 342, 353, Ballon d'Or, {sentence -> 2, entity -> award, confidence -> 0.95896983, ner_source -> ner_chunk}, []}                           |
    |{chunk, 385, 422, UEFA Men's Player of the Year\r\nAwards, {sentence -> 2, entity -> award, confidence -> 0.9687164, ner_source -> ner_chunk}, []}|
    |{chunk, 433, 454, European Golden Shoes, {sentence -> 2, entity -> award, confidence -> 0.999326, ner_source -> ner_chunk}, []}                   |
    +--------------------------------------------------------------------------------------------------------------------------------------------------+
    See also

    PretrainedZeroShotNER

  44. trait ReadPretrainedZeroShotMultiTask extends ReadSentencePieceModel with InternalReadOnnxModel
  45. trait ReadPretrainedZeroShotNER extends ReadSentencePieceModel with InternalReadOnnxModel with ReadOpenvinoModel
  46. trait ReadPretrainedZeroShotNERChunker extends ReadSentencePieceModel with InternalReadOnnxModel with ReadOpenvinoModel
  47. trait ReadZeroShotNerTensorflowModel extends ReadTensorflowModel with InternalReadOnnxModel with ReadOpenvinoModel
  48. trait ReadablePretrainedMedicalNer extends ParamsAndFeaturesReadable[MedicalNerModel] with HasPretrained[MedicalNerModel]
  49. trait ReadablePretrainedPretrainedZeroShotMultiTask extends ParamsAndFeaturesReadable[PretrainedZeroShotMultiTask] with HasPretrained[PretrainedZeroShotMultiTask]
  50. trait ReadablePretrainedPretrainedZeroShotNER extends ParamsAndFeaturesReadable[PretrainedZeroShotNER] with HasPretrained[PretrainedZeroShotNER]
  51. trait ReadablePretrainedPretrainedZeroShotNERChunker extends ParamsAndFeaturesReadable[PretrainedZeroShotNERChunker] with HasPretrained[PretrainedZeroShotNERChunker]
  52. trait ReadablePretrainedZeroShotNer extends ParamsAndFeaturesReadable[ZeroShotNerModel] with HasPretrained[ZeroShotNerModel]
  53. trait ReadsMedicalNerGraph extends ParamsAndFeaturesReadable[MedicalNerModel] with ReadTensorflowModel
  54. case class RegexValidator(pattern: String, mode: String = "full", exclude: Boolean = false, caseInsensitive: Boolean = true) extends Product with Serializable

    Regex-based span filter for post-processing entity extraction.

    pattern

    The regex pattern as a string

    mode

    Match mode: "full" for fullmatch, "partial" for search

    exclude

    If true, inverts the match result

    caseInsensitive

    If true, performs case-insensitive matching
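
    A sketch of how such a validator could be applied to a candidate span text, mirroring the documented semantics (illustrative, not the library's internal code):

    import java.util.regex.Pattern

    def accepts(validator: RegexValidator, text: String): Boolean = {
      val flags   = if (validator.caseInsensitive) Pattern.CASE_INSENSITIVE else 0
      val matcher = Pattern.compile(validator.pattern, flags).matcher(text)
      val matched = if (validator.mode == "full") matcher.matches() else matcher.find()
      if (validator.exclude) !matched else matched // exclude inverts the result
    }
    // e.g. accepts(RegexValidator("""\d{4}"""), "2024") == true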

  55. case class RelationConfig(name: String, head: String = "", tail: String = "") extends Product with Serializable

    Relation configuration in schema output.

    name

    Relation name

    head

    Head entity placeholder (empty string)

    tail

    Tail entity placeholder (empty string)

  56. case class RelationMetadata(threshold: Option[Double]) extends ValidateFields with Product with Serializable

    Metadata for relation extraction configuration.

    threshold

    Optional confidence threshold for extraction

  57. case class RelationSpec(name: String, description: Option[String] = None) extends Product with Serializable

    Specification for relation extraction from DSL string.

    Format: "relation_name" or "relation_name::description"

    Examples:

    • "works_for" → RelationSpec("works_for", None)
    • "works_for::Employment relationship" → RelationSpec("works_for", Some("Employment relationship"))
    name

    Relation name

    description

    Optional description

  58. class StructureBuilder extends AnyRef

    Builder for structured data schemas (JSON structures).

  59. case class StructureConfig(name: String, fields: ListMap[String, FieldValue]) extends Product with Serializable

    JSON structure for structured data extraction.

    name

    Structure name

    fields

    Map of field names to field values

  60. trait ValidateFields extends AnyRef

    Trait for types that contain an optional confidence threshold.

    Automatically validates threshold on construction.
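
    A plausible shape for that check (a sketch of the contract, not necessarily the actual implementation; it assumes the threshold is a constructor parameter of the implementing case class, so it is initialized before the trait body runs):

    trait ValidateFields {
      def threshold: Option[Double]
      // Fails fast at construction time if a threshold is outside [0.0, 1.0].
      require(
        threshold.forall(t => t >= 0.0 && t <= 1.0),
        s"threshold must be in [0.0, 1.0], got $threshold"
      )
    }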

  61. trait WithMedicalNerGraphResolver extends AnyRef
  62. class ZeroShotNerModel extends RoBertaForQuestionAnswering with CheckLicense

    ZeroShotNerModel implements zero-shot named entity recognition by utilizing RoBERTa transformer models fine-tuned on a question answering task. Its input is a list of document annotations, and it automatically generates questions which are used to recognize entities. The definition of entities is given by a dictionary structure, specifying a set of questions for each entity. The model is based on RoBertaForQuestionAnswering from the open-source SparkNLP project.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val zeroShotNer = ZeroShotNerModel.pretrained()
      .setInputCols("document")
      .setOutputCol("zero_shot_ner")

    For available pretrained models please see the Models Hub.

    Example

     val documentAssembler = new DocumentAssembler()
       .setInputCol("text")
       .setOutputCol("document")
    
     val sentenceDetector = new SentenceDetector()
       .setInputCols(Array("document"))
       .setOutputCol("sentences")
    
     val zeroShotNer = ZeroShotNerModel
       .pretrained()
       .setEntityDefinitions(
         Map(
           "NAME" -> Array("What is his name?", "What is her name?"),
           "CITY" -> Array("Which city?")))
       .setPredictionThreshold(0.01f)
       .setInputCols("sentences")
       .setOutputCol("zero_shot_ner")
    
     val pipeline = new Pipeline()
       .setStages(Array(
         documentAssembler,
         sentenceDetector,
         zeroShotNer))
    
     val model = pipeline.fit(Seq("").toDS.toDF("text"))
     val results = model.transform(
       Seq("Clara often travels between New York and Paris.").toDS.toDF("text"))
    
     results
       .selectExpr("document", "explode(zero_shot_ner) AS entity")
       .select(
         col("entity.result"),
         col("entity.metadata.word"),
         col("entity.metadata.sentence"),
         col("entity.begin"),
         col("entity.end"),
         col("entity.metadata.confidence"),
         col("entity.metadata.question"))
       .show(truncate=false)
    
    +------+-----+--------+-----+---+----------+------------------+
    |result|word |sentence|begin|end|confidence|question          |
    +------+-----+--------+-----+---+----------+------------------+
    |B-CITY|Paris|0       |41   |45 |0.78655756|Which is the city?|
    |B-CITY|New  |0       |28   |30 |0.29346612|Which city?       |
    |I-CITY|York |0       |32   |35 |0.29346612|Which city?       |
    +------+-----+--------+-----+---+----------+------------------+
    See also

    https://arxiv.org/abs/1907.11692 for details about the RoBERTa transformer

    RoBertaForQuestionAnswering for the SparkNLP implementation of RoBERTa question answering

Value Members

  1. object EmptyField extends FieldValue with Product with Serializable
  2. object Gliner2DslParser

    Parser for GLiNER2 DSL (Domain-Specific Language) specifications.

    Provides methods to parse string specifications into structured objects for entities, classifications, relations, and structure fields.

    All parsing methods follow the :: separator pattern for consistency.

    Example usage:

    val entitySpec = Gliner2DslParser.parseEntitySpec("person::Names of people")
    val fieldSpec = Gliner2DslParser.parseFieldSpec("category::[electronics|software]::str")
  3. object Gliner2SpecialTokens
  4. object GlinerOptions
    Attributes
    protected[johnsnowlabs]
  5. object IOBTagger extends ParamsAndFeaturesReadable[IOBTagger] with Serializable
  6. object MedicalNerApproach extends DefaultParamsReadable[MedicalNerApproach] with WithMedicalNerGraphResolver with Serializable
  7. object MedicalNerDLGraphChecker extends ParamsAndFeaturesReadable[MedicalNerDLGraphChecker] with Serializable
  8. object MedicalNerModel extends ReadablePretrainedMedicalNer with ReadsMedicalNerGraph with Serializable
  9. object NerChunker extends DefaultParamsReadable[Chunker] with Serializable
  10. object NerConverterInternalModel extends ParamsAndFeaturesReadable[NerConverterInternalModel] with Serializable
  11. object NerTaggedInternal
  12. object NerTagsEncodingInternal

    Works with different NER representations as tags. Supports IOB and IOB2: https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)

  13. object NerTemplateRenderModel extends ParamsAndFeaturesReadable[NerTemplateRenderModel] with Serializable
  14. object PretrainedZeroShotMultiTask extends ReadablePretrainedPretrainedZeroShotMultiTask with ReadPretrainedZeroShotMultiTask with Serializable
  15. object PretrainedZeroShotNER extends ReadablePretrainedPretrainedZeroShotNER with ReadPretrainedZeroShotNER with Serializable
  16. object PretrainedZeroShotNERChunker extends ReadablePretrainedPretrainedZeroShotNERChunker with ReadPretrainedZeroShotNERChunker with Serializable
  17. object ZeroShotNerModel extends ReadablePretrainedZeroShotNer with ReadZeroShotNerTensorflowModel with Serializable
