package ner

Type Members

  1. case class GlinerConfig(maxWidth: Int = 12, entToken: String = "<<END>>", entTokenId: Long = 128002L, sepToken: String = "<<SEP>>", sepTokenId: Long = 128003L) extends Product with Serializable
  2. case class GlinerData(tokens: Array[String], tokenStarts: Array[Int], tokenEnds: Array[Int], tokenIds: Array[Long], tokenTypeIds: Array[Long], attentionMask: Array[Long], wordsMask: Array[Long], spanIdx: Array[Array[Long]], spanMask: Array[Boolean], textLength: Array[Long], idToClasses: Map[Long, String], classesToId: Map[String, Long]) extends Product with Serializable
  3. class GlinerDataProcessor extends AnyRef
  4. class IOBTagger extends AnnotatorModel[IOBTagger] with CheckLicense with HasSimpleAnnotate[IOBTagger]

    Merges token tags and NER labels from chunks in the specified format. For example, the output columns of NerConverter and Tokenizer can be used as its inputs.

    Example

    First, define the pipeline stages that run NER and convert the NER output to chunks.

    val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
    val docAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
    val tokenizer = new Tokenizer().setInputCols("sentence").setOutputCol("token")
    val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols("sentence", "token").setOutputCol("embs")
    val nerModel = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models").setInputCols("sentence", "token", "embs").setOutputCol("ner")
    val nerConverter = new NerConverter().setInputCols("sentence", "token", "ner").setOutputCol("ner_chunk")

    Define the IOB tagger, which needs tokens and chunks as input. Show results.

    val iobTagger = new IOBTagger().setInputCols("token", "ner_chunk").setOutputCol("ner_label")
    val pipeline = new Pipeline().setStages(Array(docAssembler, sentenceDetector, tokenizer, embeddings, nerModel, nerConverter, iobTagger))
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(ner_label) as a")
      .selectExpr("a.begin","a.end","a.result as chunk","a.metadata.word as word")
      .where("chunk!='O'").show(5, false)
    
    +-----+---+-----------+-----------+
    |begin|end|chunk      |word       |
    +-----+---+-----------+-----------+
    |5    |15 |B-Age      |63-year-old|
    |17   |19 |B-Gender   |man        |
    |64   |72 |B-Modifier |recurrent  |
    |98   |107|B-Diagnosis|cellulitis |
    |110  |119|B-Diagnosis|pneumonias |
    +-----+---+-----------+-----------+
    See also

    Tokenizer

    MedicalNerModel

    NerConverter
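
    The merge that IOBTagger performs can be pictured with plain Scala collections. The following is a minimal sketch, not the library's implementation: the `Token` and `Chunk` case classes and the `tag` function are hypothetical names, and it assumes that a chunk covers a token when the token's character span lies inside the chunk's span.

```scala
object IobTagSketch {
  // Hypothetical stand-ins for the TOKEN and CHUNK annotations (character offsets are inclusive).
  final case class Token(begin: Int, end: Int, word: String)
  final case class Chunk(begin: Int, end: Int, entity: String)

  // Labels each token with B-<entity>, I-<entity>, or O depending on the chunk that covers it.
  def tag(tokens: Seq[Token], chunks: Seq[Chunk]): Seq[(String, String)] =
    tokens.map { t =>
      val label = chunks.find(c => c.begin <= t.begin && t.end <= c.end) match {
        case Some(c) if c.begin == t.begin => s"B-${c.entity}" // first token of the chunk
        case Some(c)                       => s"I-${c.entity}" // later token of the chunk
        case None                          => "O"              // token outside every chunk
      }
      (t.word, label)
    }
}
```

    With the spans from the example above, `tag` labels `63-year-old` as `B-Age` and `man` as `B-Gender`, and it labels tokens that no chunk covers as `O`.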

  5. class MedicalNerApproach extends AnnotatorApproach[MedicalNerModel] with MedicalNerParams with NerApproach[MedicalNerApproach] with Logging with ParamsAndFeaturesWritable with EvaluationDLParams with CheckLicense

    Trains generic NER models based on Neural Networks.

    The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art results on most datasets. For instantiated/pretrained models, see MedicalNerModel.

    The training data should be a labeled Spark Dataset, in the CoNLL 2003 IOB format with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.

    Excluding the label, this can be done with, for example, the annotators SentenceDetector, Tokenizer, and WordEmbeddingsModel (any embeddings can be chosen, e.g. BertEmbeddings for BERT based embeddings).
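
    To make the expected input concrete, here is a minimal sketch of reading CoNLL 2003 text in plain Scala. It assumes the usual layout of that format: one token per line with whitespace-separated columns, the NER label in the last column, blank lines between sentences, and `-DOCSTART-` lines between documents. The `ConllSketch` object, its `parse` function, and the sample lines (including their POS and chunk columns) are illustrative only; in a real pipeline the CoNLL reader shown in the example below does this work.

```scala
object ConllSketch {
  type Sentence = List[(String, String)] // (token, NER label) pairs

  // Splits CoNLL 2003 text into sentences, keeping the first and last column of each line.
  def parse(text: String): List[Sentence] = {
    val lines = text.split("\n").toList.map(_.trim)
    val (done, open) = lines.foldLeft((List.empty[Sentence], List.empty[(String, String)])) {
      // A blank line or a document marker closes the sentence being built.
      case ((acc, cur), line) if line.isEmpty || line.startsWith("-DOCSTART-") =>
        (if (cur.nonEmpty) acc :+ cur else acc, Nil)
      // Any other line contributes one token and its label.
      case ((acc, cur), line) =>
        val cols = line.split("\\s+")
        (acc, cur :+ (cols.head -> cols.last))
    }
    if (open.nonEmpty) done :+ open else done
  }

  // Illustrative sample; the POS and chunk columns are made up for the sketch.
  val sample: String =
    """-DOCSTART- -X- -X- O
      |
      |A DT B-NP O
      |63-year-old JJ I-NP B-Age
      |man NN I-NP B-Gender
      |""".stripMargin
}
```

    Calling `ConllSketch.parse(ConllSketch.sample)` yields a single sentence whose tokens carry the labels `O`, `B-Age`, and `B-Gender`.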

    For extended examples of usage, see the Spark NLP Workshop.

    Notes

    Both the DocumentAssembler and SentenceDetector annotators output the DOCUMENT annotation type, so either of them can provide the DOCUMENT input for this annotator.

    Example

    First, define the prerequisite stages for the MedicalNerApproach:

    val document = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    val sentenceDetector = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    val tokenizer = new Tokenizer()
      .setInputCols("sentence")
      .setOutputCol("token")
    val embeddings = BertEmbeddings.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")

    Then define the NER annotator

    val nerTagger = new MedicalNerApproach()
      .setInputCols("sentence", "token", "embeddings")
      .setLabelColumn("label")
      .setOutputCol("ner")
      .setMaxEpochs(10)
      .setLr(0.005f)
      .setPo(0.005f)
      .setBatchSize(32)
      .setValidationSplit(0.1f)

    Then the training can start

    val pipeline = new Pipeline().setStages(Array(
      document,
      sentenceDetector,
      tokenizer,
      embeddings,
      nerTagger
    ))
    
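    // Assumes an active SparkSession named `spark`; CoNLL is com.johnsnowlabs.nlp.training.CoNLL
    val trainingData = CoNLL().readDataset(spark, "path/to/train_data.conll")
    val pipelineModel = pipeline.fit(trainingData)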
  6. class MedicalNerModel extends AnnotatorModel[MedicalNerModel] with MedicalNerParams with HasBatchedAnnotate[MedicalNerModel] with WriteTensorflowModel with HasStorageRef with ParamsAndFeaturesWritable with CheckLicense
  7. trait MedicalNerParams extends Params with HasFeatures
  8. case class NamedEntityConfidence(start: Int, end: Int, entity: String, text: String, sentenceId: String, confidence: Option[Float]) extends Product with Serializable
  9. class NerChunker extends AnnotatorModel[NerChunker] with HasSimpleAnnotate[NerChunker]

    Extracts phrases that fit a known pattern using the NER tags. Useful for grouping entities with neighboring tokens when no pretrained NER model covers the case. A regex must be provided to describe the tag sequence to extract, including the tokens between entities.

    Example

    Defining pipeline stages for NER

    val data = Seq("She has cystic cyst on her kidney.").toDF("text")
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentenceDetector = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
      .setUseAbbreviations(false)
    
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
    
    val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
      .setInputCols("sentence","token")
      .setOutputCol("embeddings")
      .setCaseSensitive(false)
    
    val ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models")
      .setInputCols("sentence","token","embeddings")
      .setOutputCol("ner")
      .setIncludeConfidence(true)

    Define the NerChunker to combine the tagged tokens into chunks

    val chunker = new NerChunker()
      .setInputCols(Array("sentence","ner"))
      .setOutputCol("ner_chunk")
      .setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      embeddings,
      ner,
      chunker
    ))
    
    val result = pipeline.fit(data).transform(data)

    Show results:

    result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")
      .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=false)
    +------+-----------------+
    |word  |ner              |
    +------+-----------------+
    |She   |O                |
    |has   |O                |
    |cystic|B-ImagingFindings|
    |cyst  |I-ImagingFindings|
    |on    |O                |
    |her   |O                |
    |kidney|B-BodyPart       |
    |.     |O                |
    +------+-----------------+
    
    result.select("ner_chunk.result").show(truncate=false)
    +---------------------------+
    |result                     |
    +---------------------------+
    |[cystic cyst on her kidney]|
    +---------------------------+
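
    One way to picture how a pattern such as `<ImagingFindings>.*<BodyPart>` selects a chunk is to write the tags out as a single string and run the regex over it. The sketch below does this in plain Scala. It is an assumption about the general technique, not the NerChunker implementation, and `Tagged`, `entityOf`, and `chunks` are hypothetical names.

```scala
object NerChunkSketch {
  final case class Tagged(word: String, tag: String)

  // Drops the IOB prefix: "B-BodyPart" -> "BodyPart", "O" -> "O".
  def entityOf(tag: String): String =
    if (tag.startsWith("B-") || tag.startsWith("I-")) tag.drop(2) else tag

  // Returns the text of every token run whose tag sequence matches the pattern.
  def chunks(tokens: Seq[Tagged], pattern: String): List[String] = {
    // One "<Entity>" marker per token, e.g. "<O><O><ImagingFindings>...".
    val markers = tokens.map(t => s"<${entityOf(t.tag)}>")
    // Character offset at which each marker starts, followed by the total length.
    val starts = markers.scanLeft(0)(_ + _.length)
    pattern.r.findAllMatchIn(markers.mkString).map { m =>
      val first = starts.lastIndexWhere(_ <= m.start) // token containing the match start
      val until = starts.indexWhere(_ >= m.end)       // first token past the match end
      tokens.slice(first, until).map(_.word).mkString(" ")
    }.toList
  }

  // The tagged sentence from the example above.
  val sentence: Seq[Tagged] = Seq(
    Tagged("She", "O"), Tagged("has", "O"),
    Tagged("cystic", "B-ImagingFindings"), Tagged("cyst", "I-ImagingFindings"),
    Tagged("on", "O"), Tagged("her", "O"),
    Tagged("kidney", "B-BodyPart"), Tagged(".", "O"))
}
```

    For the tagged sentence above, `NerChunkSketch.chunks(NerChunkSketch.sentence, "<ImagingFindings>.*<BodyPart>")` returns the single phrase `cystic cyst on her kidney`, which agrees with the `ner_chunk` result shown.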
  10. class NerConverterInternal extends AnnotatorApproach[NerConverterInternalModel] with SourceTrackingMetadataParams with HasFeatures with FilteringParams with CheckLicense

    Converts IOB or IOB2 representations of entities to a user-friendly one.

    This is the AnnotatorApproach version of the NerConverterInternalModel annotator.

    Chunks with no associated entity (tagged "O") are filtered.

    This licensed annotator extends the open-source version with the following parameters, which are not available in the open-source annotator: blackList, greedyMode, threshold, and ignoreStopWords.

    See also Inside–outside–beginning (tagging) for more information.

    Example

    The output of a MedicalNerModel follows the Annotator schema and looks like this after the transformation:

    result.selectExpr("explode(ner_result)").show(5, false)
    +--------------------------------------------------------------------------+
    |col                                                                       |
    +--------------------------------------------------------------------------+
    |{named_entity, 3, 3, O, {word -> A, confidence -> 0.994}, []}             |
    |{named_entity, 5, 15, B-Age, {word -> 63-year-old, confidence -> 1.0}, []}|
    |{named_entity, 17, 19, B-Gender, {word -> man, confidence -> 0.9858}, []} |
    |{named_entity, 21, 28, O, {word -> presents, confidence -> 0.9952}, []}   |
    |{named_entity, 30, 31, O, {word -> to, confidence -> 0.7063}, []}         |
    +--------------------------------------------------------------------------+

    After the converter is used:

    result.selectExpr("explode(ner_converter_result)").show(5, false)
    +-----------------------------------------------------------------------------------+
    |col                                                                                |
    +-----------------------------------------------------------------------------------+
    |{chunk, 5, 15, 63-year-old, {entity -> Age, sentence -> 0, chunk -> 0}, []}        |
    |{chunk, 17, 19, man, {entity -> Gender, sentence -> 0, chunk -> 1}, []}            |
    |{chunk, 64, 72, recurrent, {entity -> Modifier, sentence -> 0, chunk -> 2}, []}    |
    |{chunk, 98, 107, cellulitis, {entity -> Diagnosis, sentence -> 0, chunk -> 3}, []} |
    |{chunk, 110, 119, pneumonias, {entity -> Diagnosis, sentence -> 0, chunk -> 4}, []}|
    +-----------------------------------------------------------------------------------+
    See also

    MedicalNerModel

  11. class NerConverterInternalModel extends AnnotatorModel[NerConverterInternalModel] with HasSimpleAnnotate[NerConverterInternalModel] with SourceTrackingMetadataParams with FilteringParams with CheckLicense

    Converts an IOB or IOB2 representation of NER to a user-friendly one by associating the tokens of recognized entities with their labels. Chunks with no associated entity (tagged "O") are filtered. See also Inside–outside–beginning (tagging) for more information.

    Example

    The output of a MedicalNerModel follows the Annotator schema and looks like this after the transformation:

    result.selectExpr("explode(ner_result)").show(5, false)
    +--------------------------------------------------------------------------+
    |col                                                                       |
    +--------------------------------------------------------------------------+
    |{named_entity, 3, 3, O, {word -> A, confidence -> 0.994}, []}             |
    |{named_entity, 5, 15, B-Age, {word -> 63-year-old, confidence -> 1.0}, []}|
    |{named_entity, 17, 19, B-Gender, {word -> man, confidence -> 0.9858}, []} |
    |{named_entity, 21, 28, O, {word -> presents, confidence -> 0.9952}, []}   |
    |{named_entity, 30, 31, O, {word -> to, confidence -> 0.7063}, []}         |
    +--------------------------------------------------------------------------+

    After the converter is used:

    result.selectExpr("explode(ner_converter_result)").show(5, false)
    +-----------------------------------------------------------------------------------+
    |col                                                                                |
    +-----------------------------------------------------------------------------------+
    |{chunk, 5, 15, 63-year-old, {entity -> Age, sentence -> 0, chunk -> 0}, []}        |
    |{chunk, 17, 19, man, {entity -> Gender, sentence -> 0, chunk -> 1}, []}            |
    |{chunk, 64, 72, recurrent, {entity -> Modifier, sentence -> 0, chunk -> 2}, []}    |
    |{chunk, 98, 107, cellulitis, {entity -> Diagnosis, sentence -> 0, chunk -> 3}, []} |
    |{chunk, 110, 119, pneumonias, {entity -> Diagnosis, sentence -> 0, chunk -> 4}, []}|
    +-----------------------------------------------------------------------------------+
    See also

    MedicalNerModel
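
    The conversion from IOB tags to chunks can be sketched in plain Scala as a single fold over the tagged tokens. This is an illustration under stated assumptions rather than the converter's implementation: `Tag`, `Chunk`, and `convert` are hypothetical names, multi-token chunk text is joined with a single space, and none of the filtering parameters are modeled.

```scala
object IobToChunks {
  // Hypothetical stand-ins for NAMED_ENTITY and CHUNK annotations (offsets are inclusive).
  final case class Tag(begin: Int, end: Int, label: String, word: String)
  final case class Chunk(begin: Int, end: Int, text: String, entity: String)

  def convert(tags: Seq[Tag]): List[Chunk] = {
    // Fold left, carrying the finished chunks and the chunk currently being built.
    val (done, open) = tags.foldLeft((List.empty[Chunk], Option.empty[Chunk])) {
      case ((acc, current), tag) =>
        tag.label.split("-", 2) match {
          // "I-X" extends an open chunk of the same entity.
          case Array("I", entity) if current.exists(_.entity == entity) =>
            val c = current.get
            (acc, Some(c.copy(end = tag.end, text = c.text + " " + tag.word)))
          // "B-X" (or an "I-X" with nothing to extend) starts a new chunk.
          case Array("B" | "I", entity) =>
            (acc ++ current, Some(Chunk(tag.begin, tag.end, tag.word, entity)))
          // "O" closes any open chunk and is itself dropped.
          case _ =>
            (acc ++ current, None)
        }
    }
    done ++ open
  }
}
```

    Applied to the first rows of the MedicalNerModel output above, `convert` keeps `63-year-old` (Age) and `man` (Gender) and drops the tokens tagged `O`, which mirrors the first two chunk rows.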

  12. class NerTemplateRenderModel extends AnnotatorModel[NerTemplateRenderModel] with HasSimpleAnnotate[NerTemplateRenderModel] with CheckLicense

    Renders a list of SparkNLP for Healthcare templates provided as a StringArrayParam.

    A NerTemplateRenderModel outputs DOCUMENT annotations based on the provided templates.

    See also

    NerTemplateRenderModel

  13. class PretrainedZeroShotNER extends AnnotatorModel[PretrainedZeroShotNER] with ParamsAndFeaturesWritable with HasSimpleAnnotate[PretrainedZeroShotNER] with WriteOnnxModel with WriteSentencePieceModel with CheckLicense
  14. trait ReadPretrainedZeroShotNER extends ReadSentencePieceModel with InternalReadOnnxModel
  15. trait ReadZeroShotNerTensorflowModel extends ReadTensorflowModel with ReadOnnxModel
  16. trait ReadablePretrainedMedicalNer extends ParamsAndFeaturesReadable[MedicalNerModel] with HasPretrained[MedicalNerModel]
  17. trait ReadablePretrainedPretrainedZeroShotNER extends ParamsAndFeaturesReadable[PretrainedZeroShotNER] with HasPretrained[PretrainedZeroShotNER]
  18. trait ReadablePretrainedZeroShotNer extends ParamsAndFeaturesReadable[ZeroShotNerModel] with HasPretrained[ZeroShotNerModel]
  19. trait ReadsMedicalNerGraph extends ParamsAndFeaturesReadable[MedicalNerModel] with ReadTensorflowModel
  20. trait WithMedicalNerGraphResolver extends AnyRef
  21. class ZeroShotNerModel extends RoBertaForQuestionAnswering with CheckLicense

    ZeroShotNerModel implements zero-shot named entity recognition using RoBERTa transformer models fine-tuned on a question answering task. Its input is a list of document annotations, and it automatically generates questions that are used to recognize entities. The entity definitions are given as a dictionary structure that specifies a set of questions for each entity. The model is based on RoBertaForQuestionAnswering from the open-source SparkNLP project.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val zeroShotNer = ZeroShotNerModel.pretrained()
      .setInputCols("document")
      .setOutputCol("zero_shot_ner")

    For available pretrained models please see the Models Hub.

    Example

     val documentAssembler = new DocumentAssembler()
       .setInputCol("text")
       .setOutputCol("document")
    
     val sentenceDetector = new SentenceDetector()
       .setInputCols(Array("document"))
       .setOutputCol("sentences")
    
     val zeroShotNer = ZeroShotNerModel
       .pretrained()
       .setEntityDefinitions(
         Map(
           "NAME" -> Array("What is his name?", "What is her name?"),
           "CITY" -> Array("Which city?", "Which is the city?")))
       .setPredictionThreshold(0.01f)
       .setInputCols("sentences")
       .setOutputCol("zero_shot_ner")
    
     val pipeline = new Pipeline()
       .setStages(Array(
         documentAssembler,
         sentenceDetector,
         zeroShotNer))
    
     val model = pipeline.fit(Seq("").toDS.toDF("text"))
     val results = model.transform(
       Seq("Clara often travels between New York and Paris.").toDS.toDF("text"))
    
     results
       .selectExpr("document", "explode(zero_shot_ner) AS entity")
       .select(
         col("entity.result"),
         col("entity.metadata.word"),
         col("entity.metadata.sentence"),
         col("entity.begin"),
         col("entity.end"),
         col("entity.metadata.confidence"),
         col("entity.metadata.question"))
       .show(truncate=false)
    
    +------+-----+--------+-----+---+----------+------------------+
    |result|word |sentence|begin|end|confidence|question          |
    +------+-----+--------+-----+---+----------+------------------+
    |B-CITY|Paris|0       |41   |45 |0.78655756|Which is the city?|
    |B-CITY|New  |0       |28   |30 |0.29346612|Which city?       |
    |I-CITY|York |0       |32   |35 |0.29346612|Which city?       |
    +------+-----+--------+-----+---+----------+------------------+
    See also

    https://arxiv.org/abs/1907.11692 for details about the RoBERTa transformer

    RoBertaForQuestionAnswering for the SparkNLP implementation of RoBERTa question answering
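
    The interplay between entity definitions, generated questions, and the prediction threshold can be sketched without the model itself. In the sketch below the question answering step is replaced by a caller-supplied function, so everything here is hypothetical: `Answer`, `Entity`, `recognize`, and the stubbed `ask` function with its rounded confidence values are illustrative and are not part of the library.

```scala
object ZeroShotSketch {
  final case class Answer(text: String, begin: Int, end: Int, confidence: Float)
  final case class Entity(label: String, text: String, begin: Int, end: Int,
                          confidence: Float, question: String)

  // Asks every question of every entity and keeps the answers at or above the threshold.
  def recognize(
      definitions: Map[String, Seq[String]],
      threshold: Float,
      ask: String => Seq[Answer]): Seq[Entity] =
    for {
      (label, questions) <- definitions.toSeq
      question <- questions
      answer <- ask(question)
      if answer.confidence >= threshold
    } yield Entity(label, answer.text, answer.begin, answer.end, answer.confidence, question)

  // Stub standing in for the question answering model (made-up answers for the sketch).
  val ask: String => Seq[Answer] = {
    case "Which city?" => Seq(Answer("Paris", 41, 45, 0.79f), Answer("New York", 28, 35, 0.29f))
    case _             => Seq.empty
  }

  val definitions: Map[String, Seq[String]] = Map(
    "NAME" -> Seq("What is his name?", "What is her name?"),
    "CITY" -> Seq("Which city?"))
}
```

    With this stub, a threshold of 0.01 keeps both CITY answers, while a threshold of 0.5 keeps only `Paris`. This is the role the value passed to `setPredictionThreshold` plays in the example above.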

Value Members

  1. object IOBTagger extends ParamsAndFeaturesReadable[IOBTagger] with Serializable
  2. object MedicalNerApproach extends DefaultParamsReadable[MedicalNerApproach] with WithMedicalNerGraphResolver with Serializable
  3. object MedicalNerModel extends ReadablePretrainedMedicalNer with ReadsMedicalNerGraph with Serializable
  4. object NerChunker extends DefaultParamsReadable[Chunker] with Serializable
  5. object NerConverterInternalModel extends ParamsAndFeaturesReadable[NerConverterInternalModel] with Serializable
  6. object NerTaggedInternal
  7. object NerTagsEncodingInternal

    Works with different NER tag representations. Supports IOB and IOB2. See https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)

  8. object NerTemplateRenderModel extends ParamsAndFeaturesReadable[NerTemplateRenderModel] with Serializable
  9. object PretrainedZeroShotNER extends ReadablePretrainedPretrainedZeroShotNER with ReadPretrainedZeroShotNER with Serializable
  10. object ZeroShotNerModel extends ReadablePretrainedZeroShotNer with ReadZeroShotNerTensorflowModel with Serializable
