Packages

p

com.johnsnowlabs.nlp

annotators

package annotators

Ordering
  1. Alphabetic
Visibility
  1. Public
  2. All

Type Members

  1. class Chunk2Token extends AnnotatorModel[Chunk2Token] with HasSimpleAnnotate[Chunk2Token]

    A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).

    A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).

    When the input is empty, an empty array is returned.

    This Annotator is specially convenient when using NGramGenerator annotations as inputs to WordEmbeddingsModels

    Example

    Define a pipeline for generating n-grams

    val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
    val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
    val token = new Tokenizer().setInputCols("sentence").setOutputCol("token")
    val ngrammer = new NGramGenerator()
     .setN(2)
     .setEnableCumulative(false)
     .setInputCols("token")
     .setOutputCol("ngrams")
     .setDelimiter("_")

    Stage to convert n-gram CHUNKS to TOKEN type

    val chunk2Token = new Chunk2Token().setInputCols("ngrams").setOutputCol("ngram_tokens")
    val trainingPipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, ngrammer, chunk2Token)).fit(data)
    
    val result = trainingPipeline.transform(data).cache()
    result.selectExpr("explode(ngram_tokens)").show(5, false)
      +----------------------------------------------------------------+
      |col                                                             |
      +----------------------------------------------------------------+
      |{token, 3, 15, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
      |{token, 5, 19, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
      |{token, 17, 28, man_presents, {sentence -> 0, chunk -> 2}, []}  |
      |{token, 21, 31, presents_to, {sentence -> 0, chunk -> 3}, []}   |
      |{token, 30, 35, to_the, {sentence -> 0, chunk -> 4}, []}        |
      +----------------------------------------------------------------+
    See also

    NGramGenerator

  2. class DocumentFiltererByClassifier extends AnnotatorModel[DocumentFiltererByClassifier] with HasSimpleAnnotate[DocumentFiltererByClassifier] with WhiteAndBlackListParams

    Filters documents by the result of classifier annotators.

    Filters documents by the result of classifier annotators. Documents are filtered by the white list and black list. The white list is a list of classifier results that are allowed to pass the filter. The black list is a list of classifier results that are not allowed to pass the filter. The filter is case sensitive. If the caseSensitive is set to false, the filter is case in-sensitive.

    Example

    val text ="""British Department of Health confirms first two cases of in UK.
       |So my trip to visit my australian exchange student just got canceled because of Coronavirus.
       |I wish everyone to be safe at home and stop pandemic.""".stripMargin
    val documentAssembler = new DocumentAssembler()
       .setInputCol("text").setOutputCol("document")
    val sentenceDetector = new SentenceDetector()
       .setInputCols("document").setOutputCol("sentence")
    val tokenizer = new Tokenizer()
       .setInputCols("sentence").setOutputCol("token")
    val medicalBFSC = MedicalBertForSequenceClassification
       .pretrained("bert_sequence_classifier_covid_sentiment", "en", "clinical/models")
       .setInputCols("sentence", "token").setOutputCol("classifier")
    val documentFilterer = new DocumentFiltererByClassifier()
       .setInputCols("sentence", "classifier").setOutputCol("filteredDocuments")
       .setWhiteList("positive")
       .setCaseSensitive(false)
    import spark.implicits._
    val textDF = Seq(text).toDF("text")
    val result = new Pipeline().setStages(Array(
       documentAssembler,
       sentenceDetector,
       tokenizer,
       medicalBFSC,
       documentFilterer
    )).fit(textDF).transform(textDF)
    result.selectExpr("explode(filteredDocuments)").show(false)
    
    +------------------------------------------------------------------------------------------------+
    |col                                                                                             |
    +------------------------------------------------------------------------------------------------+
    |{document, 159, 211, I wish everyone to be safe at home and stop pandemic., {sentence -> 2}, []}|
    +------------------------------------------------------------------------------------------------+
  3. class DocumentFiltererByNER extends AnnotatorModel[DocumentFiltererByNER] with HasSimpleAnnotate[DocumentFiltererByNER] with WhiteAndBlackListParams with CheckLicense

    Filters documents by the entity fields of the chunks.

    Filters documents by the entity fields of the chunks. Documents are filtered by the white list and black list. The white list is a list of classifier results that are allowed to pass the filter. The black list is a list of classifier results that are not allowed to pass the filter. The filter is case sensitive. If the caseSensitive is set to false, the filter is case in-sensitive. If the outputAsDocument is set to true, the output will be a single document with all sentences joined. The joinString parameter is used to add the delimiter between results of annotations when combining them into a single result.

    The input annotators are expected to be of type DOCUMENT and CHUNK. The output annotation type is DOCUMENT.

    Note

    A document may contain multiple chunks. If any of the chunks in the document is in the white list, the document will pass the filter. And white list has priority over black list.

  4. class DrugNormalizer extends AnnotatorModel[DrugNormalizer] with HasSimpleAnnotate[DrugNormalizer] with CheckLicense

    Annotator which normalizes raw text from clinical documents, e.g.

    Annotator which normalizes raw text from clinical documents, e.g. scraped web pages or xml documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can apply non wanted character removal which a specific policy. Can apply lower case normalization.

    See Spark NLP Workshop for more examples of usage.

    Example

    val data = Seq(
      ("Sodium Chloride/Potassium Chloride 13bag"),
      ("interferon alfa-2b 10 million unit ( 1 ml ) injec"),
      ("aspirin 10 meq/ 5 ml oral sol")
    ).toDF("text")
    val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val drugNormalizer = new DrugNormalizer().setInputCols("document").setOutputCol("document_normalized")
    
    val trainingPipeline = new Pipeline().setStages(Array(document, drugNormalizer))
    val result = trainingPipeline.fit(data).transform(data)
    
    result.selectExpr("explode(document_normalized.result) as normalized_text").show(false)
    +----------------------------------------------------+
    |normalized_text                                     |
    +----------------------------------------------------+
    |Sodium Chloride / Potassium Chloride 13 bag         |
    |interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
    |aspirin 2 meq/ml oral solution                      |
    +----------------------------------------------------+
  5. class Flattener extends Transformer with ParamsAndFeaturesWritable

    Converts annotation results into exploded and flattened format.

    Converts annotation results into exploded and flattened format. It is useful to extract the results from Spark NLP Pipelines. The Flattener outputs annotation(s) values into String.

    Example

     val dataSet = Seq("GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast." +
    " He is alert and oriented and answering questions appropriately.\nHEENT: Sclerae showed mild arcus senilis in the right." +
    " Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear." +
    "\nNECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, " +
    "supraclavicular, or axillary areas.\nABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, " +
    "although I do not appreciate a true spleen with inspiration.\nEXTREMITIES: There is some edema, but no cyanosis and " ).toDS.toDF("text")
    
    
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val sentenceDetector = new SentenceDetector().setInputCols(Array("document")).setOutputCol("sentence")
    val tokenizer = new Tokenizer().setInputCols(Array("sentence")).setOutputCol("token")
    val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence", "token")).setOutputCol("embeddings")
    val clinicalNer = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models").setInputCols(Array("sentence", "token", "embeddings")).setOutputCol("ner")
    val nerConverter = new NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
    val clinicalAssertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models").setInputCols(Array("sentence", "ner_chunk", "embeddings")).setOutputCol("assertion").setEntityAssertionCaseSensitive(false)
    
    val flattener = new Flattener()
     .setInputCols("sentence", "ner_chunk", "assertion")
     .setExplodeSelectedFields(Map("ner_chunk" -> Array("result","metadata.entity"),
                                   "assertion"->Array("result","metadata.confidence")))
    
    val pipeline = new Pipeline().setStages(
     Array(
       documentAssembler,
       sentenceDetector,
       tokenizer,
       wordEmbeddings,
       clinicalNer,
       nerConverter,
       clinicalAssertion,
       flattener
     ))
    
     val result = pipeline.fit(dataSet).transform(dataSet)
     result.show(false)
    
        +----------------------------------+-------------------------+----------------+-----------------------------+
        |ner_chunk_result                  |ner_chunk_metadata_entity|assertion_result|assertion_metadata_confidence|
        +----------------------------------+-------------------------+----------------+-----------------------------+
        |distress                          |Symptom                  |Absent          |1.0                          |
        |arcus senilis                     |Disease_Syndrome_Disorder|Past            |1.0                          |
        |jugular venous pressure distention|Symptom                  |Absent          |1.0                          |
        |adenopathy                        |Symptom                  |Absent          |1.0                          |
        |tender                            |Symptom                  |Absent          |1.0                          |
        |fullness                          |Symptom                  |Possible        |0.9999                       |
        |edema                             |Symptom                  |Present         |1.0                          |
        |cyanosis                          |VS_Finding               |Absent          |1.0                          |
        +----------------------------------+-------------------------+----------------+-----------------------------+
  6. class MultiChunk2Doc extends AnnotatorModel[MultiChunk2Doc] with HasSimpleAnnotate[MultiChunk2Doc] with WhiteAndBlackListParams with CheckLicense

    MultiChunk2Doc annotator merges a given chunks to create a document.

    MultiChunk2Doc annotator merges a given chunks to create a document. During document creation, a specific whitelist and blacklist filter can be applied, and case sensitivity can be adjusted.

    See also

    WhiteAndBlackListParams Additionally, specified prefix and suffix texts can be placed before and after the merged chunks in the resulting document. And a separator can be placed between the chunks.

    Example

    val document_assembler = new DocumentAssembler()
     .setInputCol("text").setOutputCol("document")
    
    val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
     .setInputCols("document").setOutputCol("sentence")
    
    val tokenizer = new Tokenizer()
     .setInputCols("sentence").setOutputCol("token")
    
    val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
     .setInputCols(Array("sentence", "token")).setOutputCol("embeddings")
    
    val ner = MedicalNerModel.pretrained("ner_clinical_large_langtest", "en", "clinical/models")
     .setInputCols("sentence", "token", "embeddings").setOutputCol("ner")
    
    val ner_converter = new NerConverterInternal()
     .setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
    
    val multi_chunk2_doc = new MultiChunk2Doc()
     .setInputCols("ner_chunk")
     .setOutputCol("new_doc")
     .setWhiteList(Array("test"))
     .setCaseSensitive(false)
     .setPrefix("<")
     .setSuffix(">")
     .setSeparator("><")
    
    val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter, multi_chunk2_doc))
    import spark.implicits._
    val data = Seq(
    """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM),
    | one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis,
    | presented with a one-week history of polyuria, poor appetite, and vomiting. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG.
    | She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa; significantly,
    | her abdominal examination was benign with no tenderness, guarding, or rigidity.""".stripMargin)
    .toDF("text")
    val result = pipeline.fit(data).transform(data)

    Show Results

    result.selectExpr("explode(new_doc) as result").show(false)
    
    +----------------------------------------------------------------------------------------------------------+
    |result                                                                                                    |
    +----------------------------------------------------------------------------------------------------------+
    |{document, 0, 48, <Physical examination><her abdominal examination>, {document -> 0, chunk_count -> 2}, []}|
    +----------------------------------------------------------------------------------------------------------+

Value Members

  1. object DocumentFiltererByClassifier extends DefaultParamsReadable[DocumentFiltererByClassifier] with Serializable

    This is the companion object of DocumentFiltererByClassifier.

    This is the companion object of DocumentFiltererByClassifier. Please refer to that class for the documentation.

  2. object DocumentFiltererByNER extends DefaultParamsReadable[DocumentFiltererByNER] with Serializable

    This is the companion object of DocumentFiltererByNER.

    This is the companion object of DocumentFiltererByNER. Please refer to that class for the documentation.

  3. object DrugNormalizer extends DefaultParamsReadable[DrugNormalizer] with Serializable
  4. object Flattener extends ParamsAndFeaturesReadable[Flattener] with Serializable

    This is the companion object of Flattener.

    This is the companion object of Flattener. Please refer to that class for the documentation.

  5. object MultiChunk2Doc extends DefaultParamsReadable[MultiChunk2Doc] with Serializable

    This is the companion object of MultiChunk2Doc.

    This is the companion object of MultiChunk2Doc. Please refer to that class for the documentation.

Ungrouped