
package com.johnsnowlabs.nlp.annotators


Type Members

  1. class Chunk2Token extends AnnotatorModel[Chunk2Token] with HasSimpleAnnotate[Chunk2Token]

    A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).

    When the input is empty, an empty array is returned.

    This annotator is especially convenient when using NGramGenerator annotations as inputs to a WordEmbeddingsModel (see the sketch after the example output below).

    Example

    Define a pipeline for generating n-grams

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.{Chunk2Token, NGramGenerator, Tokenizer}
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
    import org.apache.spark.ml.Pipeline

    val data = Seq("A 63-year-old man presents to the hospital ...").toDF("text")
    val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
    val token = new Tokenizer().setInputCols("sentence").setOutputCol("token")
    val ngrammer = new NGramGenerator()
      .setN(2)
      .setEnableCumulative(false)
      .setInputCols("token")
      .setOutputCol("ngrams")
      .setDelimiter("_")

    Then add a stage to convert the n-gram CHUNK annotations to TOKEN type

    val chunk2Token = new Chunk2Token().setInputCols("ngrams").setOutputCol("ngram_tokens")
    val trainingPipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, ngrammer, chunk2Token)).fit(data)
    
    val result = trainingPipeline.transform(data).cache()
    result.selectExpr("explode(ngram_tokens)").show(5, false)
      +----------------------------------------------------------------+
      |col                                                             |
      +----------------------------------------------------------------+
      |{token, 3, 15, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
      |{token, 5, 19, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
      |{token, 17, 28, man_presents, {sentence -> 0, chunk -> 2}, []}  |
      |{token, 21, 31, presents_to, {sentence -> 0, chunk -> 3}, []}   |
      |{token, 30, 35, to_the, {sentence -> 0, chunk -> 4}, []}        |
      +----------------------------------------------------------------+
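    As noted above, the chunk-based tokens can then be passed on to word embeddings. A minimal sketch, not taken from this page, assuming the default pretrained glove_100d WordEmbeddingsModel and reusing the stages defined in the example above (column names are illustrative):

    // Sketch: embed each n-gram as if it were a single token by feeding the
    // TOKEN-typed "ngram_tokens" column into a pretrained WordEmbeddingsModel.
    import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel

    val embeddings = WordEmbeddingsModel.pretrained()
      .setInputCols("sentence", "ngram_tokens")
      .setOutputCol("ngram_embeddings")

    val embeddingPipeline = new Pipeline()
      .setStages(Array(document, sentenceDetector, token, ngrammer, chunk2Token, embeddings))
      .fit(data)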
    See also

    NGramGenerator

  2. class DrugNormalizer extends AnnotatorModel[DrugNormalizer] with HasSimpleAnnotate[DrugNormalizer] with CheckLicense

    Annotator which normalizes raw text from clinical documents, e.g. scraped web pages or XML documents, from document-type columns into Sentence. It removes all dirty characters from the text following one or more input regex patterns, can remove unwanted characters according to a specific policy, and can apply lowercase normalization.

    See Spark NLP Workshop for more examples of usage.
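
    The policy-based removal and lowercasing mentioned above are configured on the annotator itself. A minimal sketch, assuming the setPolicy and setLowercase parameters and the policy values "all", "abbreviations" and "dosages", none of which are listed on this page:

    // Hypothetical configuration sketch: parameter names and values are assumptions,
    // not taken from this page. setPolicy selects which normalization rules to apply;
    // setLowercase toggles lowercasing of the normalized output.
    val configuredNormalizer = new DrugNormalizer()
      .setInputCols("document")
      .setOutputCol("document_normalized")
      .setPolicy("all")        // assumed values: "all", "abbreviations", "dosages"
      .setLowercase(false)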

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.DrugNormalizer
    import org.apache.spark.ml.Pipeline

    val data = Seq(
      "Sodium Chloride/Potassium Chloride 13bag",
      "interferon alfa-2b 10 million unit ( 1 ml ) injec",
      "aspirin 10 meq/ 5 ml oral sol"
    ).toDF("text")
    val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val drugNormalizer = new DrugNormalizer().setInputCols("document").setOutputCol("document_normalized")

    val trainingPipeline = new Pipeline().setStages(Array(document, drugNormalizer))
    val result = trainingPipeline.fit(data).transform(data)

    result.selectExpr("explode(document_normalized.result) as normalized_text").show(false)
    +----------------------------------------------------+
    |normalized_text                                     |
    +----------------------------------------------------+
    |Sodium Chloride / Potassium Chloride 13 bag         |
    |interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
    |aspirin 2 meq/ml oral solution                      |
    +----------------------------------------------------+

Value Members

  1. object DrugNormalizer extends DefaultParamsReadable[DrugNormalizer] with Serializable
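
    Because the companion object extends DefaultParamsReadable, a persisted DrugNormalizer stage can be loaded back by path. A minimal sketch, assuming the standard Spark ML persistence API; the path is hypothetical:

    // Save a configured stage and restore it later; the path below is illustrative only.
    drugNormalizer.write.overwrite().save("/tmp/drug_normalizer_stage")
    val restoredNormalizer = DrugNormalizer.load("/tmp/drug_normalizer_stage")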
