package chunker

Type Members

  1. class AssertionFilterer extends AnnotatorModel[AssertionFilterer] with HasSimpleAnnotate[AssertionFilterer] with CheckLicense

    Filters entities coming from ASSERTION type annotations and returns the CHUNKS. Filters can be set via a white list on the extracted chunk, the assertion or a regular expression. The white list for assertions is enabled by default. To use the chunk white list, set the criteria to "isin". To filter by regular expression, set the criteria to "regex".

    Example

    To see how the assertions are extracted, see the example for AssertionDLModel.

    Define an extra step where the assertions are filtered:

    val assertionFilterer = new AssertionFilterer()
      .setInputCols("sentence","ner_chunk","assertion")
      .setOutputCol("filtered")
      .setCriteria("assertion")
      .setWhiteList("present")
    
    val assertionPipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      embeddings,
      nerModel,
      nerConverter,
      clinicalAssertion,
      assertionFilterer
    ))
    
    val assertionModel = assertionPipeline.fit(data)
    val result = assertionModel.transform(data)

    Show results:

    result.selectExpr("ner_chunk.result", "assertion.result").show(3, truncate=false)
    +--------------------------------+--------------------------------+
    |result                          |result                          |
    +--------------------------------+--------------------------------+
    |[severe fever, sore throat]     |[present, present]              |
    |[stomach pain]                  |[absent]                        |
    |[an epidural, PCA, pain control]|[present, present, hypothetical]|
    +--------------------------------+--------------------------------+
    
    result.select("filtered.result").show(3, truncate=false)
    +---------------------------+
    |result                     |
    +---------------------------+
    |[severe fever, sore throat]|
    |[]                         |
    |[an epidural, PCA]         |
    +---------------------------+

    See also

    AssertionDLModel to extract the assertions
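
    The three criteria described above can be illustrated without Spark. The following is a minimal plain-Scala sketch, not the library's implementation: `AssertedChunk` and `filter` are hypothetical names, and it assumes that "isin" means exact membership of the chunk text in the white list and that "regex" keeps a chunk when any pattern matches somewhere in its text.

```scala
object AssertionFilterSketch {
  // Hypothetical stand-in for one NER chunk paired with its assertion status.
  final case class AssertedChunk(chunk: String, assertion: String)

  // Returns the chunk texts that pass the chosen criteria.
  // Assumption: "isin" is exact membership; "regex" is a partial match.
  def filter(
      items: Seq[AssertedChunk],
      criteria: String,
      whiteList: Set[String] = Set.empty,
      regex: Seq[String] = Nil): Seq[String] = {
    def keep(item: AssertedChunk): Boolean = criteria match {
      case "assertion" => whiteList.contains(item.assertion)
      case "isin"      => whiteList.contains(item.chunk)
      case "regex"     => regex.exists(p => p.r.findFirstIn(item.chunk).isDefined)
      case other       => throw new IllegalArgumentException(s"Unknown criteria: $other")
    }
    items.filter(keep).map(_.chunk)
  }
}
```

    Applied to the third row of the example output with criteria "assertion" and white list "present", this sketch keeps "an epidural" and "PCA" and drops "pain control", matching the filtered column shown above.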

  2. class ChunkConverter extends AnnotatorModel[ChunkConverter] with HasSimpleAnnotate[ChunkConverter] with CheckLicense

    Converts chunks from a RegexMatcher into chunks that carry an entity in their metadata. The identifier or field is used as the entity.

    Example

    val sampleDataset = ResourceHelper.spark.createDataFrame(Seq(
      (1, "My first sentence with the first rule. This is my second sentence with ceremonies rule.")
    )).toDF("id", "text")

    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

    val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")

    // `strategy` is not defined in this snippet; supply a RegexMatcher strategy such as "MATCH_ALL"
    val regexMatcher = new RegexMatcher()
      .setExternalRules(ExternalResource("src/test/resources/regex-matcher/rules.txt", ReadAs.TEXT, Map("delimiter" -> ",")))
      .setInputCols(Array("sentence"))
      .setOutputCol("regex")
      .setStrategy(strategy)

    // Turn the regex matches into chunks whose metadata carries an entity
    val chunkConverter = new ChunkConverter().setInputCols("regex").setOutputCol("chunk")

    val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher, chunkConverter))

    val results = pipeline.fit(sampleDataset).transform(sampleDataset)
    results.select("chunk").show(truncate = false)
    +------------------------------------------------------------------------------------------------+
    |col                                                                                             |
    +------------------------------------------------------------------------------------------------+
    |[chunk, 23, 31, the first, [identifier -> NAME, sentence -> 0, chunk -> 0, entity -> NAME], []] |
    |[chunk, 71, 80, ceremonies, [identifier -> NAME, sentence -> 1, chunk -> 0, entity -> NAME], []]|
    +------------------------------------------------------------------------------------------------+
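
    The effect on the metadata in the output above can be sketched in plain Scala. This is an illustration under stated assumptions, not the library's code: `SimpleAnnotation` and `addEntity` are hypothetical names, and the fallback to a "field" metadata key is a guess based on the phrase "identifier or field".

```scala
object ChunkConverterSketch {
  // Hypothetical, simplified stand-in for an annotation: offsets, text, metadata.
  final case class SimpleAnnotation(
      begin: Int,
      end: Int,
      result: String,
      metadata: Map[String, String])

  // Copies the rule identifier (or, as an assumption, a "field" key) into "entity".
  // Annotations that carry neither key are returned unchanged.
  def addEntity(chunk: SimpleAnnotation): SimpleAnnotation =
    chunk.metadata.get("identifier").orElse(chunk.metadata.get("field")) match {
      case Some(entity) => chunk.copy(metadata = chunk.metadata + ("entity" -> entity))
      case None         => chunk
    }
}
```

    For the first row above, a chunk with metadata `identifier -> NAME` gains `entity -> NAME` while its offsets and text stay the same.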

  3. class ChunkFilterer extends AnnotatorModel[ChunkFilterer] with HasSimpleAnnotate[ChunkFilterer] with CheckLicense

    Filters entities coming from CHUNK annotations. Filters can be set via a white list of terms or a regular expression. The white list criterion is enabled by default. To filter by regular expression instead, set the criteria to "regex".

    Example

    Filtering POS tags

    First, the pipeline stages that extract the POS tags are defined:

    val data = Seq("Has a past history of gastroenteritis and stomach pain, however patient ...").toDF("text")
    val docAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
    val tokenizer = new Tokenizer().setInputCols("sentence").setOutputCol("token")
    
    val posTagger = PerceptronModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("pos")
    
    val chunker = new Chunker()
      .setInputCols("pos", "sentence")
      .setOutputCol("chunk")
      .setRegexParsers(Array("(<NN>)+"))

    Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.

    val chunkerFilter = new ChunkFilterer()
      .setInputCols("sentence","chunk")
      .setOutputCol("filtered")
      .setCriteria("isin")
      .setWhiteList("gastroenteritis")
    
    val pipeline = new Pipeline().setStages(Array(
      docAssembler,
      sentenceDetector,
      tokenizer,
      posTagger,
      chunker,
      chunkerFilter))
    
    val result = pipeline.fit(data).transform(data)

    result.selectExpr("explode(chunk)").show(truncate=false)
    +---------------------------------------------------------------------------------+
    |col                                                                              |
    +---------------------------------------------------------------------------------+
    |{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
    |{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
    |{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
    |{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
    |{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
    |{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
    +---------------------------------------------------------------------------------+
    
    result.selectExpr("explode(filtered)").show(truncate=false)
    +-------------------------------------------------------------------+
    |col                                                                |
    +-------------------------------------------------------------------+
    |{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
    |{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
    +-------------------------------------------------------------------+

  4. class ChunkFiltererApproach extends AnnotatorApproach[ChunkFilterer] with CheckLicense

    Filters entities coming from CHUNK annotations. Filters can be set via a white list of terms or a regular expression. The white list criterion is enabled by default. To filter by regular expression instead, set the criteria to "regex".

    Example

    Filtering POS tags

    First, the pipeline stages that extract the POS tags are defined:

    val data = Seq("Has a past history of gastroenteritis and stomach pain, however patient ...").toDF("text")
    val docAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
    val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
    val tokenizer = new Tokenizer().setInputCols("sentence").setOutputCol("token")
    
    val posTagger = PerceptronModel.pretrained()
      .setInputCols("sentence", "token")
      .setOutputCol("pos")
    
    val chunker = new Chunker()
      .setInputCols("pos", "sentence")
      .setOutputCol("chunk")
      .setRegexParsers(Array("(<NN>)+"))

    Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.

    val chunkerFilter = new ChunkFiltererApproach()
      .setInputCols("sentence","chunk")
      .setOutputCol("filtered")
      .setCriteria("isin")
      .setWhiteList("gastroenteritis")
    
    val pipeline = new Pipeline().setStages(Array(
      docAssembler,
      sentenceDetector,
      tokenizer,
      posTagger,
      chunker,
      chunkerFilter))
    
    val result = pipeline.fit(data).transform(data)

    result.selectExpr("explode(chunk)").show(truncate=false)
    +---------------------------------------------------------------------------------+
    |col                                                                              |
    +---------------------------------------------------------------------------------+
    |{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
    |{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
    |{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
    |{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
    |{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
    |{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
    +---------------------------------------------------------------------------------+
    
    result.selectExpr("explode(filtered)").show(truncate=false)
    +-------------------------------------------------------------------+
    |col                                                                |
    +-------------------------------------------------------------------+
    |{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
    |{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
    +-------------------------------------------------------------------+

  5. class ChunkKeyPhraseExtraction extends BertSentenceEmbeddings with CheckLicense

    Chunk KeyPhrase Extraction uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text. The input to the model consists of chunk annotations and sentence or document annotations. The model compares the chunks against the corresponding sentences or documents and selects the chunks that are most representative of the broader text context (i.e. the document or the sentence they belong to). The key phrase candidates (i.e. the input chunks) can be generated in various ways, e.g. by NGramGenerator, TextMatcher or NerConverter. The model operates either at the sentence level (selecting the most descriptive chunks from the sentence they belong to) or at the document level. In the latter case, the key phrases are selected to represent all the input document annotations.

    This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at Models Hub.

    val embeddings = ChunkKeyPhraseExtraction.pretrained()
      .setInputCols("sentence", "chunk")
      .setOutputCol("key_phrase_chunks")

    The default model is "sbert_jsl_medium_uncased" if no name is provided.

    Sources:

    The use of MMR, diversity-based reranking for reordering documents and producing summaries

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.{NGramGenerator, StopWordsCleaner, Tokenizer}
    import com.johnsnowlabs.nlp.annotators.chunker.ChunkKeyPhraseExtraction
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("tokens")

    val stopWordsCleaner = StopWordsCleaner.pretrained()
      .setInputCols("tokens")
      .setOutputCol("clean_tokens")
      .setCaseSensitive(false)

    val nGrams = new NGramGenerator()
      .setInputCols(Array("clean_tokens"))
      .setOutputCol("ngrams")
      .setN(3)

    val chunkKeyPhraseExtractor = ChunkKeyPhraseExtraction
      .pretrained()
      .setTopN(2)
      .setDivergence(0.7f)
      .setInputCols(Array("document", "ngrams"))
      .setOutputCol("key_phrases")

    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        tokenizer,
        stopWordsCleaner,
        nGrams,
        chunkKeyPhraseExtractor))
    
    val sampleText = "Her Diabetes has become type 2 in the last year with her Diabetes." +
      " He complains of swelling in his right forearm."

    // Build the test dataset from the sample text and fit the pipeline on it
    val testDataset = Seq(sampleText).toDS.toDF("text")
    val result = pipeline.fit(testDataset).transform(testDataset)

    result
      .selectExpr("explode(key_phrases) AS key_phrase")
      .selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore")
      .show(truncate=false)

    
    +--------------------------+-------------------+------------------+
    |result                    |DocumentSimilarity |MMRScore          |
    +--------------------------+-------------------+------------------+
    |complains swelling forearm|0.6325718954229369 |0.1897715761677257|
    |type 2 year               |0.40181028931546364|-0.189501077108947|
    +--------------------------+-------------------+------------------+

    See also

    BertEmbeddings for token-level embeddings

    BertSentenceEmbeddings for sentence-level embeddings

    Annotators Main Page for a list of transformer based embeddings
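
    The MMR-based selection referenced in the sources can be sketched in plain Scala. This is a minimal illustration, not the library's implementation. It assumes the score (1 - divergence) * similarity-to-document - divergence * (highest similarity to an already selected phrase), which is at least consistent with the first row of the sample output, where MMRScore is about 0.3 times DocumentSimilarity for a divergence of 0.7. `MmrSketch`, `cosine` and `select` are hypothetical names, and the vectors stand in for sentence embeddings.

```scala
object MmrSketch {
  type Vec = Array[Double]

  // Cosine similarity; returns 0.0 when either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Greedy MMR: returns (candidate index, MMR score) pairs in selection order.
  def select(
      doc: Vec,
      candidates: IndexedSeq[Vec],
      topN: Int,
      divergence: Double): List[(Int, Double)] = {
    val docSim = candidates.map(cosine(_, doc))

    @annotation.tailrec
    def loop(selected: List[(Int, Double)], remaining: Set[Int]): List[(Int, Double)] =
      if (selected.size >= topN || remaining.isEmpty) selected.reverse
      else {
        val scored = remaining.toSeq.sorted.map { i =>
          // Redundancy: highest similarity to any phrase picked so far.
          val redundancy =
            if (selected.isEmpty) 0.0
            else selected.map { case (j, _) => cosine(candidates(i), candidates(j)) }.max
          i -> ((1 - divergence) * docSim(i) - divergence * redundancy)
        }
        val best = scored.maxBy(_._2)
        loop(best :: selected, remaining - best._1)
      }

    loop(Nil, candidates.indices.toSet)
  }
}
```

    With a high divergence, a near-duplicate of the first pick is penalized and a less similar but more diverse candidate is chosen second. With a divergence of 0, the ranking reduces to plain similarity to the document.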

  6. class ChunkMapperApproach extends AnnotatorApproach[ChunkMapperModel] with CheckLicense
  7. class ChunkMapperModel extends AnnotatorModel[ChunkMapperModel] with HasSimpleAnnotate[ChunkMapperModel] with CheckLicense
  8. class ChunkSentenceSplitter extends AnnotatorModel[ChunkSentenceSplitter] with HasSimpleAnnotate[ChunkSentenceSplitter] with CheckLicense

    Splits the document at the chunks you provide and puts the chunk entity in the metadata. The first piece of the document, up to the first chunk, has header as its entity.

    This is useful when titles and subtitles have been identified with an NER model and the document should then be split into paragraphs.

    Example

    // `text` is a String holding the document to split; it is not defined in this snippet
    val data = Seq(text, text).toDS.toDF("text")
    val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("doc")
    val regexMatcher = new RegexMatcher().setInputCols("doc").setOutputCol("chunks").setExternalRules("src/test/resources/chunker/title_regex.txt", ",")
    val chunkSentenceSplitter = new ChunkSentenceSplitter().setInputCols("chunks", "doc").setOutputCol("paragraphs")
    val pipeline = new Pipeline().setStages(Array(documentAssembler, regexMatcher, chunkSentenceSplitter))
    val result = pipeline.fit(data).transform(data).select("paragraphs")
    result.show(truncate = false)
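
    The splitting behaviour described above can be sketched in plain Scala. This is an illustration under stated assumptions rather than the library's code: `Chunk`, `Piece` and `split` are hypothetical names, each piece is assumed to start at its chunk and run to the next chunk, and the label of the leading piece is taken to be the literal string "header".

```scala
object ChunkSplitSketch {
  // Hypothetical stand-ins: a chunk is reduced to its start offset and entity.
  final case class Chunk(begin: Int, entity: String)
  final case class Piece(text: String, entity: String)

  // Splits `text` at each chunk start. The piece before the first chunk gets
  // `headerEntity` (an assumed label); every other piece gets its chunk's entity.
  def split(text: String, chunks: Seq[Chunk], headerEntity: String = "header"): Seq[Piece] = {
    val sorted = chunks.sortBy(_.begin)
    // Each piece ends where the next chunk begins; the last one ends with the text.
    val ends = sorted.map(_.begin).drop(1) :+ text.length
    val firstEnd = sorted.headOption.map(_.begin).getOrElse(text.length)
    val first = Piece(text.substring(0, firstEnd), headerEntity)
    val rest = sorted.zip(ends).map { case (c, end) =>
      Piece(text.substring(c.begin, end), c.entity)
    }
    // Drop pieces that contain only whitespace (e.g. a chunk at offset 0).
    (first +: rest).filter(_.text.trim.nonEmpty)
  }
}
```

    With two title chunks in a document, the sketch yields three pieces: the leading text labeled "header", followed by one piece per title carrying that title's entity.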

  9. trait ReadChunkKeyPhraseExtractionTensorflowModel extends ReadTensorflowModel
  10. trait ReadablePretrainedChunkKeyPhraseExtractionModel extends ParamsAndFeaturesReadable[ChunkKeyPhraseExtraction] with HasPretrained[ChunkKeyPhraseExtraction]
  11. trait ReadablePretrainedChunkMapperModel extends ParamsAndFeaturesReadable[ChunkMapperModel] with HasPretrained[ChunkMapperModel]

Value Members

  1. object AssertionFilterer extends ParamsAndFeaturesReadable[AssertionFilterer] with Serializable
  2. object ChunkFilterer extends ParamsAndFeaturesReadable[ChunkFilterer] with Serializable
  3. object ChunkFiltererApproach extends DefaultParamsReadable[ChunkFiltererApproach] with Serializable
  4. object ChunkKeyPhraseExtraction extends ReadablePretrainedChunkKeyPhraseExtractionModel with ReadChunkKeyPhraseExtractionTensorflowModel with Serializable
  5. object ChunkMapperModel extends ParamsAndFeaturesReadable[ChunkMapperModel] with ReadablePretrainedChunkMapperModel with Serializable
  6. object ChunkSentenceSplitter extends ParamsAndFeaturesReadable[ChunkSentenceSplitter] with Serializable
