chunker

package chunker


Type Members

class AssertionFilterer extends AnnotatorModel[AssertionFilterer] with HasSimpleAnnotate[AssertionFilterer] with HandleExceptionParams with HasSafeAnnotate[AssertionFilterer] with FilteringParams with CheckLicense

Filters entities coming from ASSERTION type annotations and returns the CHUNKS. Filters can be set via a white list and a black list on the extracted chunk or the assertion, or via a regular expression. White and black lists for the assertion are enabled by default. To use the chunk white list, criteria has to be set to "isin". For regex, criteria has to be set to "regex".

Example

To see how the assertions are extracted, see the example for AssertionDLModel.

Define an extra step where the assertions are filtered

val assertionFilterer = new AssertionFilterer()
  .setInputCols("sentence","ner_chunk","assertion")
  .setOutputCol("filtered")
  .setCriteria("assertion")
  .setWhiteList("present")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion,
  assertionFilterer
))

val assertionModel = assertionPipeline.fit(data)
val result = assertionModel.transform(data)

Show results:

result.selectExpr("ner_chunk.result", "assertion.result").show(3, truncate=false)
+--------------------------------+--------------------------------+
|result                          |result                          |
+--------------------------------+--------------------------------+
|[severe fever, sore throat]     |[present, present]              |
|[stomach pain]                  |[absent]                        |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+

result.select("filtered.result").show(3, truncate=false)
+---------------------------+
|result                     |
+---------------------------+
|[severe fever, sore throat]|
|[]                         |
|[an epidural, PCA]         |
+---------------------------+
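
The effect of the white list in the example above can be pictured with a small plain-Scala sketch. It is not the annotator's implementation: the `AssertedChunk` type and the `filterByAssertion` helper are hypothetical names introduced for illustration, and the rule "keep a chunk when its assertion is white-listed and not black-listed" is an assumption that happens to reproduce the output shown above.

```scala
object AssertionFilterSketch {
  // Hypothetical pairing of a chunk with the assertion status assigned to it.
  final case class AssertedChunk(chunk: String, assertion: String)

  // Assumed rule: keep a chunk when its assertion is white-listed
  // (an empty white list accepts everything) and not black-listed.
  def filterByAssertion(
      row: Seq[AssertedChunk],
      whiteList: Set[String] = Set.empty,
      blackList: Set[String] = Set.empty): Seq[String] =
    row.filter { c =>
      (whiteList.isEmpty || whiteList.contains(c.assertion)) &&
      !blackList.contains(c.assertion)
    }.map(_.chunk)

  // The three rows from the example output above.
  val rows: Seq[Seq[AssertedChunk]] = Seq(
    Seq(AssertedChunk("severe fever", "present"), AssertedChunk("sore throat", "present")),
    Seq(AssertedChunk("stomach pain", "absent")),
    Seq(AssertedChunk("an epidural", "present"), AssertedChunk("PCA", "present"),
        AssertedChunk("pain control", "hypothetical")))

  // Same setting as setCriteria("assertion") with setWhiteList("present").
  val filtered: Seq[Seq[String]] =
    rows.map(filterByAssertion(_, whiteList = Set("present")))
}
```

Under these assumptions, `filtered` holds the same three rows as the `filtered.result` column above, including the empty row for the chunk asserted as "absent".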

See also: AssertionDLModel to extract the assertions

class ChunkConverter extends AnnotatorModel[ChunkConverter] with HasSimpleAnnotate[ChunkConverter] with SourceTrackingMetadataParams with CheckLicense

Converts chunks from RegexMatcher to chunks with an entity in the metadata. Uses the identifier or field as the entity.

Example

  val sampleDataset = ResourceHelper.spark.createDataFrame(Seq(
    (1, "My first sentence with the first rule. This is my second sentence with ceremonies rule.")
  )).toDF("id", "text")

  val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

  val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")

  val regexMatcher = new RegexMatcher()
    .setExternalRules(ExternalResource("src/test/resources/regex-matcher/rules.txt", ReadAs.TEXT, Map("delimiter" -> ",")))
    .setInputCols(Array("sentence"))
    .setOutputCol("regex")
    .setStrategy(strategy) // `strategy` must be defined beforehand: "MATCH_FIRST", "MATCH_ALL" or "MATCH_COMPLETE"

  val chunkConverter = new ChunkConverter().setInputCols("regex").setOutputCol("chunk")

  val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher,chunkConverter))

  val results = pipeline.fit(sampleDataset).transform(sampleDataset)
  results.selectExpr("explode(chunk)").show(truncate = false)
+------------------------------------------------------------------------------------------------+
|col                                                                                             |
+------------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> NAME, sentence -> 0, chunk -> 0, entity -> NAME], []] |
|[chunk, 71, 80, ceremonies, [identifier -> NAME, sentence -> 1, chunk -> 0, entity -> NAME], []]|
+------------------------------------------------------------------------------------------------+
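
As a rough illustration of the metadata change visible in the output above, the following plain-Scala sketch copies the `identifier` value (or, failing that, a `field` value) into a new `entity` key. The `Chunk` type and the `withEntity` helper are hypothetical, and the lookup order is an assumption rather than the annotator's documented behavior.

```scala
object ChunkConverterSketch {
  // Hypothetical, simplified stand-in for a chunk annotation.
  final case class Chunk(begin: Int, end: Int, result: String, metadata: Map[String, String])

  // Assumed rule: use "identifier" if present, otherwise "field", as the entity.
  // Chunks that carry neither key are returned unchanged.
  def withEntity(chunk: Chunk): Chunk = {
    val entity = chunk.metadata.get("identifier").orElse(chunk.metadata.get("field"))
    entity.fold(chunk)(e => chunk.copy(metadata = chunk.metadata + ("entity" -> e)))
  }
}
```

Applied to a chunk whose metadata is `identifier -> NAME, sentence -> 0, chunk -> 0`, the sketch adds `entity -> NAME`, which matches the metadata shape in the table above.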

class ChunkFilterer extends AnnotatorModel[ChunkFilterer] with HasSimpleAnnotate[ChunkFilterer] with CheckLicense with HandleExceptionParams with HasSafeAnnotate[ChunkFilterer] with FilteringParams

ChunkFilterer can filter chunks coming from CHUNK annotations. Filters can be set via a white list and a black list, or via a regular expression. The white list criteria is enabled by default. To use regex, criteria has to be set to "regex". Additionally, it can filter chunks according to the confidence of the chunk in the metadata.

Example

Filtering POS tags

First, the pipeline stages that extract the POS tags are defined:

val data = Seq("Has a past history of gastroenteritis and stomach pain, however patient ...").toDF("text")
val docAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new Tokenizer().setInputCols("sentence").setOutputCol("token")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

val chunker = new Chunker()
  .setInputCols("pos", "sentence")
  .setOutputCol("chunk")
  .setRegexParsers(Array("(<NN>)+"))

Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.

val chunkerFilter = new ChunkFilterer()
  .setInputCols("sentence","chunk")
  .setOutputCol("filtered")
  .setCriteria("isin")
  .setWhiteList("gastroenteritis")

val pipeline = new Pipeline().setStages(Array(
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter))

val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk)").show(truncate=false)
+---------------------------------------------------------------------------------+
|col                                                                              |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
+---------------------------------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
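
The example above only exercises the "isin" criteria. The sketch below models, in plain Scala, how the three kinds of filter described for ChunkFilterer could combine: white and black lists, regular expressions, and a confidence threshold. All names (`Chunk`, `IsIn`, `MatchesRegex`, `keep`, `minConfidence`) are hypothetical, and the exact matching rules (exact string match for lists, partial match for regex, chunks without a confidence always pass) are assumptions, not the annotator's implementation.

```scala
object ChunkFilterSketch {
  // Hypothetical chunk: its text plus an optional confidence from the metadata.
  final case class Chunk(result: String, confidence: Option[Double] = None)

  sealed trait Criteria
  // "isin": white/black lists compared against the chunk text (assumed exact match).
  final case class IsIn(whiteList: Set[String] = Set.empty,
                        blackList: Set[String] = Set.empty) extends Criteria
  // "regex": keep the chunk if any pattern matches somewhere in the chunk text.
  final case class MatchesRegex(patterns: Seq[String]) extends Criteria

  def keep(chunk: Chunk, criteria: Criteria, minConfidence: Double = 0.0): Boolean = {
    // Chunks without a confidence value are not rejected by the threshold.
    val confident = chunk.confidence.forall(_ >= minConfidence)
    val matches = criteria match {
      case IsIn(white, black) =>
        (white.isEmpty || white.contains(chunk.result)) && !black.contains(chunk.result)
      case MatchesRegex(patterns) =>
        patterns.exists(p => p.r.findFirstIn(chunk.result).isDefined)
    }
    confident && matches
  }

  // The chunk texts from the example output above.
  val chunks: Seq[Chunk] = Seq(
    "history", "gastroenteritis", "stomach pain", "patient",
    "stomach pain now.We don't care", "gastroenteritis").map(Chunk(_))
}
```

With `IsIn(whiteList = Set("gastroenteritis"))` the sketch keeps the two "gastroenteritis" chunks, as in the filtered output above; with `MatchesRegex(Seq("pain"))` it would keep the two chunks containing "pain" instead.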

class ChunkFiltererApproach extends AnnotatorApproach[ChunkFilterer] with HasFeatures with FilteringParams with HandleExceptionParams with CheckLicense

Trains a ChunkFilterer annotator. ChunkFiltererApproach can filter chunks coming from CHUNK annotations. Filters can be set via a white list and a black list, or via a regular expression. The white list criteria is enabled by default. To use regex, criteria has to be set to "regex". Additionally, it can filter chunks according to the confidence of the chunk in the metadata.

Example

Filtering POS tags

First, the pipeline stages that extract the POS tags are defined:

val data = Seq("Has a past history of gastroenteritis and stomach pain, however patient ...").toDF("text")
val docAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new Tokenizer().setInputCols("sentence").setOutputCol("token")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

val chunker = new Chunker()
  .setInputCols("pos", "sentence")
  .setOutputCol("chunk")
  .setRegexParsers(Array("(<NN>)+"))

Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.

val chunkerFilter = new ChunkFiltererApproach()
  .setInputCols("sentence","chunk")
  .setOutputCol("filtered")
  .setCriteria("isin")
  .setWhiteList("gastroenteritis")

val pipeline = new Pipeline().setStages(Array(
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter))

val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk)").show(truncate=false)
+---------------------------------------------------------------------------------+
|col                                                                              |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
+---------------------------------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+

class ChunkKeyPhraseExtraction extends BertSentenceEmbeddings with CheckLicense
Extracts key phrases from texts.
ChunkKeyPhraseExtraction uses BertSentenceEmbeddings to determine the most relevant key phrases describing a text, using two approaches:
- By using cosine similarities between the embedding representation of the chunks and the embedding representation of the corresponding sentences/documents.
- By using the Maximal Marginal Relevance (MMR) algorithm (set with the setDivergence method) to determine the most relevant key phrases. If the selectMostDifferent parameter is set, it returns the key phrases that are the most different from each other (avoiding key phrases that are too similar).

The model compares the chunks against the corresponding sentences/documents and selects the chunks which are most representative of the broader text context (i.e., the document or the sentence they belong to). This makes it possible, for example, to obtain a brief understanding of a document by selecting the most relevant phrases.

The input to the model consists of chunk annotations and sentence or document annotations. The input chunks can be generated in various ways:
- Using NGramGenerator, which yields ranked n-gram chunks from the text (can be used to identify new entities).
- Using YakeKeywordExtractor, which ranks the keywords extracted using the YAKE algorithm.
- Using TextMatcher, which ranks the desired chunks from the annotator.
- Using NerConverter, which extracts ranked named entities (which entities are the most relevant in the sentence/document).

The model operates either at sentence level (selecting the most descriptive chunks from the sentence they belong to) or at document level. In the latter case, the key phrases are selected to represent all the input document annotations.

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at Models Hub.
```
val embeddings = ChunkKeyPhraseExtraction.pretrained()
  .setInputCols("sentence", "chunk")
  .setOutputCol("key_phrase_chunks")
```
The default model is "sbert_jsl_medium_uncased", if no name is provided.
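
The two scoring approaches described above, cosine similarity against the document embedding and MMR-based reranking, can be sketched in plain Scala without Spark. This is an illustration, not the model's implementation: `MmrSketch`, `cosine` and `selectKeyPhrases` are hypothetical names, the embeddings are toy vectors, and the weighting `(1 - divergence) * relevance - divergence * redundancy` is an assumption. It is at least consistent with the first row of the example output in this section, where 0.3 times the DocumentSimilarity 0.6326 gives approximately the MMRScore 0.1898 for a divergence of 0.7.

```scala
object MmrSketch {
  type Vec = Array[Double]

  // Cosine similarity between two vectors; 0.0 if either vector has zero norm.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  // Greedy MMR: repeatedly pick the chunk with the best trade-off between
  // similarity to the document and similarity to already selected chunks.
  def selectKeyPhrases(
      doc: Vec,
      chunks: Seq[(String, Vec)],
      topN: Int,
      divergence: Double): Seq[(String, Double)] = {
    @annotation.tailrec
    def loop(
        remaining: Seq[(String, Vec)],
        selected: Vector[(String, Vec, Double)]): Vector[(String, Vec, Double)] =
      if (remaining.isEmpty || selected.size >= topN) selected
      else {
        val scored = remaining.map { case (text, vec) =>
          val relevance = cosine(vec, doc)
          val redundancy =
            if (selected.isEmpty) 0.0 else selected.map(s => cosine(vec, s._2)).max
          // Assumed weighting, see the lead-in above.
          (text, vec, (1 - divergence) * relevance - divergence * redundancy)
        }
        val best = scored.maxBy(_._3)
        // Chunks are identified by their text, so duplicates are dropped together.
        loop(remaining.filterNot(_._1 == best._1), selected :+ best)
      }
    loop(chunks, Vector.empty).map { case (text, _, score) => (text, score) }
  }
}
```

With a divergence of 0 the sketch ranks purely by cosine similarity to the document; higher values increasingly penalize chunks that resemble ones already chosen, which is the trade-off that setDivergence is described as controlling.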
Sources:
The use of MMR, diversity-based reranking for reordering documents and producing summaries
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Example
```
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{NGramGenerator, StopWordsCleaner, Tokenizer}
// ChunkKeyPhraseExtraction is part of the licensed Spark NLP for Healthcare library
import com.johnsnowlabs.nlp.annotators.chunker.ChunkKeyPhraseExtraction
import org.apache.spark.ml.Pipeline

 val documentAssembler = new DocumentAssembler()
   .setInputCol("text")
   .setOutputCol("document")

 val tokenizer = new Tokenizer()
   .setInputCols("document")
   .setOutputCol("tokens")

 val stopWordsCleaner = StopWordsCleaner.pretrained()
   .setInputCols("tokens")
   .setOutputCol("clean_tokens")
   .setCaseSensitive(false)

 val nGrams = new NGramGenerator()
   .setInputCols(Array("clean_tokens"))
   .setOutputCol("ngrams")
   .setN(3)


 val chunkKeyPhraseExtractor = ChunkKeyPhraseExtraction
   .pretrained()
   .setTopN(2)
   .setDivergence(0.7f)
   .setInputCols(Array("document", "ngrams"))
   .setOutputCol("key_phrases")

 val pipeline = new Pipeline()
   .setStages(Array(
     documentAssembler,
     tokenizer,
     stopWordsCleaner,
     nGrams,
     chunkKeyPhraseExtractor))

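val sampleText = "Her Diabetes has become type 2 in the last year with her Diabetes." +
   " He complains of swelling in his right forearm."

// All stages are pretrained or rule-based, so fitting on an empty dataset is enough
val emptyDataset = Seq("").toDS.toDF("text")
val testDataset = Seq(sampleText).toDS.toDF("text")
val result = pipeline.fit(emptyDataset).transform(testDataset)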

 result
   .selectExpr("explode(key_phrases) AS key_phrase")
   .selectExpr(
     "key_phrase.result",
     "key_phrase.metadata.DocumentSimilarity",
     "key_phrase.metadata.MMRScore")
   .show(truncate=false)

+--------------------------+-------------------+------------------+
|result                    |DocumentSimilarity |MMRScore          |
+--------------------------+-------------------+------------------+
|complains swelling forearm|0.6325718954229369 |0.1897715761677257|
|type 2 year               |0.40181028931546364|-0.189501077108947|
+--------------------------+-------------------+------------------+
```
See also
BertEmbeddings for token-level embeddings
BertSentenceEmbeddings for sentence-level embeddings
Annotators Main Page for a list of transformer based embeddings
class ChunkMapperApproach extends AnnotatorApproach[ChunkMapperModel] with CheckLicense with ChunkMapperFuzzyMatchingParams with HandleExceptionParams
class ChunkMapperFilterer extends AnnotatorModel[ChunkMapperFilterer] with HasSimpleAnnotate[ChunkMapperFilterer] with WhiteAndBlackListParams with CheckLicense
class ChunkMapperModel extends AnnotatorModel[ChunkMapperModel] with HasSimpleAnnotate[ChunkMapperModel] with CheckLicense with ChunkMapperFuzzyMatchingParams with HandleExceptionParams

class ChunkSentenceSplitter extends AnnotatorModel[ChunkSentenceSplitter] with HasSimpleAnnotate[ChunkSentenceSplitter] with CheckLicense

An annotator that splits a document into sentences based on provided chunks. The first piece of the document is treated as a header, and subsequent chunks are labeled with their associated entities.

This annotator is particularly useful when identifying titles and subtitles using Named Entity Recognition (NER), followed by a paragraph-level split.

Example

// Create a DataFrame with a "text" column (`text` is assumed to be a String defined elsewhere)
val data = Seq(text, text).toDS.toDF("text")

// Set up the NLP pipeline with DocumentAssembler, RegexMatcher, and ChunkSentenceSplitter
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("doc")
val regexMatcher = new RegexMatcher().setInputCols("doc").setOutputCol("chunks")
                  .setExternalRules("src/test/resources/chunker/title_regex.txt", ",")
val chunkSentenceSplitter = new ChunkSentenceSplitter().setInputCols("chunks", "doc").setOutputCol("paragraphs")
val pipeline = new Pipeline().setStages(Array(documentAssembler, regexMatcher, chunkSentenceSplitter))

// Fit the pipeline to the data and transform it
val result = pipeline.fit(data).transform(data).select("paragraphs")
result.show(truncate = false)
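
The splitting behavior described above can be approximated in plain Scala as follows. This is a sketch under stated assumptions, not the annotator's code: `Chunk`, `Paragraph`, `split` and the `"header"` label are hypothetical, and it assumes that each paragraph starts at the beginning of a chunk (so the chunk text itself is included) and runs up to the next chunk.

```scala
object ChunkSplitSketch {
  // Hypothetical, simplified chunk: where it starts and which entity it carries.
  final case class Chunk(begin: Int, entity: String)
  final case class Paragraph(entity: String, text: String)

  // Cut the text at every chunk start. The piece before the first chunk gets
  // the header label; every later piece is labeled with its chunk's entity.
  def split(text: String, chunks: Seq[Chunk], headerLabel: String = "header"): Seq[Paragraph] = {
    val sorted = chunks.sortBy(_.begin)
    val starts = 0 +: sorted.map(_.begin)
    val ends = sorted.map(_.begin) :+ text.length
    val labels = headerLabel +: sorted.map(_.entity)
    labels.zip(starts.zip(ends)).collect {
      // Skip empty pieces, e.g. when the first chunk starts at offset 0.
      case (label, (s, e)) if e > s => Paragraph(label, text.substring(s, e).trim)
    }
  }
}
```

For a text such as "Intro text. TITLE A body a. TITLE B body b." with two chunks located at the titles, the sketch yields one header paragraph followed by one paragraph per title.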

class DocMapperApproach extends ChunkMapperApproach
class DocMapperModel extends ChunkMapperModel

class Mapper2Chunk extends AnnotatorModel[Mapper2Chunk] with HasSimpleAnnotate[Mapper2Chunk]

This annotator converts 'LABELED_DEPENDENCY' type annotations coming from ChunkMapper into 'CHUNK' type annotations, creating a new chunk-type column that is compatible with annotators that use the chunk type as input.

Example

Define dataset

 val testText = "Patient resting in bed. Patient given azithromycin without any difficulty." +
" Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating "

val testDataSet = Seq(testText).toDS.toDF("text")

Define a pipeline

 val documentAssembler = new DocumentAssembler()
   .setInputCol("text")
   .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
   .setInputCols(Array("document"))
   .setOutputCol("sentences")

val tokenizer = new Tokenizer()
   .setInputCols(Array("sentences"))
   .setOutputCol("tokens")

val embedder = WordEmbeddingsModel
   .pretrained("embeddings_clinical", "en", "clinical/models")
   .setInputCols(Array("sentences", "tokens"))
   .setOutputCol("embeddings")

val nerTagger = MedicalNerModel
   .pretrained("ner_clinical", "en", "clinical/models")
   .setInputCols(Array("sentences", "tokens", "embeddings"))
   .setOutputCol("nerTags")

val nerConverter = new NerConverterInternal()
   .setInputCols(Array("sentences", "tokens", "nerTags"))
   .setOutputCol("nerChunks")

val chunkerMapper = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")
   .setInputCols("nerChunks")
   .setOutputCol("relations")
   .setRels(Array("action"))
   .setAllowMultiTokenChunk(true)
   .setEnableCharFingerprintMatching(true)

val mapper2Chunk = new Mapper2Chunk()
   .setInputCols("relations")
   .setOutputCol("chunk")

val flattener = new Flattener()
   .setInputCols("chunk")
   .setExplodeSelectedFields(Map("chunk" -> Array("result as result",
                                                  "metadata.__trained__ as chunk",
                                                  "metadata.relation as relation",
                                                  "annotatorType as annotatorType")))

val pipeline = new Pipeline()
   .setStages(Array(
         documentAssembler,
         sentenceDetector,
         tokenizer,
         embedder,
         nerTagger,
         nerConverter,
         chunkerMapper,
         mapper2Chunk,
         flattener
           )).fit(testDataSet)

val dataSetResult = pipeline.transform(testDataSet)
dataSetResult.show(false)

  +---------------+------------+--------+-------------+
  |result         |chunk       |relation|annotatorType|
  +---------------+------------+--------+-------------+
  |bactericidal   |azithromycin|action  |chunk        |
  |NONE           |null        |null    |chunk        |
  |antiemetic     |nausea      |action  |chunk        |
  |anti-abstinence|zofran      |action  |chunk        |
  |NONE           |null        |null    |chunk        |
  +---------------+------------+--------+-------------+
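
Conceptually, the conversion amounts to re-tagging each annotation while keeping its content, as the `annotatorType` column above suggests. The following plain-Scala sketch shows that idea only: the `Annotation` case class is a simplified, hypothetical stand-in for the library's annotation type, the lowercase type names are assumptions, and any additional metadata handling the real annotator performs is not modeled.

```scala
object Mapper2ChunkSketch {
  // Hypothetical, simplified annotation record.
  final case class Annotation(
      annotatorType: String,
      begin: Int,
      end: Int,
      result: String,
      metadata: Map[String, String])

  // Re-tag mapper output as a chunk; other annotation types pass through unchanged.
  def toChunk(a: Annotation): Annotation =
    if (a.annotatorType == "labeled_dependency") a.copy(annotatorType = "chunk") else a
}
```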

trait ReadChunkKeyPhraseExtractionTensorflowModel extends ReadTensorflowModel with InternalReadOnnxModel with ReadOpenvinoModel
trait ReadablePretrainedChunkKeyPhraseExtractionModel extends ParamsAndFeaturesReadable[ChunkKeyPhraseExtraction] with HasPretrained[ChunkKeyPhraseExtraction]
trait ReadablePretrainedChunkMapperModel extends ParamsAndFeaturesReadable[ChunkMapperModel] with HasPretrained[ChunkMapperModel]
trait ReadablePretrainedDocMapperModel extends ParamsAndFeaturesReadable[DocMapperModel] with HasPretrained[DocMapperModel]

Value Members

object AssertionFilterer extends ParamsAndFeaturesReadable[AssertionFilterer] with Serializable
object ChunkConverter extends ParamsAndFeaturesReadable[ChunkConverter] with Serializable
object ChunkFilterer extends ParamsAndFeaturesReadable[ChunkFilterer] with Serializable
object ChunkFiltererApproach extends DefaultParamsReadable[ChunkFiltererApproach] with Serializable
object ChunkKeyPhraseExtraction extends ReadablePretrainedChunkKeyPhraseExtractionModel with ReadChunkKeyPhraseExtractionTensorflowModel with Serializable
object ChunkMapperFilterer extends ParamsAndFeaturesReadable[ChunkMapperFilterer] with Serializable
object ChunkMapperModel extends ParamsAndFeaturesReadable[ChunkMapperModel] with ReadablePretrainedChunkMapperModel with Serializable
object ChunkSentenceSplitter extends ParamsAndFeaturesReadable[ChunkSentenceSplitter] with Serializable
object DocMapperModel extends ParamsAndFeaturesReadable[DocMapperModel] with ReadablePretrainedDocMapperModel with Serializable
object Mapper2Chunk extends DefaultParamsReadable[Mapper2Chunk] with Serializable
