Enterprise NLP Annotators

 

A Spark NLP Enterprise license includes access to unique annotators. At the Spark NLP Workshop you can see different types of annotators in action.

By clicking on any annotator, you will see different sections:

  • The Approach, or class to train models.
  • The Model, to infer using pretrained models.

Also, for most of the annotators, you will find examples for the different enterprise libraries:

  • Healthcare NLP
  • Finance NLP
  • Legal NLP

Check out the Spark NLP Annotators page for more information on how to read this page.

Available Annotators

Annotators Description
AnnotationMerger Merge annotations from different pipeline steps that have the same annotation type into a unified annotation.
AssertionChunkConverter AssertionChunkConverter annotator uses both begin and end indices of the tokens as input to add a more robust metadata to the chunk column in a way that improves the reliability of the indices and avoid loss of data.
AssertionDL AssertionDL is a deep Learning based approach used to extract Assertion Status from extracted entities and text.
AssertionFilterer Filters entities coming from ASSERTION type annotations and returns the CHUNKS.
AssertionLogReg Logistic Regression is used to extract Assertion Status from extracted entities and text.
AssertionMerger Merges variety assertion columns coming from Assertion annotators
AverageEmbeddings Computes the mean of vector embeddings for two sentences of equal size, producing a unified representation
BertForSequenceClassification Can load Bert Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.
BertForTokenClassifier Can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) for Named-Entity-Recognition (NER) tasks.
BertSentenceChunkEmbeddings This annotator combines sentence and NER chunk embeddings to enhance resolution codes, leveraging contextual information in the embeddings for more precise results. It takes sentence context and NER chunks as input and produces embeddings for each chunk, facilitating input for the resolution model.
Chunk2Token A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).
ChunkConverter This annotator merges NER-detected entities with RegexMatcher-based rules for unified processing in the pipeline.
ChunkEntityResolver Returns a normalized entity for a particular trained ontology / curated dataset (e.g. clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc).
ChunkFilterer Filters entities coming from CHUNK annotations.
ChunkKeyPhraseExtraction Uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text.
ChunkMapper We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.
ChunkMapperFilterer Annotator to be used after ChunkMapper that allows to filter chunks based on the results of the mapping, whether it was successful or failed.
ChunkMerge Merges entities coming from different CHUNK annotations.
ChunkSentenceSplitter Annotator can split the documents into chunks according to separators given as CHUNK columns. It is useful when you need to perform different models or analysis in different sections of your document
ContextualParser Extracts entity from a document based on user defined rules.
DateNormalizer This annotator transforms date mentions to a common standard format: YYYY/MM/DD. It is useful when using data from different sources, some times from different countries that has different formats to represent dates.
DeIdentification Deidentifies Input Annotations of types DOCUMENT, TOKEN and CHUNK, by either masking or obfuscating the given CHUNKS.
DistilBertForSequenceClassification Can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.
Doc2ChunkInternal Converts DOCUMENT, TOKEN typed annotations into CHUNK type with the contents of a chunkCol.
DocMapper Uses the text representation of document annotations to map clinical codes to other codes or relevant information.
DocumentFiltererByClassifier This annotator sorts documents based on classifier results. It uses white and black lists, allowing or blocking specific outcomes. It can be case-sensitive or case-insensitive for broader matching. This tool efficiently organizes documents based on classifier outcomes.
DocumentHashCoder This annotator swaps dates in a document column with hash codes from another column, creating a new column with shifted day information. The subsequent DeIdentification annotator anonymizes the document, incorporating the altered dates.
DocumentLogRegClassifier Classifies documents with a Logarithmic Regression algorithm.
DocumentMLClassifier classifies documents with a Logarithmic Regression algorithm.
DrugNormalizer Annotator which normalizes raw text from documents, e.g. scraped web pages or xml documents.
EntityChunkEmbeddings Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector represention of related entity chunks.
EntityRulerInternal This annotator match exact strings or regex patterns provided in a file against a Document and assigns them an named entity.
FeaturesAssembler Collects features from different columns.
FewShotClassifier This Annotator specifically target few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data.
FewShotAssertionClassifierModel assertion classification using large (LLMS based) few shot classifiers based on the SetFit approach.
Flattener Flattener annotator in Spark NLP converts annotation results into a simplified DataFrame format for easier analysis and interpretation.
GenericClassifier Creates a generic single-label classifier which uses pre-generated Tensorflow graphs.
GenericLogRegClassifier Is a derivative of GenericClassifier which implements a multinomial logistic regression.
GenericSVMClassifier Creates a generic single-label classifier which uses pre-generated Tensorflow graphs.
InternalDocumentSplitter This annotator splits large documents into small documents.
IOBTagger Merges token tags and NER labels from chunks in the specified format.
LightDeIdentification Light version of DeIdentification.
MultiChunk2Doc Merges a given chunks to create a document.
NameChunkObfuscator This annotator allows to transform a dataset with an Input Annotation of type CHUNK, into its obfuscated version of by obfuscating the given CHUNKS.
NerChunker Extracts phrases that fits into a known pattern using the NER tags.
NerConverterInternal Converts a IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label.
NerDisambiguator Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB).
NerModel This Named Entity recognition annotator is a generic NER model based on Neural Networks.
NerQuestionGenerator This annotator takes an NER chunk (obtained by, e.g., NerConverterInternal) and generates a questions based on two entity types, a pronoun and a strategy.
QuestionAnswering GPT-based model for answering questions given a context.
RegexMatcherInternal RegexMatcherInternal matches predefined regex patterns with entities in text, allowing for flexible entity recognition based on user-defined rules.
ReIdentification Reidentifies obfuscated entities by DeIdentification.
RelationExtraction Extracts and classifies instances of relations between named entities.
RelationExtractionDL Extracts and classifies instances of relations between named entities.
RENerChunksFilter Filters and outputs combinations of relations between extracted entities, for further processing.
Replacer This annotator allows to replace entities in the original text with the ones extracted by the annotators NameChunkObfuscatorApproach or DateNormalizer.
Resolution2Chunk This annotator is responsible for converting the annotations generated by entity resolver models (typically labeled as ENTITY) into a format compatible with subsequent stages of the pipeline, such as the ChunkMapperModel.
ResolverMerger This annotator is provide the ability to merge sentence enitity resolver and chunk mapper model output columns.
Router This annotator is provide the ability to split an output of an annotator for a selected metadata field and the value for that field.
SentenceEntityResolver Returns the normalized entity for a particular trained ontology / curated dataset (e.g. clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc) based on sentence embeddings.
Summarizer Helps to quickly summarize complex medical information.
TextGenerator Uses the basic BioGPT model to perform various tasks related to medical text abstraction.
TextMatcherInternal This annotator match exact phrases provided in a file against a Document.
TFGraphBuilder Creates Tensorflow graphs.
WindowedSentenceModel This annotator that helps you to merge the previous and following sentences of a given piece of text, so that you add the context surrounding them.
ZeroShotNerModel This is a zero-shot named entity recognition using RoBertaForQuestionAnswering. It identifies entities across diverse data without domain-specific fine-tuning.
ZeroShotRelationExtractionModel This annotator implements zero-shot binary relations extraction by utilizing BERT transformer models trained on the NLI (Natural Language Inference) task.

AnnotationMerger

Model

Merge annotations from different pipeline steps that have the same annotation type into a unified annotation. Possible annotations that can be merged include:

  • document (e.g., output of DocumentAssembler annotator)
  • token (e.g., output of Tokenizer annotator)
  • word_embeddings (e.g., output of WordEmbeddingsModel annotator)
  • sentence_embeddings (e.g., output of BertSentenceEmbeddings annotator)
  • category (e.g., output of RelationExtractionModel annotator)
  • date (e.g., output of DateMatcher annotator)
  • sentiment (e.g., output of SentimentDLModel annotator)
  • pos (e.g., output of PerceptronModel annotator)
  • chunk (e.g., output of NerConverter annotator)
  • named_entity (e.g., output of NerDLModel annotator)
  • regex (e.g., output of RegexTokenizer annotator)
  • dependency (e.g., output of DependencyParserModel annotator)
  • language (e.g., output of LanguageDetectorDL annotator)
  • keyword (e.g., output of YakeModel annotator)

Parameters:

  • inputType: The type of the annotations that you want to merge. Possible values.

Input Annotator Types: ANY

Output Annotator Type: ANY

Python API: AnnotationMerger Scala API: AnnotationMerger Notebook: AnnotationMerger
Show Example

from johnsnowlabs import nlp, medical

# Create the pipeline with two RE models
documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

pos_ner_tagger = medical.NerModel()\
    .pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_pos")

pos_ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_pos"])\
    .setOutputCol("pos_ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

pos_reModel = medical.RelationExtractionModel()\
    .pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "pos_ner_chunks", "dependencies"])\
    .setOutputCol("pos_relations")\
    .setMaxSyntacticDistance(4)

ade_ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ade_ner_tags")  

ade_ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ade_ner_tags"])\
    .setOutputCol("ade_ner_chunks")

ade_reModel = medical.RelationExtractionModel()\
    .pretrained("re_ade_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ade_ner_chunks", "dependencies"])\
    .setOutputCol("ade_relations")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade, ade-drug"])

annotation_merger = medical.AnnotationMerger()\
    .setInputCols("ade_relations", "pos_relations")\
    .setInputType("category")\
    .setOutputCol("all_relations")

merger_pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    pos_ner_tagger,
    pos_ner_chunker,
    dependency_parser,
    pos_reModel,
    ade_ner_tagger,
    ade_ner_chunker,
    ade_reModel,
    annotation_merger
])

# Show example result
text = """
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands.. 
"""
data = spark.createDataFrame([[text]]).toDF("text")

result = merger_pipeline.fit(data).transform(data)
result.selectExpr("pos_relations.result as PosologyRelation", 
                  "ade_relations.result as AdeRelation", 
                  "all_relations.result as MergedRelation").show(truncate=False)

+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|PosologyRelation                                         |AdeRelation|MergedRelation                                                 |
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|[DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|[1, 1]     |[1, 1, DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
from johnsnowlabs import nlp, finance

# Create the pipeline with two RE models
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = finance.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_dates")

ner_converter_date = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_dates"])\
    .setOutputCol("ner_chunk_date")

ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_orgs")

ner_converter_org = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_orgs"])\
    .setOutputCol("ner_chunk_org")\

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols('ner_chunk_org', "ner_chunk_date")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE"])\
    .setMaxSyntacticDistance(10)

reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_acq")\
    .setPredictionThreshold(0.1)

reDL_alias = finance.RelationExtractionDLModel().pretrained('finre_org_prod_alias', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_alias")\
    .setPredictionThreshold(0.1)

annotation_merger = finance.AnnotationMerger()\
    .setInputCols("relations_acq", "relations_alias")\
    .setOutputCol("relations")\
    .setInputType("category")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_date,
        ner_converter_date,
        ner_model_org,
        ner_converter_org,
        chunk_merger,
        pos,
        dependency_parser,
        re_filter,
        reDL,
        reDL_alias,
        annotation_merger])


# Show example result
text ="""Definite-lived intangible assets acquired with Cadence’s fiscal 2021 acquisitions were as follows:
 
Acquisition Date Fair Value
Weighted Average Amortization Period
 
(In thousands)
 (in years)
Existing technology
$
59,100 
13.7 years
Agreements and relationships
28,900 
13.7 years
Tradenames, trademarks and patents
4,600 
14.3 years
Total acquired intangibles with definite lives
$
92,600 
13.7 years
2020 Acquisitions
In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation ("AWR") and Integrand Software, Inc. ("Integrand"). These acquisitions enhanced Cadence’s technology portfolio to address growing radio frequency design activity, driven by expanding use of 5G communications.
The aggregate cash consideration for these acquisitions was $195.6 million, after taking into account cash acquired of $1.5 million. The total purchase consideration was allocated to the assets acquired and liabilities assumed based on their respective estimated fair values on the acquisition dates. Cadence will also make payments to certain employees, subject to continued employment and other performance-based conditions, through the first quarter of fiscal 2023.
 With its acquisitions of AWR and Integrand, Cadence recorded $101.3 million of definite-lived intangible assets with a weighted average amortization period of approximately nine years. The definite-lived intangible assets related primarily to existing technology and customer agreements and relationships. Cadence also recorded $119.4 million of goodwill and $25.1 million of net liabilities, consisting primarily of deferred tax liabilities, assumed deferred revenue and trade accounts receivable. The recorded goodwill was primarily related to the acquired assembled workforce and expected synergies from combining operations of the acquired companies with Cadence. None of the goodwill related to the acquisitions of AWR and Integrand is deductible for tax purposes.
Cadence completed one additional acquisition during fiscal 2020 that was not material to the consolidated financial statements. 
Pro Forma Financial Information
Cadence has not presented pro forma financial information for any of the businesses it acquired during fiscal 2021 and fiscal 2020 because the results of operations for these businesses are not material to Cadence’s consolidated financial statements.
Acquisition-Related Transaction Costs
Transaction costs associated with acquisitions, which consist of professional fees and administrative costs, were not material during fiscal 2021, 2020 or 2019 and were expensed as incurred in Cadence’s consolidated income statements.
NOTE 7. GOODWILL AND ACQUIRED INTANGIBLES
Goodwill
The changes in the carrying amount of goodwill during fiscal 2021 and 2020 were as follows:
 
Gross CarryingAmount
 
(In thousands)
Balance as of December 28, 2019
$
661,856 
Goodwill resulting from acquisitions
120,564 
Effect of foreign currency translation
(333)
Balance as of January 2, 2021
782,087 
Goodwill resulting from acquisitions
154,362 
Effect of foreign currency translation
(8,091)
Balance as of January 1, 2022
$
928,358 
Cadence completed its annual goodwill impairment test during the third quarter of fiscal 2021 and determined that the fair value of Cadence’s single reporting unit exceeded the carrying amount of its net assets and that no impairment existed.
65"""

data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)


# Show the results 
result.selectExpr("relations_acq.result as AcqRelation", 
                  "relations_alias.result as AliasRelation", 
                  "relations.result as MergedRelation").show(truncate=False)

+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|AcqRelation                                                                                     |AliasRelation                                                                           |MergedRelation                                                                                                                                                                          |
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other]|[has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

# Create the pipeline with two RE models
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = legal.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model_date = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_parties")

ner_converter_date = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_parties"])\
    .setOutputCol("ner_chunk_parties")

ner_model_org= legal.NerModel.pretrained("legner_whereas_md", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_whereas")

ner_converter_org = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_whereas"])\
    .setOutputCol("ner_chunk_whereas")\

chunk_merger = legal.ChunkMergeApproach()\
    .setInputCols('ner_chunk_whereas', "ner_chunk_parties")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = legal.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setMaxSyntacticDistance(10)

reDL = legal.RelationExtractionDLModel().pretrained("legre_contract_doc_parties_md", "en", "legal/models")\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_parties")\
    .setPredictionThreshold(0.1)

reDL_alias = legal.RelationExtractionDLModel().pretrained("legre_whereas", "en", "legal/models")\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_whereas")\
    .setPredictionThreshold(0.1)

annotation_merger = legal.AnnotationMerger()\
    .setInputCols("relations_parties", "relations_whereas")\
    .setOutputCol("relations")\
    .setInputType("category")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_date,
        ner_converter_date,
        ner_model_org,
        ner_converter_org,
        chunk_merger,
        pos,
        dependency_parser,
        re_filter,
        reDL,
        reDL_alias,
        annotation_merger])


# Show example result
text = """
WHEREAS, the Company Entities own certain Copyrights and Know-How which may be used in the Arizona Field, and in connection with the transactions contemplated by the Stock Purchase Agreement, Arizona desires to obtain a license from the Company Entities to use such Intellectual Property on the terms and subject to the conditions set forth herein.
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

# Show the results 
result.selectExpr("relations_parties.result as PartiesRelation", 
                  "relations_whereas.result as WhereasRelation", 
                  "relations.result as MergedRelation").show(truncate=False)

+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|PartiesRelation              |WhereasRelation                       |MergedRelation                                                     |
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|[signed_by, other, signed_by]|[has_subject, has_subject, has_object]|[signed_by, other, signed_by, has_subject, has_subject, has_object]|
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+

import spark.implicits._

// Create the pipeline with two RE models
val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentences"))
  .setOutputCol("tokens")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("embeddings")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("pos_tags")

val pos_ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ner_pos")

val pos_ner_chunker = new NerConverterInternal()
  .setInputCols(Array("sentences", "tokens", "ner_pos"))
  .setOutputCol("pos_ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentences", "pos_tags", "tokens"))
  .setOutputCol("dependencies")

val pos_reModel = RelationExtractionModel.pretrained("posology_re")
  .setInputCols(Array("embeddings", "pos_tags", "pos_ner_chunks", "dependencies"))
  .setOutputCol("pos_relations")
  .setMaxSyntacticDistance(4)

val ade_ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ade_ner_tags")

val ade_ner_chunker = new NerConverterInternal()
  .setInputCols(Array("sentences", "tokens", "ade_ner_tags"))
  .setOutputCol("ade_ner_chunks")

val ade_reModel = RelationExtractionModel.pretrained("re_ade_clinical", "en", "clinical/models")
  .setInputCols(Array("embeddings", "pos_tags", "ade_ner_chunks", "dependencies"))
  .setOutputCol("ade_relations")
  .setMaxSyntacticDistance(10)
  .setRelationPairs(Array("drug-ade", "ade-drug"))

val annotation_merger = new AnnotationMerger()
  .setInputCols(Array("ade_relations", "pos_relations"))
  .setInputType("category")
  .setOutputCol("all_relations")

val merger_pipeline = new Pipeline().setStages(Array(
  documenter,
  sentencer,
  tokenizer,
  words_embedder,
  pos_tagger,
  pos_ner_tagger,
  pos_ner_chunker,
  dependency_parser,
  pos_reModel,
  ade_ner_tagger,
  ade_ner_chunker,
  ade_reModel,
  annotation_merger
))


// Show example result

val text =
  """
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands..
"""

val data = Seq(text).toDF("text")
val result = merger_pipeline.fit(data).transform(data)

+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|PosologyRelation                                         |AdeRelation|MergedRelation                                                 |
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|[DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|[1, 1]     |[1, 1, DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
import spark.implicits._

// Create the pipeline with two RE models
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model_date = FinanceNerModel.pretrained("finner_sec_dates", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_dates")

val ner_converter_date = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_dates"))
  .setOutputCol("ner_chunk_date")

val ner_model_org = FinanceNerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_orgs")

val ner_converter_org = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_orgs"))
  .setOutputCol("ner_chunk_org")

val chunk_merger = new Chunker()
  .setInputCols(Array("ner_chunk_org", "ner_chunk_date"))
  .setOutputCol("ner_chunk")

val pos = new PerceptronModel()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val dependency_parser = new DependencyParserModel()
  .pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos", "token"))
  .setOutputCol("dependencies")

val re_filter = new RelationExtractionModel()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunk")
  .setRelationPairs(Array("ORG-ORG", "ORG-DATE"))
  .setMaxSyntacticDistance(10)

val reDL = new RelationExtractionModel()
  .pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_acq")
  .setPredictionThreshold(0.1)

val reDL_alias = new RelationExtractionModel()
  .pretrained("finre_org_prod_alias", "en", "finance/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_alias")
  .setPredictionThreshold(0.1)

val annotation_merger = new AnnotationMerger()
  .setInputCols("relations_acq", "relations_alias")
  .setOutputCol("relations")
  .setInputType("category")

val nlpPipeline = new Pipeline().setStages(Array(
  document_assembler,
  text_splitter,
  tokenizer,
  embeddings,
  ner_model_date,
  ner_converter_date,
  ner_model_org,
  ner_converter_org,
  chunk_merger,
  pos,
  dependency_parser,
  re_filter,
  reDL,
  reDL_alias,
  annotation_merger
))

// Show example result
val text = """
Definite-lived intangible assets acquired with Cadence’s fiscal 2021 acquisitions were as follows:
 
Acquisition Date Fair Value
Weighted Average Amortization Period
 
(In thousands)
 (in years)
Existing technology
$
59,100 
13.7 years
Agreements and relationships
28,900 
13.7 years
Tradenames, trademarks and patents
4,600 
14.3 years
Total acquired intangibles with definite lives
$
92,600 
13.7 years
2020 Acquisitions
In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation ("AWR") and Integrand Software, Inc. ("Integrand"). These acquisitions enhanced Cadence’s technology portfolio to address growing radio frequency design activity, driven by expanding use of 5G communications.
The aggregate cash consideration for these acquisitions was $195.6 million, after taking into account cash acquired of $1.5 million. The total purchase consideration was allocated to the assets acquired and liabilities assumed based on their respective estimated fair values on the acquisition dates. Cadence will also make payments to certain employees, subject to continued employment and other performance-based conditions, through the first quarter of fiscal 2023.
 With its acquisitions of AWR and Integrand, Cadence recorded $101.3 million of definite-lived intangible assets with a weighted average amortization period of approximately nine years. The definite-lived intangible assets related primarily to existing technology and customer agreements and relationships. Cadence also recorded $119.4 million of goodwill and $25.1 million of net liabilities, consisting primarily of deferred tax liabilities, assumed deferred revenue and trade accounts receivable. The recorded goodwill was primarily related to the acquired assembled workforce and expected synergies from combining operations of the acquired companies with Cadence. None of the goodwill related to the acquisitions of AWR and Integrand is deductible for tax purposes.
Cadence completed one additional acquisition during fiscal 2020 that was not material to the consolidated financial statements. 
Pro Forma Financial Information
Cadence has not presented pro forma financial information for any of the businesses it acquired during fiscal 2021 and fiscal 2020 because the results of operations for these businesses are not material to Cadence’s consolidated financial statements.
Acquisition-Related Transaction Costs
Transaction costs associated with acquisitions, which consist of professional fees and administrative costs, were not material during fiscal 2021, 2020 or 2019 and were expensed as incurred in Cadence’s consolidated income statements.
NOTE 7. GOODWILL AND ACQUIRED INTANGIBLES
Goodwill
The changes in the carrying amount of goodwill during fiscal 2021 and 2020 were as follows:
 
Gross CarryingAmount
 
(In thousands)
Balance as of December 28, 2019
$
661,856 
Goodwill resulting from acquisitions
120,564 
Effect of foreign currency translation
(333)
Balance as of January 2, 2021
782,087 
Goodwill resulting from acquisitions
154,362 
Effect of foreign currency translation
(8,091)
Balance as of January 1, 2022
$
928,358 
Cadence completed its annual goodwill impairment test during the third quarter of fiscal 2021 and determined that the fair value of Cadence’s single reporting unit exceeded the carrying amount of its net assets and that no impairment existed.
65
"""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|AcqRelation                                                                                     |AliasRelation                                                                           |MergedRelation                                                                                                                                                                          |
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other]|[has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

// Create the pipeline with two RE models
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model_date = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_parties")

val ner_converter_date = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_parties"))
  .setOutputCol("ner_chunk_parties")

val ner_model_org = LegalNerModel.pretrained("legner_whereas_md", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_whereas")

val ner_converter_org = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_whereas"))
  .setOutputCol("ner_chunk_whereas")

val chunk_merger = new Chunker()
  .setInputCols(Array("ner_chunk_whereas", "ner_chunk_parties"))
  .setOutputCol("ner_chunk")

val pos = new PerceptronModel()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val dependency_parser = new DependencyParserModel()
  .pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos", "token"))
  .setOutputCol("dependencies")

val re_filter = new RelationExtractionModel()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunk")
  .setMaxSyntacticDistance(10)

val reDL = new RelationExtractionModel()
  .pretrained("legre_contract_doc_parties_md", "en", "legal/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_parties")
  .setPredictionThreshold(0.1)

val reDL_alias = new RelationExtractionModel()
  .pretrained("legre_whereas", "en", "legal/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_whereas")
  .setPredictionThreshold(0.1)

val annotation_merger = new AnnotationMerger()
  .setInputCols("relations_parties", "relations_whereas")
  .setOutputCol("relations")
  .setInputType("category")

val nlpPipeline = new Pipeline().setStages(Array(
  document_assembler,
  text_splitter,
  tokenizer,
  embeddings,
  ner_model_date,
  ner_converter_date,
  ner_model_org,
  ner_converter_org,
  chunk_merger,
  pos,
  dependency_parser,
  re_filter,
  reDL,
  reDL_alias,
  annotation_merger
))

// Show example result

val text = """WHEREAS, the Company Entities own certain Copyrights and Know-How which may be used in the Arizona Field, and in connection with the transactions contemplated by the Stock Purchase Agreement, Arizona desires to obtain a license from the Company Entities to use such Intellectual Property on the terms and subject to the conditions set forth herein.
"""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|PartiesRelation              |WhereasRelation                       |MergedRelation                                                     |
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|[signed_by, other, signed_by]|[has_subject, has_subject, has_object]|[signed_by, other, signed_by, has_subject, has_subject, has_object]|
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+

AssertionChunkConverter

Model

This annotator creates a CHUNK column with metadata useful for training an Assertion Status Detection model (see AssertionDL).

In some cases, there may be issues while creating the chunk column when using token indices that can lead to loss of data to train assertion status models.

The AssertionChunkConverter annotator uses both the begin and end indices of the tokens as input to add more robust metadata to the chunk column in a way that improves the reliability of the indices and avoids loss of data.

Parameters:

  • chunkBeginCol: (Str) The column containing the start index of the chunk.

  • chunkEndCol: (Str) The column containing the end index of the chunk.

  • chunkTextCol: (Str) The column containing the text chunk.

  • outputTokenBeginCol: (Str) The column containing the selected token start.

  • outputTokenEndCol: (Str) The column containing the selected token end index.

NOTE: Chunk begin and end indices in the assertion status model training dataframe can be populated using the new version of the ALAB module.

Input Annotator Types: TOKEN

Output Annotator Type: CHUNK

Python API: AssertionChunkConverter Scala API: AssertionChunkConverter Notebook: AssertionChunkConverterNootebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("tokens")

converter = medical.AssertionChunkConverter()\
  .setInputCols("tokens")\
  .setChunkTextCol("target")\
  .setChunkBeginCol("char_begin")\
  .setChunkEndCol("char_end")\
  .setOutputTokenBeginCol("token_begin")\
  .setOutputTokenEndCol("token_end")\
  .setOutputCol("chunk")


pipeline = nlp.Pipeline().setStages(
    [document_assembler, 
     sentenceDetector, 
     tokenizer, 
     converter]
)

data = spark.createDataFrame([
    ["An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.","Minnie", 57, 64,],
    ["After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation","PCP",31,34,],
]).toDF("text", "target", "char_begin", "char_end")

results = pipeline.fit(data).transform(data)

results.selectExpr(
    "target",
    "char_begin",
    "char_end",
    "token_begin",
    "token_end",
    "tokens[token_begin].result",
    "tokens[token_end].result",
    "target",
    "chunk",
).show(truncate=False)

+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk                                         |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57        |64      |10         |10       |Minnie                    |Minnie                  |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP   |31        |34      |5          |5        |PCP                       |PCP                     |PCP   |[{chunk, 31, 33, PCP, {sentence -> 0}, []}]   |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("tokens")

converter = finance.AssertionChunkConverter()\
  .setInputCols("tokens")\
  .setChunkTextCol("target")\
  .setChunkBeginCol("char_begin")\
  .setChunkEndCol("char_end")\
  .setOutputTokenBeginCol("token_begin")\
  .setOutputTokenEndCol("token_end")\
  .setOutputCol("chunk")


pipeline = nlp.Pipeline().setStages(
    [document_assembler, 
     sentenceDetector, 
     tokenizer, 
     converter]
)

data = spark.createDataFrame([
    ["Tom Martin worked as Cadence's CTO until 2010","Cadence's CTO",21,33],
    ["Mrs. Charles was before Managing Director at a big consultancy company","Managing Director",24,40],
]).toDF("text", "target", "char_begin", "char_end")

results = pipeline.fit(data).transform(data)

results.selectExpr(
    "target",
    "char_begin",
    "char_end",
    "token_begin",
    "token_end",
    "tokens[token_begin].result",
    "tokens[token_end].result",
    "target",
    "chunk",
).show(truncate=False)

+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
|target           |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target           |chunk                                                    |
+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
|Cadence's CTO    |21        |33      |4          |4        |Cadence's                 |Cadence's               |Cadence's CTO    |[{chunk, 21, 29, Cadence's CTO, {sentence -> 0}, []}]    |
|Managing Director|24        |40      |5          |5        |Managing                  |Managing                |Managing Director|[{chunk, 24, 31, Managing Director, {sentence -> 0}, []}]|
+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("tokens")

converter = legal.AssertionChunkConverter()\
  .setInputCols("tokens")\
  .setChunkTextCol("target")\
  .setChunkBeginCol("char_begin")\
  .setChunkEndCol("char_end")\
  .setOutputTokenBeginCol("token_begin")\
  .setOutputTokenEndCol("token_end")\
  .setOutputCol("chunk")


pipeline = nlp.Pipeline().setStages(
    [document_assembler,
     sentenceDetector,
     tokenizer, 
     converter]
)

data = spark.createDataFrame([
    ["This Agreement may be executed by different parties hereto","parties",44,50,],
    ["The Administrative Agent will determine the Dollar Equivalent amount","Agent",19,23,],
]).toDF("text", "target", "char_begin", "char_end")

results = pipeline.fit(data).transform(data)

results.selectExpr(
    "target",
    "char_begin",
    "char_end",
    "token_begin",
    "token_end",
    "tokens[token_begin].result",
    "tokens[token_end].result",
    "target",
    "chunk",
).show(truncate=False)

+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|target |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target |chunk                                          |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|parties|44        |50      |7          |6        |parties                   |different               |parties|[{chunk, 44, 42, parties, {sentence -> 0}, []}]|
|Agent  |19        |23      |2          |1        |Agent                     |Administrative          |Agent  |[{chunk, 19, 17, Agent, {sentence -> 0}, []}]  |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("tokens")

val converter = new AssertionChunkConverter()
  .setInputCols("tokens")
  .setOutputCol("chunk")
  .setChunkTextCol("target")
  .setChunkBeginCol("char_begin")
  .setChunkEndCol("char_end")
  .setOutputTokenBeginCol("token_begin")
  .setOutputTokenEndCol("token_end")

val pipeline = new Pipeline().setStages(Array(
                                              document_assembler, 
                                              sentenceDetector, 
                                              tokenizer, 
                                              converter))

val data = Seq(Array(
    ("An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.", "Minnie",57,64,),
    ("After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation", "PCP", 31, 34,)
)).toDF("text", "target", "char_begin", "char_end")

val results = pipeline.fit(data).transform(data)

+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk                                         |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57        |64      |10         |10       |Minnie                    |Minnie                  |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP   |31        |34      |5          |5        |PCP                       |PCP                     |PCP   |[{chunk, 31, 33, PCP, {sentence -> 0}, []}]   |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("tokens")

val converter = new AssertionChunkConverter()
  .setInputCols("tokens")
  .setOutputCol("chunk")
  .setChunkTextCol("target")
  .setChunkBeginCol("char_begin")
  .setChunkEndCol("char_end")
  .setOutputTokenBeginCol("token_begin")
  .setOutputTokenEndCol("token_end")

val pipeline = new Pipeline().setStages(Array(
                                              document_assembler, 
                                              sentenceDetector, 
                                              tokenizer, 
                                              converter))

val data = Seq(Array(
    ("Tom Martin worked as Cadence's CTO until 2010","Cadence's CTO",21,33,),
    ("Mrs. Charles was before Managing Director at a big consultancy company", "Managing Director",24, 40,)
)).toDF("text", "target", "char_begin", "char_end")

val results = pipeline.fit(data).transform(data)

+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
|target           |char_begin|char_end|token_begin|token_end|begin|end |begin_result|end_result|result             |
+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
|Cadence's CTO    |21        |33      |4          |4        |[21] |[29]|Cadence's   |Cadence's |[Cadence's CTO]    |
|Managing Director|24        |40      |5          |5        |[24] |[31]|Managing    |Managing  |[Managing Director]|
+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("tokens")

val converter = new AssertionChunkConverter()
  .setInputCols("tokens")
  .setOutputCol("chunk")
  .setChunkTextCol("target")
  .setChunkBeginCol("char_begin")
  .setChunkEndCol("char_end")
  .setOutputTokenBeginCol("token_begin")
  .setOutputTokenEndCol("token_end")

val pipeline = new Pipeline().setStages(Array(
                                              document_assembler, 
                                              sentenceDetector, 
                                              tokenizer, 
                                              converter))

val data = Seq(Array(
    ("Tom Martin worked as Cadence's CTO until 2010","Cadence's CTO", 21,33,),
    ("Mrs. Charles was before Managing Director at a big consultancy company","Managing Director",24,40,)
)).toDF("text", "target", "char_begin", "char_end")
 
val results = pipeline.fit(data).transform(data)

+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|target |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target |chunk                                          |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|parties|44        |50      |7          |6        |parties                   |different               |parties|[{chunk, 44, 42, parties, {sentence -> 0}, []}]|
|Agent  |19        |23      |2          |1        |Agent                     |Administrative          |Agent  |[{chunk, 19, 17, Agent, {sentence -> 0}, []}]  |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+

AssertionDL

AssertionDL is a deep Learning based approach used to extract Assertion Status from extracted entities and text. AssertionDLModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained by e.g a DocumentAssembler, NerConverter and WordEmbeddingsModel. The result is an assertion status annotation for each recognized entity. Possible values include “present”,“absent”,“hypothetical”,“conditional”,“associated_with_other_person” etc.

Parameters:

  • inputCols: Gets current column names of input annotations.

  • outputCol: Gets output column name of annotations.

  • ScopeWindow: Sets the scope of the window of the assertion expression.

  • EntityAssertionCaseSensitive: Sets the case sensitivity of entities and assertion labels.

  • DoExceptionHandling: If it is set as True, the annotator tries to process as usual and ff exception-causing data (e.g. corrupted record/ document) is passed to the annotator, an exception warning is emitted which has the exception message.

For pretrained models please see the Models Hub for available models.

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionDLModel Scala API: AssertionDLModel Notebook: AssertionDLModelNotebook
Show Example

from johnsnowlabs import nlp, medical
# Define pipeline stages to extract NER chunks first
documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

nerModel = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

nerConverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")

# Then a pretrained AssertionDLModel is used to extract the assertion status
clinicalAssertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
  .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
  .setOutputCol("assertion")

assertionPipeline = nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion
])

data = spark.createDataFrame([
  ["Patient with severe fever and sore throat"],
  ["Patient shows no stomach pain"],
  ["She was maintained on an epidural and PCA for pain control."]]).toDF("text")


# Show results
result = assertionPipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as chunk_result", "assertion.result as assertion_result").show(3, truncate=False)

+--------------------------------+--------------------------------+
|chunk_result                    |assertion_result                |
+--------------------------------+--------------------------------+
|[severe fever, sore throat]     |[present, present]              |
|[stomach pain]                  |[absent]                        |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector =  nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer =  nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =  nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\

assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
    
pipeline = nlp.Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion
    ])

data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")


# Show results
result = pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata, result.assertion.result)).alias("cols"))\
      .select(F.expr("cols['1']['sentence']").alias("sent_id"),
              F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['2']").alias("assertion")).show(truncate=False)

+-------+------------+---------+----------+
|sent_id|chunk       |ner_label|assertion |
+-------+------------+---------+----------+
|0      |McAfee LLC  |ORG      |COMPETITOR|
|0      |Broadcom Inc|ORG      |COMPETITOR|
+-------+------------+---------+----------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings_ner")\

ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings_ner"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DOC", "EFFDATE", "PARTY"])

embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings_ass")

assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings_ass"]) \
    .setOutputCol("assertion")

nlpPipeline = nlp.Pipeline(stages=[
            document_assembler, 
            sentence_detector,
            tokenizer,
            embeddings_ner,
            ner_model,
            ner_converter,
            embeddings_ass,
            assertion
            ])

data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")


# Show results
result = nlpPipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,  
                                     result.ner_chunk.begin, 
                                     result.ner_chunk.end, 
                                     result.ner_chunk.metadata, 
                                     result.assertion.result)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['4']").alias("assertion")).show(truncate=False)

+-------------------------------+-----+---+---------+---------+
|chunk                          |begin|end|ner_label|assertion|
+-------------------------------+-----+---+---------+---------+
|Intellectual Property Agreement|11   |41 |DOC      |PRESENT  |
|Amazon Inc                     |51   |60 |PARTY    |PRESENT  |
|Atlantic Inc                   |67   |78 |PARTY    |PRESENT  |
+-------------------------------+-----+---+---------+---------+

import spark.implicits._
// Define pipeline stages to extract NER chunks first

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val nerModel = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val nerConverter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

// Then a pretrained AssertionDLModel is used to extract the assertion status
val clinicalAssertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings"))
    .setOutputCol("assertion")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion
))

val data = Seq(
  "Patient with severe fever and sore throat",
  "Patient shows no stomach pain",
  "She was maintained on an epidural and PCA for pain control.").toDF("text")
  

// Show results
val result = assertionPipeline.fit(data).transform(data)

+--------------------------------+--------------------------------+
|chunk_result                    |assertion_result                |
+--------------------------------+--------------------------------+
|[severe fever, sore throat]     |[present, present]              |
|[stomach pain]                  |[absent]                        |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector =  new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer =  new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings =  BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val assertion = AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings"))
    .setOutputCol("assertion")
    
val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion
    ))

val data = Seq("Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.").toDF("text")


// Show results
val result = pipeline.fit(data).transform(data)

+-------+------------+---------+----------+
|sent_id|chunk       |ner_label|assertion |
+-------+------------+---------+----------+
|0      |McAfee LLC  |ORG      |COMPETITOR|
|0      |Broadcom Inc|ORG      |COMPETITOR|
+-------+------------+---------+----------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings_ner = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ner")

val ner_model = LegalNerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')
    .setInputCols(Array("sentence", "token", "embeddings_ner"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("DOC", "EFFDATE", "PARTY"))

val embeddings_ass = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ass")

val assertion = AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings_ass"))
    .setOutputCol("assertion")
    
val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings_ner,
    ner_model,
    ner_converter,
    embeddings_ass,
    assertion
    ))

val data = Seq("This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc.").toDF("text")


// Show results
val result = pipeline.fit(data).transform(data)

+-------------------------------+-----+---+---------+---------+
|chunk                          |begin|end|ner_label|assertion|
+-------------------------------+-----+---+---------+---------+
|Intellectual Property Agreement|11   |41 |DOC      |PRESENT  |
|Amazon Inc                     |51   |60 |PARTY    |PRESENT  |
|Atlantic Inc                   |67   |78 |PARTY    |PRESENT  |
+-------------------------------+-----+---+---------+---------+

Trains AssertionDL, a deep Learning based approach used to extract Assertion Status from extracted entities and text. Contains all the methods for training an AssertionDLModel. For pretrained models please use AssertionDLModel and see the Models Hub for available models.

Parameters:

  • inputCols: Gets current column names of input annotations.

  • outputCol: Gets output column name of annotations.

  • ScopeWindow: Sets the scope of the window of the assertion expression.

  • StartCol: Set a column that contains the token number for the start of the target.

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionDLApproach Scala API: AssertionDLApproach Notebook: AssertionDLApproachNotebook
Show Example

from johnsnowlabs import nlp, medical

# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk") \
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(True)

token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Define AssertionDLApproach with parameters and start training
assertionStatus = medical.AssertionDLApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setBatchSize(128) \
    .setDropout(0.012) \
    .setLearningRate(0.015) \
    .setEpochs(1) \
    .setStartCol("start") \
    .setEndCol("end") \
    .setMaxSentLen(250)

trainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    assertionStatus
])

assertionResults = trainingPipeline.fit(data).transform(data).cache()
from johnsnowlabs import nlp, finance

# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk")

token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Define AssertionDLApproach with parameters and start training
assertionStatus = finance.AssertionDLApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setBatchSize(128) \
    .setDropout(0.012) \
    .setLearningRate(0.015) \
    .setEpochs(1) \
    .setStartCol("start") \
    .setEndCol("end") \
    .setMaxSentLen(250)

trainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    assertionStatus
])

assertionResults = trainingPipeline.fit(data).transform(data).cache()
from johnsnowlabs import nlp, legal

# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("document")

chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("doc_chunk")

token = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

# Define AssertionDLApproach with parameters and start training
assertionStatus = legal.AssertionDLApproach()\
    .setLabelCol("assertion_label")\
    .setInputCols(["document", "doc_chunk", "embeddings"])\
    .setOutputCol("assertion")\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setEpochs(2)\
    .setStartCol("tkn_start")\
    .setEndCol("tkn_end")\
    .setMaxSentLen(1200)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
    .setScopeWindow(scope_window)
    #.setValidationSplit(0.2)\    
    #.setDropout(0.1)\    

trainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    roberta_embeddings,
    assertionStatus
])

assertionResults = trainingPipeline.fit(data).transform(data).cache()

import spark.implicits._

// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val chunk = new Doc2Chunk()
  .setInputCols(Array("document"))
  .setOutputCol("chunk")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "embeddings"))
  .setOutputCol("assertion")
  .setBatchSize(128)
  .setDropout(0.012)
  .setLearningRate(0.015)
  .setEpochs(1)
  .setStartCol("start")
  .setEndCol("end")
  .setMaxSentLen(250)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  embeddings,
  assertionStatus
))

val assertionResults = trainingPipeline.fit(data).transform(data).cache()
import spark.implicits._
// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val chunk = new Doc2Chunk()
  .setInputCols(Array("document"))
  .setOutputCol("chunk")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "embeddings"))
  .setOutputCol("assertion")
  .setBatchSize(128)
  .setDropout(0.012)
  .setLearningRate(0.015)
  .setEpochs(1)
  .setStartCol("start")
  .setEndCol("end")
  .setMaxSentLen(250)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  embeddings,
  assertionStatus
))

val assertionResults = trainingPipeline.fit(data).transform(data).cache()
import spark.implicits._

val document = new DocumentAssembler()
    .setInputCol("sentence")
    .setOutputCol("document")

val chunk = new Doc2Chunk()
    .setInputCols(Array("document"))
    .setOutputCol("doc_chunk")
    .setChunkCol("chunk")
    .setStartCol("tkn_start")
    .setStartColByTokenIndex(true)
    .setFailOnMissing(false)
    .setLowerCase(false)

val token = new Tokenizer()
    .setInputCols(Array('document'))
    .setOutputCol('token')

val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") 
    .setInputCols(Array("document", "token")) 
    .setOutputCol("embeddings") 
    .setMaxSentenceLength(512)

# Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
    .setLabelCol("assertion_label")
    .setInputCols(Array("document", "doc_chunk", "embeddings"))
    .setOutputCol("assertion")
    .setBatchSize(128)
    .setLearningRate(0.001)
    .setEpochs(2)
    .setStartCol("tkn_start")
    .setEndCol("tkn_end")
    .setMaxSentLen(1200)
    .setEnableOutputLogs(true)
    .setOutputLogsPath('training_logs/')
    .setGraphFolder(graph_folder)
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")
    .setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})
    .setScopeWindow(scope_window)
    #.setValidationSplit(0.2) 
    #.setDropout(0.1) 

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  roberta_embeddings,
  assertionStatus
))

val assertionResults = trainingPipeline.fit(data).transform(data).cache()

AssertionFilterer

Model

Filters entities coming from ASSERTION type annotations and returns the CHUNKS. Filters can be set via a white list on the extracted chunk, the assertion or a regular expression. White list for assertion is enabled by default. To use chunk white list, criteria has to be set to "isin". For regex, criteria has to be set to "regex".

Parameters:

  • whiteList: (list) If defined, list of entities to process. The rest will be ignored.

  • CaseSensitive: (bool) Determines whether the definitions of the white listed entities are case sensitive.

  • regex: (list) List of dash-separated pairs of named entities.

  • criteria: (list) Set tag representing what is the criteria to filter the chunks. possibles values (assertion,isIn,regex). assertion: Filter by the assertion, isIn : Filter by the chunk, regex : Filter using a regex.

  • entitiesConfidence: (Str) Entity pairs to remove based on the confidence level.

Input Annotator Types: DOCUMENT, CHUNK, ASSERTION

Output Annotator Type: CHUNK

Python API: AssertionFilterer Scala API: AssertionFilterer Notebook: AssertionFiltererNotebook
Show Example

from johnsnowlabs import nlp, medical 

# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\
    #.setIncludeAllConfidenceScores(False)

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["PROBLEM", "TEST","TREATMENT"])

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Present"])
#or .setBlackList([["absent"]])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

data = spark.createDataFrame([["Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain."]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)

# Show results:

result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+------------------------------------------------+--------------------------------------------------+
|ner_chunk                                       |assertion                                         |
+------------------------------------------------+--------------------------------------------------+
|[a headache, a head CT, anxious, Alopecia, pain]|[Present, Hypothetical, Possible, Present, Absent]|
+------------------------------------------------+--------------------------------------------------+

result.select("filtered.result").show(3, truncate=False)
+----------------------+
|result                |
+----------------------+
|[a headache, Alopecia]|
+----------------------+
from johnsnowlabs import nlp, finance 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector =  nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer =  nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =  nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\

assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = finance.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Competitor"])

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion,
    assertion_filterer
    ])

data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")

# Show results

result = pipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+--------------------------+------------------------+
|ner_chunk                 |assertion               |
+--------------------------+------------------------+
|[McAfee LLC, Broadcom Inc]|[COMPETITOR, COMPETITOR]|
+--------------------------+------------------------+

result.select("assertion_filtered.result").show(3, truncate=False)
+--------------------------+
|result                    |
+--------------------------+
|[McAfee LLC, Broadcom Inc]|
+--------------------------+
from johnsnowlabs import nlp, legal 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings_ner")\

ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings_ner"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DOC", "EFFDATE", "PARTY"])

embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings_ass")

assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings_ass"]) \
    .setOutputCol("assertion")

assertion_filterer = legal.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Present"])


nlpPipeline = nlp.Pipeline(stages=[
            document_assembler,
            sentence_detector,
            tokenizer,
            embeddings_ner,
            ner_model,
            ner_converter,
            embeddings_ass,
            assertion,
            assertion_filterer
            ])

data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")

# Show results

result = nlpPipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+-----------------------------------------------------------+---------------------------+
|ner_chunk                                                  |assertion                  |
+-----------------------------------------------------------+---------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|[PRESENT, PRESENT, PRESENT]|
+-----------------------------------------------------------+---------------------------+

result.select("assertion_filtered.result").show(3, truncate=False)
+-----------------------------------------------------------+
|result                                                     |
+-----------------------------------------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|
+-----------------------------------------------------------+

import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentenceDetector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// Clinical word embeddings trained on PubMED dataset
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") 
    .setInputCols(Array("sentence", "token", "embeddings")) 
    .setOutputCol("ner")
    //.setIncludeAllConfidenceScores(false)

val ner_converter = new NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("PROBLEM", "TEST","TREATMENT"))

val clinical_assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") 
    .setInputCols(Array("sentence", "ner_chunk", "embeddings")) 
    .setOutputCol("assertion")

val assertion_filterer = new AssertionFilterer()
    .setInputCols("sentence","ner_chunk","assertion")
    .setOutputCol("assertion_filtered")
    .setCaseSensitive(false)
    .setWhiteList(Array("Present"))
//or .setBlackList(Array("absent"))

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
))


val text ="""Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain."""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

// Show results:

+------------------------------------------------+--------------------------------------------------+
|ner_chunk                                       |assertion                                         |
+------------------------------------------------+--------------------------------------------------+
|[a headache, a head CT, anxious, Alopecia, pain]|[Present, Hypothetical, Possible, Present, Absent]|
+------------------------------------------------+--------------------------------------------------+

+----------------------+
|result                |
+----------------------+
|[a headache, Alopecia]|
+----------------------+
import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentence_detector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// Clinical word embeddings trained on PubMED dataset
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models") 
    .setInputCols(Array("sentence", "token", "embeddings")) 
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")

val assertion = AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models") 
    .setInputCols(Array("sentence", "ner_chunk", "embeddings")) 
    .setOutputCol("assertion")

val assertion_filterer = new AssertionFilterer()
    .setInputCols("sentence","ner_chunk","assertion")
    .setOutputCol("assertion_filtered")
    .setCaseSensitive(false)
    .setWhiteList(Array("Competitor"))


val nlpPipeline = new Pipeline().setStages(Array(
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        assertion,
        assertion_filterer
))

val text ="""Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

// Show results:
+--------------------------+------------------------+
|ner_chunk                 |assertion               |
+--------------------------+------------------------+
|[McAfee LLC, Broadcom Inc]|[COMPETITOR, COMPETITOR]|
+--------------------------+------------------------+

+--------------------------+
|result                    |
+--------------------------+
|[McAfee LLC, Broadcom Inc]|
+--------------------------+
import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// Clinical word embeddings trained on PubMED dataset
val embeddings_ner = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ner")

val ner_model = LegalNerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models') 
    .setInputCols(Array("sentence", "token", "embeddings_ner")) 
    .setOutputCol("ner")
    //.setIncludeAllConfidenceScores(false)

val ner_converter = new NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("DOC", "EFFDATE", "PARTY"))

val embeddings_ass = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") 
    .setInputCols(Array("sentence", "token")) 
    .setOutputCol("embeddings_ass")

val assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") 
    .setInputCols(Array("sentence", "ner_chunk", "embeddings_ass")) 
    .setOutputCol("assertion")

val assertion_filterer = new AssertionFilterer()
    .setInputCols("sentence","ner_chunk","assertion")
    .setOutputCol("assertion_filtered")
    .setCaseSensitive(false)
    .setWhiteList(Array("Present"))


val nlpPipeline = new Pipeline().setStages(Array(
      document_assembler,
      sentence_detector,
      tokenizer,
      embeddings_ner,
      ner_model,
      ner_converter,
      embeddings_ass,
      assertion,
      assertion_filterer
))


val text ="""This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

// Show results:
+-----------------------------------------------------------+---------------------------+
|ner_chunk                                                  |assertion                  |
+-----------------------------------------------------------+---------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|[PRESENT, PRESENT, PRESENT]|
+-----------------------------------------------------------+---------------------------+

+-----------------------------------------------------------+
|result                                                     |
+-----------------------------------------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|
+-----------------------------------------------------------+

AssertionLogReg

This is a main class in AssertionLogReg family. Logarithmic Regression is used to extract Assertion Status from extracted entities and text. AssertionLogRegModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained by e.g a DocumentAssembler, NerConverter and WordEmbeddingsModel. The result is an assertion status annotation for each recognized entity. Possible values are "Negated", "Affirmed" and "Historical".

Unlike the DL Model, this class does not extend AnnotatorModel. Instead it extends the RawAnnotator, that’s why the main point of interest is method transform().

At the moment there are no pretrained models available for this class. Please refer to AssertionLogRegApproach to train your own model.

Parametres:

  • setAfter(Int): Length of the context after the target (Default: 13)

  • setBefore(Int): Length of the context before the target (Default: 11)

  • setEndCol(String): Column that contains the token number for the end of the target

  • setStartCol(String): Column that contains the token number for the start of the target

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionLogRegModel Scala API: AssertionLogRegModel Notebook: AssertionLogRegModelNotebook

Trains a classification method, which uses the Logarithmic Regression Algorithm. It is used to extract Assertion Status from extracted entities and text. Contains all the methods for training a AssertionLogRegModel, together with trainWithChunk, trainWithStartEnd.

Parameters:

  • label : Column with label per each token

  • maxIter: This specifies the maximum number of iterations to be performed in the model’s training, default: 26

  • regParam : This specifies the regularization parameter. Regularization helps to control the complexity of the model, aiding in preventing the issue of overfitting.

  • eNetParam : Elastic net parameter

  • beforeParam : Length of the context before the target

  • afterParam : Length of the context after the target

  • startCol : Column that contains the token number for the start of the target

  • endCol : Column that contains the token number for the end of the target

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionLogRegApproach Scala API: AssertionLogRegApproach Notebook: AssertionLogRegApproachNotebook
Show Example

from johnsnowlabs import nlp, medical

# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = medical.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

assertionPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    nerModel,
    nerConverter,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import nlp, finance

# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = finance.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

assertionPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    nerModel,
    nerConverter,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import nlp, legal

# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = legal.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

assertionPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    nerModel,
    nerConverter,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)

import spark.implicits._

// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val glove = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")
  .setCaseSensitive(false)

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "word_embeddings"))
  .setOutputCol("assertion")
  .setReg(0.01)
  .setBefore(11)
  .setAfter(13)
  .setStartCol("start")
  .setEndCol("end")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  assertion
))

val assertionModel = assertionPipeline.fit(dataset)
import spark.implicits._

// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val glove = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")
  .setCaseSensitive(false)

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "word_embeddings"))
  .setOutputCol("assertion")
  .setReg(0.01)
  .setBefore(11)
  .setAfter(13)
  .setStartCol("start")
  .setEndCol("end")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  assertion
))

val assertionModel = assertionPipeline.fit(dataset)
import spark.implicits._

// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val glove = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")
  .setCaseSensitive(false)

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "word_embeddings"))
  .setOutputCol("assertion")
  .setReg(0.01)
  .setBefore(11)
  .setAfter(13)
  .setStartCol("start")
  .setEndCol("end")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  assertion
))

val assertionModel = assertionPipeline.fit(dataset)

AssertionMerger

Model

Merges variety assertion columns coming from Assertion annotators like sparknlp_jsl.annotator.assertion.AssertionDLModel. AssertionMerger can filter, prioritize and merge assertion annotations by using proper parameters. See Also: sparknlp_jsl.annotator.WhiteBlackListParams for filtering options.

Parameters:

  • mergeOverlapping (Bool): Whether to merge overlapping matched assertions.

  • applyFilterBeforeMerge (Bool): Whether to apply filtering before merging.

  • assertionsConfidence (dict[str, float]): Pairs (assertion,confidenceThreshold) to filter assertions which have confidence lower than the confidence threshold.

  • orderingFeatures (list[str]): Specifies the ordering features to use for overlapping entities. Possible values include: ‘begin’, ‘end’, ‘length’, ‘source’, ‘confidence’. Default: [‘begin’, ‘length’, ‘source’]

  • selectionStrategy (str): Determines the strategy for selecting annotations. Annotations can be selected either sequentially based on their order (Sequential) or using a more diverse strategy (DiverseLonger). Currently, only Sequential and DiverseLonger options are available. Default: Sequential.

  • defaultConfidence (float): When the confidence value is included in the orderingFeatures and a given annotation does not have any confidence, this parameter determines the value to be used. The default value is 0.

  • assertionSourcePrecedence (str): Specifies the assertion sources to use for prioritizing overlapping annotations when the ‘source’ ordering feature is utilized. This parameter contains a comma-separated list of assertion sources that drive the prioritization. Annotations will be prioritized based on the order of the given string.

  • sortByBegin (Bool): Whether to sort the annotations by begin at the end of the merge and filter process. Default: False.

  • blackList (list[str]): If defined, list of entities to ignore. The rest will be processed.

  • whiteList (list[str]): If defined, list of entities to process. The rest will be ignored. Do not include IOB prefix on labels.

  • caseSensitive (Bool): Determines whether the definitions of the white listed and black listed entities are case sensitive. Default: True.

Input Annotator Types: ASSERTION

Output Annotator Type: ASSERTION

Python API: AssertionMerger Scala API: AssertionMerger
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_jsl = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_jsl")\
    #.setIncludeAllConfidenceScores(False)

ner_jsl_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_jsl"]) \
    .setOutputCol("ner_jsl_chunk")\
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])

assertion_jsl = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_jsl_chunk", "embeddings"]) \
    .setOutputCol("assertion_jsl")\
    .setEntityAssertionCaseSensitive(False)

ner_clinical = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_clinical")\
    #.setIncludeAllConfidenceScores(False)

ner_clinical_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_clinical"]) \
    .setOutputCol("ner_clinical_chunk")\

assertion_dl = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_clinical_chunk", "embeddings"]) \
    .setOutputCol("assertion_dl")

assertion_merger = medical.AssertionMerger() \
    .setInputCols("assertion_jsl", "assertion_dl") \
    .setOutputCol("assertion_merger") \
    .setMergeOverlapping(True) \
    .setSelectionStrategy("sequential") \
    .setAssertionSourcePrecedence("assertion_dl, assertion_jsl") \
    .setCaseSensitive(False) \
    .setAssertionsConfidence({"past": 0.70}) \
    .setOrderingFeatures(["length", "source", "confidence"]) \
    .setDefaultConfidence(0.50)\
    #.setBlackList(["HYPothetical"])

pipeline = Pipeline( stages =[document_assembler,
                              sentence_detector,
                              tokenizer,
                              word_embeddings,
                              ner_jsl,
                              ner_jsl_converter,
                              assertion_jsl,
                              ner_clinical,
                              ner_clinical_converter,
                              assertion_dl,
                              assertion_merger])

data = spark.createDataFrame([
                        """Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She got antidepressant. We prescribed sleeping pills for her current insomnia."""], StringType()).toDF("text")


data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())
results = pipeline.fit(data).transform(data)

## Result

+---+--------------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk     |begin|end|ner_label|assertion|assertion_source|confidence|
+---+--------------+-----+---+---------+---------+----------------+----------+
|0  |headache      |14   |21 |Symptom  |Past     |assertion_jsl   |0.9999    |
|0  |anxious       |57   |63 |PROBLEM  |present  |assertion_dl    |0.9392    |
|0  |alopecia      |89   |96 |PROBLEM  |absent   |assertion_dl    |0.9992    |
|0  |pain          |116  |119|PROBLEM  |absent   |assertion_dl    |0.9884    |
|0  |paralyzed     |136  |144|Symptom  |Family   |assertion_jsl   |0.9995    |
|0  |stressor      |158  |165|Symptom  |Family   |assertion_jsl   |1.0       |
|0  |antidepressant|184  |197|TREATMENT|present  |assertion_dl    |0.9628    |
|0  |sleeping pills|214  |227|TREATMENT|present  |assertion_dl    |0.998     |
|0  |insomnia      |245  |252|Symptom  |Past     |assertion_jsl   |0.9862    |
+---+--------------+-----+---+---------+---------+----------------+----------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

val sentence_detector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")

val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
      .setInputCols(Array("sentence", "token"))
      .setOutputCol("embeddings")

val ner_jsl = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
      .setInputCols(Array("sentence", "token", "embeddings"))
      .setOutputCol("ner_jsl")
    //.setIncludeAllConfidenceScores(false)

val ner_jsl_converter = new NerConverterInternal()
      .setInputCols(Array("sentence", "token", "ner_jsl"))
      .setOutputCol("ner_jsl_chunk")
      .setWhiteList(Array("SYMPTOM", "VS_FINDING", "DISEASE_SYNDROME_DISORDER", "ADMISSION_DISCHARGE", "PROCEDURE"))

val assertion_jsl = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models")
      .setInputCols(Array("sentence", "ner_jsl_chunk", "embeddings"))
      .setOutputCol("assertion_jsl")
      .setEntityAssertionCaseSensitive(false)

val ner_clinical = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
      .setInputCols(Array("sentence", "token", "embeddings"))
      .setOutputCol("ner_clinical")
    //.setIncludeAllConfidenceScores(false)

val ner_clinical_converter = new NerConverterInternal()
      .setInputCols(Array("sentence", "token", "ner_clinical"))
      .setOutputCol("ner_clinical_chunk")

val assertion_dl = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")
      .setInputCols(Array("sentence", "ner_clinical_chunk", "embeddings"))
      .setOutputCol("assertion_dl")

val assertion_merger = new AssertionMerger()
      .setInputCols("assertion_jsl", "assertion_dl")
      .setOutputCol("assertion_merger")
      .setMergeOverlapping(true)
      .setSelectionStrategy("sequential")
      .setAssertionSourcePrecedence("assertion_dl, assertion_jsl")
      .setCaseSensitive(false)
      .setAssertionsConfidence(Map("past"-> 0.70f))
      .setOrderingFeatures(Array("length", "source", "confidence"))
      .setDefaultConfidence(0.50f)
     // .setBlackList(("HYPothetical"))

val pipeline = new Pipeline().setStages(Array(document_assembler,
          sentence_detector,
          tokenizer,
          word_embeddings,
          ner_jsl,
          ner_jsl_converter,
          assertion_jsl,
          ner_clinical,
          ner_clinical_converter,
          assertion_dl,
          assertion_merger))

val text = "Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She got antidepressant. We prescribed sleeping pills for her current insomnia."

val data = Seq(text).toDF("text")
//val data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())

val results = pipeline.fit(data).transform(data)

// Result

+---+--------------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk     |begin|end|ner_label|assertion|assertion_source|confidence|
+---+--------------+-----+---+---------+---------+----------------+----------+
|0  |headache      |14   |21 |Symptom  |Past     |assertion_jsl   |0.9999    |
|0  |anxious       |57   |63 |PROBLEM  |present  |assertion_dl    |0.9392    |
|0  |alopecia      |89   |96 |PROBLEM  |absent   |assertion_dl    |0.9992    |
|0  |pain          |116  |119|PROBLEM  |absent   |assertion_dl    |0.9884    |
|0  |paralyzed     |136  |144|Symptom  |Family   |assertion_jsl   |0.9995    |
|0  |stressor      |158  |165|Symptom  |Family   |assertion_jsl   |1.0       |
|0  |antidepressant|184  |197|TREATMENT|present  |assertion_dl    |0.9628    |
|0  |sleeping pills|214  |227|TREATMENT|present  |assertion_dl    |0.998     |
|0  |insomnia      |245  |252|Symptom  |Past     |assertion_jsl   |0.9862    |
+---+--------------+-----+---+---------+---------+----------------+----------+

AverageEmbeddings

Model

AverageEmbeddings computes the mean of vector embeddings for two sentences of equal size, producing a unified representation.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputcols().

Input Annotator Types: SENTENCE_EMBEDDINGS, SENTENCE_EMBEDDINGS, CHUNK

Output Annotator Type: EMBEDDINGS

Python API: AverageEmbeddings Scala API: AverageEmbeddings Notebook: AverageEmbeddingsNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler =  nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

doc2Chunk = nlp.Doc2Chunk() \
    .setInputCols("sentence") \
    .setOutputCol("chunk") \
    .setIsArray(True)

sbiobert_base_cased_mli = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols("sentence")\
    .setOutputCol("sbiobert_base_cased_mli")

sent_biobert_clinical_base_cased = nlp.BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en") \
    .setInputCols("sentence") \
    .setOutputCol("sent_biobert_clinical_base_cased")

avg_embeddings = medical.AverageEmbeddings()\
    .setInputCols(["sent_biobert_clinical_base_cased","sbiobert_base_cased_mli","chunk"])\
    .setOutputCol("embeddings")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        doc2Chunk,
        sbiobert_base_cased_mli,
        sent_biobert_clinical_base_cased,
        avg_embeddings
    ])

data = spark.createDataFrame([[" The patient was prescribed 1 capsule of Advil for 5 days "]]).toDF("text")

result = pipeline.fit(data).transform(data)

result_df = result.select(F.explode(F.arrays_zip(result.chunk.result,
                                                 result.chunk.metadata,
                                                 result.sentence.result,
                                                 result.embeddings.embeddings,
                                                 result.sent_biobert_clinical_base_cased.embeddings,
                                                 result.sbiobert_base_cased_mli.embeddings,)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("sentence"),
                          F.expr("cols['1']").alias("sentence_metadata"),
                          F.expr("cols['2']").alias("chunk"),
                          F.expr("cols['3']").alias("embeddings"),
                          F.expr("cols['4']").alias("sent_biobert_clinical_base_cased"),
                          F.expr("cols['5']").alias("sbiobert_base_cased_mli"))

result_df.show(50, truncate=1000)

## Result

+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                          sentence|          sentence_metadata|                                             chunk|                                        embeddings|                  sent_biobert_clinical_base_cased|                           sbiobert_base_cased_mli|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|{sentence -> 0, chunk -> 0}|The patient was prescribed 1 capsule of Advil f...|[0.32466835, 0.12497781, -0.20237188, 0.3716198...|[-0.07857181, -0.061015874, -0.020198729, 0.177...|[0.7279085, 0.3109715, -0.38454503, 0.5657965, ...|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val doc2Chunk = new Doc2Chunk()
  .setInputCols("sentence")
  .setOutputCol("chunk")
  .setIsArray(true)

val sbiobert_base_cased_mli = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
  .setInputCols("sentence")
  .setOutputCol("sbiobert_base_cased_mli")

val sent_biobert_clinical_base_cased = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased","en")
  .setInputCols("sentence")
  .setOutputCol("sent_biobert_clinical_base_cased")

val avg_embeddings = new AverageEmbeddings()
  .setInputCols(Array("sent_biobert_clinical_base_cased","sbiobert_base_cased_mli","chunk"))
  .setOutputCol("embeddings") 

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector, 
    doc2Chunk, 
    sbiobert_base_cased_mli, 
    sent_biobert_clinical_base_cased, 
    avg_embeddings)) 

val data = Seq(" The patient was prescribed 1 capsule of Advil for 5 days").toDF("text")

val result = pipeline.fit(data).transform(data)

// Show results
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                          sentence|          sentence_metadata|                                             chunk|                                        embeddings|                  sent_biobert_clinical_base_cased|                           sbiobert_base_cased_mli|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|{sentence -> 0, chunk -> 0}|The patient was prescribed 1 capsule of Advil f...|[0.32466835, 0.12497781, -0.20237188, 0.3716198...|[-0.07857181, -0.061015874, -0.020198729, 0.177...|[0.7279085, 0.3109715, -0.38454503, 0.5657965, ...|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+

BertForSequenceClassification

Model

BertForSequenceClassification can load Bert Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

Parameters:

  • batchSize’: Size of every batch (default: 8).

  • coalesceSentences’: Instead of 1 class per sentence (if inputCols is “sentence” output 1 class per document by averaging probabilities in all sentences (default: False).

  • maxSentenceLength’: Max sentence length to process (default: 128).

  • caseSensitive’: Whether to ignore case in tokens for embeddings matching (default: True)

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: BertForSequenceClassification Scala API: BertForSequenceClassification
Show Example

from johnsnowlabs import nlp, medical
 
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("classes")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])


text =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
     ["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]

data = spark.createDataFrame(text).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("text", "classes.result").show(2,truncate=100)

| text                                                                                           | result |
|------------------------------------------------------------------------------------------------|-------|
| Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair... | [False] |
| Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep... | [False] |

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("classes")

val pipeline =  new Pipeline().setStages(Array(
    document_assembler, 
    tokenizer, 
    sequenceClassifier))

val text = List(
  List("Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."),
  List("Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker.")
)

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

| text                                                                                           | result |
|------------------------------------------------------------------------------------------------|-------|
| Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair... | [False] |
| Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep... | [False] |

BertForTokenClassifier

Model

BertForTokenClassifier can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) for Named-Entity-Recognition (NER) tasks.\

Parameters:

  • CaseSensitive (Boolean): Whether to lowercase tokens or not (Default: False).

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: BertForTokenClassifier Scala API: BertForTokenClassifier
Show Example

from johnsnowlabs import nlp, medical
 
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

tokenClassifier = medical.BertForTokenClassification.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")\
    .setInputCols("token", "sentence")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline =  nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter
    ])


text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation .
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""

data = spark.createDataFrame([[text]]).toDF("text")

res = pipeline.fit(data).transform(data)

res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \
   .select(F.expr("cols['3']['sentence']").alias("sentence_id"),
           F.expr("cols['0']").alias("chunk"),
           F.expr("cols['2']").alias("end"),
           F.expr("cols['3']['entity']").alias("ner_label"))\
   .filter("ner_label!='O'")\
   .show(truncate=False)

## Result
+-----------+-----------------------------+---+---------+
|sentence_id|chunk                        |end|ner_label|
+-----------+-----------------------------+---+---------+
|0          |gestational diabetes mellitus|67 |PROBLEM  |
|0          |type two diabetes mellitus   |153|PROBLEM  |
|0          |T2DM                         |160|PROBLEM  |
|0          |HTG-induced pancreatitis     |209|PROBLEM  |
|0          |an acute hepatitis           |280|PROBLEM  |
|0          |obesity                      |294|PROBLEM  |
|0          |a body mass index            |317|TEST     |
|0          |BMI                          |323|TEST     |
|0          |polyuria                     |387|PROBLEM  |
|0          |polydipsia                   |400|PROBLEM  |
|0          |poor appetite                |416|PROBLEM  |
|0          |vomiting                     |431|PROBLEM  |
|1          |amoxicillin                  |521|TREATMENT|
|1          |a respiratory tract infection|555|PROBLEM  |
|2          |metformin                    |578|TREATMENT|
|2          |glipizide                    |590|TREATMENT|
|2          |dapagliflozin                |610|TREATMENT|
|2          |T2DM                         |619|PROBLEM  |
|2          |atorvastatin                 |636|TREATMENT|
|2          |gemfibrozil                  |652|TREATMENT|
+-----------+-----------------------------+---+---------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("ner")
    .setCaseSensitive(True)

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val pipeline =  new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector, 
    tokenizer, 
    tokenClassifier, 
    ner_converter))

val text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation .
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

// Result

+-----------+-----------------------------+---+---------+
|sentence_id|chunk                        |end|ner_label|
+-----------+-----------------------------+---+---------+
|0          |gestational diabetes mellitus|67 |PROBLEM  |
|0          |type two diabetes mellitus   |153|PROBLEM  |
|0          |T2DM                         |160|PROBLEM  |
|0          |HTG-induced pancreatitis     |209|PROBLEM  |
|0          |an acute hepatitis           |280|PROBLEM  |
|0          |obesity                      |294|PROBLEM  |
|0          |a body mass index            |317|TEST     |
|0          |BMI                          |323|TEST     |
|0          |polyuria                     |387|PROBLEM  |
|0          |polydipsia                   |400|PROBLEM  |
|0          |poor appetite                |416|PROBLEM  |
|0          |vomiting                     |431|PROBLEM  |
|1          |amoxicillin                  |521|TREATMENT|
|1          |a respiratory tract infection|555|PROBLEM  |
|2          |metformin                    |578|TREATMENT|
|2          |glipizide                    |590|TREATMENT|
|2          |dapagliflozin                |610|TREATMENT|
|2          |T2DM                         |619|PROBLEM  |
|2          |atorvastatin                 |636|TREATMENT|
|2          |gemfibrozil                  |652|TREATMENT|
+-----------+-----------------------------+---+---------+

BertSentenceChunkEmbeddings

Model

This annotator allows aggregating sentence embeddings with ner chunk embeddings to get specific and more accurate resolution codes. It works by averaging sentence and chunk embeddings add contextual information in the embedding value. Input to this annotator is the context (sentence) and ner chunks, while the output is embedding for each chunk that can be fed to the resolver model.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

  • chunkWeight: Relative weight of chunk embeddings in comparison to sentence embeddings. The value should between 0 and 1. The default is 0.5, which means the chunk and sentence embeddings are given equal weight.

  • setMaxSentenceLength: Sets max sentence length to process, by default 128.

  • caseSensitive: Determines whether the definitions of the white listed entities are case sensitive.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputcols().

For more information and examples of BertSentenceChunkEmbeddings annotator, you can check the Spark NLP Workshop, and in special, the notebook 24.1.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb.

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: BertSentenceChunkEmbeddings Scala API: BertSentenceChunkEmbeddings Notebook: BertSentenceChunkEmbeddingsNotebook
Show Example

from johnsnowlabs import nlp, medical
# Define the pipeline

document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["document", "token"])\
      .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \
      .setInputCols(["document", "token", "word_embeddings"]) \
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['ABBR'])

sentence_chunk_embeddings = medical.BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
      .setInputCols(["document", "ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setChunkWeight(0.5)\
      .setCaseSensitive(True)
    
resolver_pipeline = nlp.Pipeline(
    stages = [
      document_assembler,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      sentence_chunk_embeddings
])


sample_text = [
"""The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991.""",
"""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. 
Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]

from pyspark.sql.types import StringType, IntegerType

df = spark.createDataFrame(sample_text, StringType()).toDF('text')
result = resolver_pipeline.fit(df).transform(df)

result.selectExpr("explode(sentence_embeddings) AS s")\
      .selectExpr("s.result", "slice(s.embeddings, 1, 5) AS averageEmbedding")\
      .show(truncate=False)

+------+--------------------------------------------------------------+
|result|averageEmbedding                                              |
+------+--------------------------------------------------------------+
|IR    |[0.11792798, 0.36022937, -1.0620842, 0.87576616, 0.5389829]   |
|CBC   |[-0.07262431, -0.671684, 0.009878114, 0.76053196, 0.4687413]  |
|AB    |[-0.2781681, -0.43619046, -0.20924012, 0.84943366, 0.40831584]|
|VDRL  |[-0.07109344, -0.20644212, 0.0367461, 0.43459156, 0.3684616]  |
|HIV   |[-0.1740405, -0.4599509, -0.041505605, 0.61368394, 0.66777927]|
+------+--------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("tokens")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
      .setInputCols(Array("document", "tokens"))
      .setOutputCol("word_embeddings")

val nerModel = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models")
      .setInputCols(Array("document", "tokens", "word_embeddings"))
      .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
      .setInputCols("document", "tokens", "ner")
      .setOutputCol("ner_chunk")
      .setWhiteList(Array('ABBR'))

val sentenceChunkEmbeddings = BertSentenceChunkEmbeddings.pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")
      .setInputCols(Array("document", "ner_chunk"))
      .setOutputCol("sentence_embeddings")
      .setChunkWeight(0.5)
      .setCaseSensitive(True)

val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      wordEmbeddings,
      nerModel,
      nerConverter,
      sentenceChunkEmbeddings))

val sampleText = "The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991." +
"Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. 
Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."

val data = Seq(sampleText).toDF("sampleText")
val result = pipeline.fit(data).transform(data)

+------+--------------------------------------------------------------+
|result|averageEmbedding                                              |
+------+--------------------------------------------------------------+
|IR    |[0.11792798, 0.36022937, -1.0620842, 0.87576616, 0.5389829]   |
|CBC   |[-0.07262431, -0.671684, 0.009878114, 0.76053196, 0.4687413]  |
|AB    |[-0.2781681, -0.43619046, -0.20924012, 0.84943366, 0.40831584]|
|VDRL  |[-0.07109344, -0.20644212, 0.0367461, 0.43459156, 0.3684616]  |
|HIV   |[-0.1740405, -0.4599509, -0.041505605, 0.61368394, 0.66777927]|
+------+--------------------------------------------------------------+

Chunk2Token

Model

A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).

When the input is empty, an empty array is returned.

This Annotator is specially convenient when using NGramGenerator annotations as inputs to WordEmbeddingsModels.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputcols().

Input Annotator Types: CHUNK

Output Annotator Type: TOKEN

Python API: Chunk2Token Scala API: Chunk2Token Notebook: Chunk2TokenNotebook
Show Example

from johnsnowlabs import nlp, medical 
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ngrammer = nlp.NGramGenerator() \
    .setN(2) \
    .setEnableCumulative(False) \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") \
    .setDelimiter("_")

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = medical.Chunk2Token()\
    .setInputCols(["ngrams"])\
    .setOutputCol("ngram_tokens")

trainingPipeline = nlp.Pipeline(stages=[
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token])

data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
result = trainingPipeline.fit(data).transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, False)

+----------------------------------------------------------------+
|col                                                             |
+----------------------------------------------------------------+
|{token, 0, 12, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
|{token, 2, 16, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 14, 25, man_presents, {sentence -> 0, chunk -> 2}, []}  |
|{token, 18, 28, presents_to, {sentence -> 0, chunk -> 3}, []}   |
|{token, 27, 32, to_the, {sentence -> 0, chunk -> 4}, []}        |
+----------------------------------------------------------------+
from johnsnowlabs import nlp, finance

# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ngrammer = nlp.NGramGenerator() \
    .setN(2) \
    .setEnableCumulative(False) \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") 

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = finance.Chunk2Token()\
    .setInputCols(["ngrams"])\
    .setOutputCol("ngram_tokens")

trainingPipeline = nlp.Pipeline(stages=[
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token])

data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")

result = trainingPipeline.fit(data).transform(data)
result.selectExpr("explode(ngram_tokens)").show(5, False)

+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|{token, 0, 14, Our competitors, {sentence -> 0, chunk -> 0}, []}    |
|{token, 4, 22, competitors include, {sentence -> 0, chunk -> 1}, []}|
|{token, 16, 26, include the, {sentence -> 0, chunk -> 2}, []}       |
|{token, 24, 36, the following, {sentence -> 0, chunk -> 3}, []}     |
|{token, 28, 39, following by, {sentence -> 0, chunk -> 4}, []}      |
+--------------------------------------------------------------------+
from johnsnowlabs import nlp, legal 
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ngrammer = nlp.NGramGenerator() \
    .setN(2) \
    .setEnableCumulative(False) \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") 

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = legal.Chunk2Token()\
    .setInputCols(["ngrams"])\
    .setOutputCol("ngram_tokens")

trainingPipeline = nlp.Pipeline(stages=[
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token])

data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")
result = trainingPipeline.fit(data).transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, False)

+-----------------------------------------------------------------------+
|col                                                                    |
+-----------------------------------------------------------------------+
|{token, 0, 6, This is, {sentence -> 0, chunk -> 0}, []}                |
|{token, 5, 9, is an, {sentence -> 0, chunk -> 1}, []}                  |
|{token, 8, 22, an Intellectual, {sentence -> 0, chunk -> 2}, []}       |
|{token, 11, 31, Intellectual Property, {sentence -> 0, chunk -> 3}, []}|
|{token, 24, 41, Property Agreement, {sentence -> 0, chunk -> 4}, []}   |
+-----------------------------------------------------------------------+

import spark.implicits._

// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val ngrammer = new NGramGenerator()
    .setN(2)
    .setEnableCumulative(false)
    .setInputCols("token")
    .setOutputCol("ngrams")
    .setDelimiter("_")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
    .setInputCols("ngrams")
    .setOutputCol("ngram_tokens")

val trainingPipeline = new Pipeline().setStages(Array(
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token))

val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")

val result = trainingPipeline.fit(data).transform(data)

+----------------------------------------------------------------+
|col                                                             |
+----------------------------------------------------------------+
|{token, 3, 15, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
|{token, 5, 19, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 17, 28, man_presents, {sentence -> 0, chunk -> 2}, []}  |
|{token, 21, 31, presents_to, {sentence -> 0, chunk -> 3}, []}   |
|{token, 30, 35, to_the, {sentence -> 0, chunk -> 4}, []}        |
+----------------------------------------------------------------+
import spark.implicits._

// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val ngrammer = new NGramGenerator()
    .setN(2)
    .setEnableCumulative(false)
    .setInputCols("token")
    .setOutputCol("ngrams")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
    .setInputCols("ngrams")
    .setOutputCol("ngram_tokens")

val trainingPipeline = new Pipeline().setStages(Array(
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token))

val data = Seq(("Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.")).toDF("text")

val result = trainingPipeline.fit(data).transform(data)

+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|{token, 0, 14, Our competitors, {sentence -> 0, chunk -> 0}, []}    |
|{token, 4, 22, competitors include, {sentence -> 0, chunk -> 1}, []}|
|{token, 16, 26, include the, {sentence -> 0, chunk -> 2}, []}       |
|{token, 24, 36, the following, {sentence -> 0, chunk -> 3}, []}     |
|{token, 28, 39, following by, {sentence -> 0, chunk -> 4}, []}      |
+--------------------------------------------------------------------+
import spark.implicits._

// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val ngrammer = new NGramGenerator()
    .setN(2)
    .setEnableCumulative(false)
    .setInputCols("token")
    .setOutputCol("ngrams")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
    .setInputCols("ngrams")
    .setOutputCol("ngram_tokens")

val trainingPipeline = new Pipeline().setStages(Array(
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token))

val data = Seq(("This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc.")).toDF("text")

val result = trainingPipeline.fit(data).transform(data)

+-----------------------------------------------------------------------+
|col                                                                    |
+-----------------------------------------------------------------------+
|{token, 0, 6, This is, {sentence -> 0, chunk -> 0}, []}                |
|{token, 5, 9, is an, {sentence -> 0, chunk -> 1}, []}                  |
|{token, 8, 22, an Intellectual, {sentence -> 0, chunk -> 2}, []}       |
|{token, 11, 31, Intellectual Property, {sentence -> 0, chunk -> 3}, []}|
|{token, 24, 41, Property Agreement, {sentence -> 0, chunk -> 4}, []}   |
+-----------------------------------------------------------------------+

ChunkConverter

Model

Convert chunks from RegexMatcher to chunks with a entity in the metadata.

This annotator is important when the user wants to merge entities identified by NER models together with rules-based matching used by the RegexMathcer annotator. In the following steps of the pipeline, all the identified entities can be treated in a unified field.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputcols().

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkConverter Scala API: ChunkConverter Notebook: ChunkConverterNotebook
Show Example

from johnsnowlabs import nlp, medical

# Creating the pipeline
rules = '''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

sample_text = """
POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy.
PROCEDURE:  Excisional biopsy of right cervical lymph node.
ANESTHESIA:  General endotracheal anesthesia.
Specimen:  Right cervical lymph node.
EBL: 10 cc.
COMPLICATIONS:  None.
FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination.
FLUIDS:  Please see anesthesia report.
URINE OUTPUT:  None recorded during the case.
INDICATIONS FOR PROCEDURE:  This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient, an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node.
PROCEDURE IN DETAIL:  The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again, noted on palpation there was an enlarged level 2 cervical lymph node.A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified, and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision, and sterile bandage was applied. The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed.
"""

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models") \
    .setInputCols("sentence","token","embeddings") \
    .setOutputCol("ner")

ner_converter= medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\

regex_matcher = nlp.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='/content/regex_rules.txt', delimiter=',')

chunkConverter = medical.ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

merger= medical.ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    regex_matcher,
    chunkConverter,
    merger
])

data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result, 
                                     result.merged_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)

+----------------------------------------------+--------------+
|                                         chunk| merged_entity|
+----------------------------------------------+--------------+
|                      POSTOPERATIVE DIAGNOSIS:|SECTION_HEADER|
|                      Cervical lymphadenopathy|       PROBLEM|
|                                    PROCEDURE:|SECTION_HEADER|
|Excisional biopsy of right cervical lymph node|          TEST|
|                                   ANESTHESIA:|SECTION_HEADER|
|               General endotracheal anesthesia|     TREATMENT|
|                     Right cervical lymph node|       PROBLEM|
|                                          EBL:|SECTION_HEADER|
|                                COMPLICATIONS:|SECTION_HEADER|
|                                     FINDINGS:|SECTION_HEADER|
|                   Enlarged level 2 lymph node|       PROBLEM|
|                        pathologic examination|          TEST|
|                                       FLUIDS:|SECTION_HEADER|
|                                 URINE OUTPUT:|SECTION_HEADER|
|                    INDICATIONS FOR PROCEDURE:|SECTION_HEADER|
+----------------------------------------------+--------------+
from johnsnowlabs import nlp, finance

# Creating the pipeline
rules = '''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

sample_text="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. """

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")
    # .setWhiteList(["ORG"]) # Return only ORG entities

regex_matcher = nlp.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='/content/regex_rules.txt', delimiter=',')

chunkConverter = finance.ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

merger= finance.ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    regex_matcher,
    chunkConverter,
    merger
])

data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result, 
                                     result.merged_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)

+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+
from johnsnowlabs import nlp, legal

# Creating the pipeline
rules = '''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

sample_text="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. """

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")
    # .setWhiteList(["ORG"]) # Return only ORG entities

regex_matcher = nlp.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='/content/regex_rules.txt', delimiter=',')

chunkConverter = legal.ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

merger= legal.ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    regex_matcher,
    chunkConverter,
    merger
])

data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result, 
                                     result.merged_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)

+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+

// val rules = """\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER""" 
// with open("regex_rules.txt","w") as f: 
//    f.write(rules) 


import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
    .setInputCols(Array("document")) 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence")) 
    .setOutputCol("token") 

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("embeddings") 

val ner_model = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models")
    .setInputCols("sentence","token","embeddings") 
    .setOutputCol("ner") 

val ner_converter= new NerConverterInternal() 
    .setInputCols(Array("sentence","token","ner")) 
    .setOutputCol("ner_chunk") 

val regex_matcher = new RegexMatcher()
    .setInputCols("document") 
    .setStrategy("MATCH_ALL") 
    .setOutputCol("regex_matches") 
    .setExternalRules(path="/content/regex_rules.txt",delimiter=",") 

val chunkConverter = new ChunkConverter()
    .setInputCols("regex_matches") 
    .setOutputCol("regex_chunk") 

val merger= new ChunkMergeApproach() 
    .setInputCols(Array("regex_chunk","ner_chunk")) 
    .setOutputCol("merged_chunks") 
    .setMergeOverlapping(true) 
    .setChunkPrecedence("field") 

val pipeline= new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings,
    ner_model, 
    ner_converter, 
    regex_matcher, 
    chunkConverter, 
    merger )) 

val data = Seq(("POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy. PROCEDURE: Excisional biopsy of right cervical lymph node. ANESTHESIA: General endotracheal anesthesia. Specimen: Right cervical lymph node. EBL: 10 cc. COMPLICATIONS: None. FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination. FLUIDS: Please see anesthesia report. URINE OUTPUT: None recorded during the case. INDICATIONS FOR PROCEDURE: This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient,an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node. PROCEDURE IN DETAIL: The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again,noted on palpation there was an enlarged level 2 cervical lymph node.A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified,and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision,and sterile bandage was applied. The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed.")).toDF("text")

val result = pipeline.fit(data).transform(data)

+----------------------------------------------+--------------+
|                                         chunk| merged_entity|
+----------------------------------------------+--------------+
|                      POSTOPERATIVE DIAGNOSIS:|SECTION_HEADER|
|                      Cervical lymphadenopathy|       PROBLEM|
|                                    PROCEDURE:|SECTION_HEADER|
|Excisional biopsy of right cervical lymph node|          TEST|
|                                   ANESTHESIA:|SECTION_HEADER|
|               General endotracheal anesthesia|     TREATMENT|
|                     Right cervical lymph node|       PROBLEM|
|                                          EBL:|SECTION_HEADER|
|                                COMPLICATIONS:|SECTION_HEADER|
|                                     FINDINGS:|SECTION_HEADER|
|                   Enlarged level 2 lymph node|       PROBLEM|
|                        pathologic examination|          TEST|
|                                       FLUIDS:|SECTION_HEADER|
|                                 URINE OUTPUT:|SECTION_HEADER|
|                    INDICATIONS FOR PROCEDURE:|SECTION_HEADER|
+----------------------------------------------+--------------+
// val rules = """\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER""" 
// with open("regex_rules.txt","w") as f: 
//    f.write(rules) 

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols(Array("document")) 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence")) 
    .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence","token")) 
    .setOutputCol("embeddings") 

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
    .setInputCols(Array("sentence","token","embeddings")) 
    .setOutputCol("ner") 

val ner_converter= new NerConverterInternal()  
    .setInputCols(Array("sentence","token","ner")) 
    .setOutputCol("ner_chunk") 
    // .setWhiteList(Array("ORG")) 

// Return only ORG entities 
val regex_matcher = new RegexMatcher()
    .setInputCols("document") 
    .setStrategy("MATCH_ALL") 
    .setOutputCol("regex_matches") 
    .setExternalRules(path="/content/regex_rules.txt",delimiter=",") 

val chunkConverter = new ChunkConverter()
    .setInputCols("regex_matches") 
    .setOutputCol("regex_chunk") 
 
val merger= new ChunkMergeApproach() 
    .setInputCols(Array("regex_chunk","ner_chunk")) 
    .setOutputCol("merged_chunks") 
    .setMergeOverlapping(true) 
    .setChunkPrecedence("field") 

val pipeline= new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    ner_model, 
    ner_converter, 
    regex_matcher, 
    chunkConverter, 
    merger )) 
    
val data = Seq(("AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price.")).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+
// val rules = """[A-Z]+[\s+[A-Z]+]*,SECTION_HEADER """ 
// with open("regex_rules.txt","w") as f: 
//    f.write(rules) 

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols(Array("document")) 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence")) 
    .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence","token")) 
    .setOutputCol("embeddings") 

val ner_model = LegalNerModel.pretrained("legner_org_per_role_date", "en", "legal/models")
    .setInputCols(Array("sentence","token","embeddings")) 
    .setOutputCol("ner") 

val ner_converter= new NerConverterInternal()  
    .setInputCols(Array("sentence","token","ner")) 
    .setOutputCol("ner_chunk") 
    // .setWhiteList(Array("ORG")) 

// Return only ORG entities 
val regex_matcher = new RegexMatcher()
    .setInputCols("document") 
    .setStrategy("MATCH_ALL") 
    .setOutputCol("regex_matches") 
    .setExternalRules(path="/content/regex_rules.txt",delimiter=",") 

val chunkConverter = new ChunkConverter()
    .setInputCols("regex_matches") 
    .setOutputCol("regex_chunk") 
 
val merger= new ChunkMergeApproach() 
    .setInputCols(Array("regex_chunk","ner_chunk")) 
    .setOutputCol("merged_chunks") 
    .setMergeOverlapping(true) 
    .setChunkPrecedence("field") 

val pipeline= new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    ner_model, 
    ner_converter, 
    regex_matcher, 
    chunkConverter, 
    merger )) 
    
val data = Seq(("AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price.")).toDF("text")

val result = pipeline.fit(data).transform(data)
                          
+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+

ChunkEntityResolver

The ChunkEntityResolverModel encompasses the functionality to produce a normalized entity from a specialized ontology or curated dataset (such as ICD-10, RxNorm, SNOMED, etc.). This model includes comprehensive parameters and methods essential for its training. It operates by transforming a dataset that incorporates two Input Annotations: TOKEN and WORD_EMBEDDINGS, sourced from tools like ChunkTokenizer and ChunkEmbeddings Annotators. Ultimately, it generates the normalized entity relevant to the specified trained ontology or curated dataset, ensuring accurate entity resolution within the given context.

For available pretrained models please see the Models Hub.

Importand Note: This annotator has been deprecated.

Input Annotator Types: TOKEN, WORD_EMBEDDINGS

Output Annotator Type: ENTITY

Scala API: ChunkEntityResolverModel
Show Example

from johnsnowlabs import nlp, medical
# Using pretrained models for SNOMED
# First the prior steps of the pipeline are defined.
# Output of types TOKEN and WORD_EMBEDDINGS are needed.
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")

docAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("word_embeddings")

icdo_ner = medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("icdo_ner")

icdo_chunk = nlp.NerConverter()\
    .setInputCols(["sentence","token","icdo_ner"])\
    .setOutputCol("icdo_chunk")\
    .setWhiteList(["Cancer"])

icdo_chunk_embeddings = nlp.ChunkEmbeddings() \
    .setInputCols(["icdo_chunk", "word_embeddings"]) \
    .setOutputCol("icdo_chunk_embeddings")

icdo_chunk_resolver = medical.ChunkEntityResolverModel\
    .pretrained("chunkresolve_icdo_clinical", "en", "clinical/models") \
    .setInputCols(["token","icdo_chunk_embeddings"]) \
    .setOutputCol("tm_icdo_code")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

ner_chunk_tokenizer = nlp.ChunkTokenizer() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("ner_token")
    
ner_chunk_embeddings = nlp.ChunkEmbeddings() \
    .setInputCols(["ner_chunk", "word_embeddings"]) \
    .setOutputCol("ner_chunk_embeddings")

# Definition of the SNOMED Resolution
ner_snomed_resolver = medical.ChunkEntityResolverModel\
    .pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models") \
    .setInputCols(["ner_token","ner_chunk_embeddings"])\
    .setOutputCol("snomed_result")

pipelineFull = nlp.Pipeline().setStages([
      docAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,

      clinical_ner,
      ner_converter,
      ner_chunk_embeddings,
      ner_chunk_tokenizer,
      ner_snomed_resolver,

      icdo_ner,
      icdo_chunk,
      icdo_chunk_embeddings,
      icdo_chunk_resolver
])
pipelineModelFull = pipelineFull.fit(data)
result = pipelineModelFull.transform(data).cache()

# Show results
result.selectExpr("explode(snomed_result)")
  .selectExpr(
    "col.metadata.target_text",
    "col.metadata.resolved_text",
    "col.metadata.confidence",
    "col.metadata.all_k_results",
    "col.metadata.all_k_resolutions")
  .filter($"confidence" > 0.2).show(5)
+--------------------+--------------------+----------+--------------------+--------------------+
|         target_text|       resolved_text|confidence|       all_k_results|   all_k_resolutions|
+--------------------+--------------------+----------+--------------------+--------------------+
|hypercholesterolemia|Hypercholesterolemia|    0.2524|13644009:::267432...|Hypercholesterole...|
|                 CBC|             Neocyte|    0.4980|259680000:::11573...|Neocyte:::Blood g...|
|                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
|           platelets| Increased platelets|    0.5267|6631009:::2596800...|Increased platele...|
|                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
+--------------------+--------------------+----------+--------------------+--------------------+

import spark.implicits._
// Using pretrained models for SNOMED
// First the prior steps of the pipeline are defined.
// Output of types TOKEN and WORD_EMBEDDINGS are needed.
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")

val docAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("word_embeddings")

val icdo_ner = MedicalNerModel.pretrained("ner_bionlp", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "word_embeddings"))
    .setOutputCol("icdo_ner")

val icdo_chunk = new NerConverter()
    .setInputCols(Array("sentence","token","icdo_ner"))
    .setOutputCol("icdo_chunk")
    .setWhiteList("Cancer")

val icdo_chunk_embeddings = new ChunkEmbeddings()
    .setInputCols(Array("icdo_chunk", "word_embeddings"))
    .setOutputCol("icdo_chunk_embeddings")

val icdo_chunk_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical", "en", "clinical/models")
    .setInputCols(Array("token","icdo_chunk_embeddings"))
    .setOutputCol("tm_icdo_code")

val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "word_embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val ner_chunk_tokenizer = new ChunkTokenizer()
    .setInputCols("ner_chunk")
    .setOutputCol("ner_token")
  
val ner_chunk_embeddings = new ChunkEmbeddings()
    .setInputCols(Array("ner_chunk", "word_embeddings"))
    .setOutputCol("ner_chunk_embeddings")

// Definition of the SNOMED Resolution
val ner_snomed_resolver = ChunkEntityResolverModel\
    .pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")
    .setInputCols(Array("ner_token","ner_chunk_embeddings"))\
    .setOutputCol("snomed_result")

val pipelineFull = new Pipeline().setStages(Array(
    docAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,

    clinical_ner,
    ner_converter,
    ner_chunk_embeddings,
    ner_chunk_tokenizer,
    ner_snomed_resolver,

    icdo_ner,
    icdo_chunk,
    icdo_chunk_embeddings,
    icdo_chunk_resolver
))
val pipelineModelFull = pipelineFull.fit(data)
val result = pipelineModelFull.transform(data).cache()

// Show results
//
// result.selectExpr("explode(snomed_result)")
//   .selectExpr(
//     "col.metadata.target_text",
//     "col.metadata.resolved_text",
//     "col.metadata.confidence",
//     "col.metadata.all_k_results",
//     "col.metadata.all_k_resolutions")
//   .filter($"confidence" > 0.2).show(5)
// +--------------------+--------------------+----------+--------------------+--------------------+
// |         target_text|       resolved_text|confidence|       all_k_results|   all_k_resolutions|
// +--------------------+--------------------+----------+--------------------+--------------------+
// |hypercholesterolemia|Hypercholesterolemia|    0.2524|13644009:::267432...|Hypercholesterole...|
// |                 CBC|             Neocyte|    0.4980|259680000:::11573...|Neocyte:::Blood g...|
// |                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
// |           platelets| Increased platelets|    0.5267|6631009:::2596800...|Increased platele...|
// |                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
// +--------------------+--------------------+----------+--------------------+--------------------+
//

Contains all the parameters and methods to train a ChunkEntityResolverModel. It transform a dataset with two Input Annotations of types TOKEN and WORD_EMBEDDINGS, coming from e.g. ChunkTokenizer and ChunkEmbeddings Annotators and returns the normalized entity for a particular trained ontology / curated dataset. (e.g. ICD-10, RxNorm, SNOMED etc.)

To use pretrained models please use ChunkEntityResolverModel and see the Models Hub for available models.

Input Annotator Types: TOKEN, WORD_EMBEDDINGS

Output Annotator Type: ENTITY

Scala API: ChunkEntityResolverApproach
Show Example

from johnsnowlabs import nlp, medical
# Training a SNOMED model
# Define pre-processing pipeline for training data. It needs consists of columns for the normalized training data and their labels.
document = nlp.DocumentAssembler() \
    .setInputCol("normalized_text") \
    .setOutputCol("document")

chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk")

token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel\
    .pretrained("embeddings_healthcare_100d", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

chunkEmb = nlp.ChunkEmbeddings() \
    .setInputCols(["chunk", "embeddings"]) \
    .setOutputCol("chunk_embeddings")

snomedTrainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    chunkEmb
])

snomedTrainingModel = snomedTrainingPipeline.fit(data)

snomedData = snomedTrainingModel.transform(data).cache()

# Then the Resolver can be trained with
snomedExtractor = medical.ChunkEntityResolverApproach() \
    .setInputCols(["token", "chunk_embeddings"]) \
    .setOutputCol("recognized") \
    .setNeighbours(1000) \
    .setAlternatives(25) \
    .setNormalizedCol("normalized_text") \
    .setLabelCol("label") \
    .setEnableWmd(True).setEnableTfidf(True).setEnableJaccard(True) \
    .setEnableSorensenDice(True).setEnableJaroWinkler(True).setEnableLevenshtein(True) \
    .setDistanceWeights([1, 2, 2, 1, 1, 1]) \
    .setAllDistancesMetadata(True) \
    .setPoolingStrategy("MAX") \
    .setThreshold(1e32)
model = snomedExtractor.fit(snomedData)

import spark.implicits._
// Training a SNOMED model
// Define pre-processing pipeline for training data. It needs consists of columns for the normalized training data and their labels.
val document = new DocumentAssembler()
    .setInputCol("normalized_text")
    .setOutputCol("document")

val chunk = new Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("chunk")

val token = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel\
    .pretrained("embeddings_healthcare_100d", "en", "clinical/models")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val chunkEmb = new ChunkEmbeddings()
    .setInputCols(Array("chunk", "embeddings"))
    .setOutputCol("chunk_embeddings")

val snomedTrainingPipeline = new Pipeline().setStages(Array(
    document,
    chunk,
    token,
    embeddings,
    chunkEmb
))

val snomedTrainingModel = snomedTrainingPipeline.fit(data)

val snomedData = snomedTrainingModel.transform(data).cache()

// Then the Resolver can be trained with
val snomedExtractor = new ChunkEntityResolverApproach()
    .setInputCols(Array("token", "chunk_embeddings"))
    .setOutputCol("recognized")
    .setNeighbours(1000)
    .setAlternatives(25)
    .setNormalizedCol("normalized_text")
    .setLabelCol("label")
    .setEnableWmd(true).setEnableTfidf(true).setEnableJaccard(true)
    .setEnableSorensenDice(true).setEnableJaroWinkler(true).setEnableLevenshtein(true)
    .setDistanceWeights(Array(1, 2, 2, 1, 1, 1))
    .setAllDistancesMetadata(true)
    .setPoolingStrategy("MAX")
    .setThreshold(1e32)
val model = snomedExtractor.fit(snomedData)

ChunkFilterer

Model

Filters entities coming from CHUNK annotations. Filters can be set via a white list of terms or a regular expression. White list criteria is enabled by default. To use regex, criteria has to be set to regex.

Parametres:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

  • criteria: Tag representing what is the criteria to filter the chunks. Possibles values are: - isIn: Filter by the chunk - regex: Filter using a regex

  • whiteList: If defined, list of entities to process. The rest will be ignored.

  • blackList: If defined, list of entities to ignore. The rest will be processed.

  • regex: If defined, list of regex to process the chunks (Default: []).

  • filterEntity: If equal to “entity”, use the ner label to filter. If set to “result”, use the result attribute of the annotation to filter.

  • entitiesConfidence: Path to csv with pairs (entity,confidenceThreshold). Filter the chunks with entities which have confidence lower than the confidence threshold.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputcols().

Input Annotator Types: DOCUMENT,CHUNK

Output Annotator Type: CHUNK

Python API: ChunkFilterer Scala API: ChunkFilterer Notebook: ChunkFiltererNotebook
Show Example

from johnsnowlabs import nlp, medical

# Filtering POS tags
# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = medical.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["gastroenteritis"])

pipeline = nlp.Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

data = spark.createDataFrame([["Has a past history of gastroenteritis and stomach pain, however patient ..."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+---------------------------------------------------------------------------------+
|col                                                                              |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
+---------------------------------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

# Filtering POS tags
# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = finance.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["rate"])

pipeline = nlp.Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

data = spark.createDataFrame([["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
from johnsnowlabs import nlp, legal

# Filtering POS tags
# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = legal.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["rate"])

pipeline = nlp.Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

data = spark.createDataFrame([["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

// Filtering POS tags
// First pipeline stages to extract the POS tags are defined

import spark.implicits._

val docAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val posTagger = PerceptronModel.pretrained()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("pos") 

val chunker = new Chunker()
 .setInputCols(Array("pos","sentence")) 
 .setOutputCol("chunk") 
 .setRegexParsers(Array("(<NN>) +")) 

val chunkerFilter = new ChunkFilterer()
 .setInputCols(Array("sentence","chunk")) 
 .setOutputCol("filtered") 
 .setCriteria("isin") 
 .setWhiteList(Array("gastroenteritis"))

val pipeline = new Pipeline().setStages(Array(
  docAssembler, 
  sentenceDetector, 
  tokenizer, 
  posTagger, 
  chunker, 
  chunkerFilter)) 

val text ="""Has a past history of gastroenteritis and stomach pain, however patient ..."""

val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)

// result.selectExpr("explode(chunk)").show(truncate=false)
+---------------------------------------------------------------------------------+
|col                                                                              |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
+---------------------------------------------------------------------------------+

// result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
import spark.implicits._

val docAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val posTagger = PerceptronModel.pretrained()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("pos") 

val chunker = new Chunker()
 .setInputCols(Array("pos","sentence")) 
 .setOutputCol("chunk") 
 .setRegexParsers(Array("(<NN>) +")) 

val chunkerFilter = new ChunkFilterer()
 .setInputCols(Array("sentence","chunk")) 
 .setOutputCol("filtered") 
 .setCriteria("isin") 

val pipeline = new Pipeline().setStages(Array(
  docAssembler, 
  sentenceDetector, 
  tokenizer, 
  posTagger, 
  chunker, 
  chunkerFilter)) 

val text ="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."""

val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)

// result.selectExpr("explode(chunk)").show(truncate=false)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

// result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
import spark.implicits._

val docAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val posTagger = PerceptronModel.pretrained()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("pos") 

val chunker = new Chunker()
 .setInputCols(Array("pos","sentence")) 
 .setOutputCol("chunk") 
 .setRegexParsers(Array("(<NN>) +")) 

val chunkerFilter = new ChunkFilterer()
 .setInputCols(Array("sentence","chunk")) 
 .setOutputCol("filtered") 
 .setCriteria("isin") 

val pipeline = new Pipeline().setStages(Array(
  docAssembler, 
  sentenceDetector, 
  tokenizer, 
  posTagger, 
  chunker, 
  chunkerFilter)) 

val text ="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."""

val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)

// result.selectExpr("explode(chunk)").show(truncate=false)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

// result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

ChunkKeyPhraseExtraction

Model

Chunk KeyPhrase Extraction uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text. The input to the model consists of chunk annotations and sentence or document annotation. The model compares the chunks against the corresponding sentences/documents and selects the chunks which are most representative of the broader text context (i.e. the document or the sentence they belong to). The key phrases candidates (i.e. the input chunks) can be generated in various ways, e.g. by NGramGenerator, TextMatcher or NerConverter. The model operates either at sentence (selecting the most descriptive chunks from the sentence they belong to) or at document level. In the latter case, the key phrases are selected to represent all the input document annotations.

Parametres:

  • setConcatenateSentences(value: Boolean): Concatenate the input sentence/documentation annotations before computing their embedding Default value is ‘true’.

  • setDivergence(value: Float): Set the level of divergence of the extracted key phrases.

  • setDocumentLevelProcessing(value: Boolean): Extract key phrases from the whole document (true) or from particular sentences which the chunks refer to (false) Default value is ‘true’.

  • setDropPunctuation(value: Boolean): Remove punctuation marks from input chunks.

  • setSelectMostDifferent(value: Boolean): Let the model return the top N key phrases which are the most different from each other.

  • setTopN(value: Int): Set the number of key phrases to extract.

This model is a subclass of [[BertSentenceEmbeddings]] and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at the Models Hub.

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkKeyPhraseExtraction Scala API: ChunkKeyPhraseExtraction Notebook: ChunkKeyPhraseExtractionNotebook
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \

embeddings = nlp.WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = medical.NerModel().pretrained("ner_jsl_slim", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = nlp.NerConverter()\
    .setInputCols("sentences", "tokens", "ner_tags")\
    .setOutputCol("ner_chunks")

key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ner_chunks"])\
    .setOutputCol("ner_chunk_key_phrases")

pipeline = nlp.Pipeline(stages=[
    documenter, 
    sentencer, 
    tokenizer, 
    embeddings, 
    ner_tagger, 
    ner_converter,
    key_phrase_extractor])

data = spark.createDataFrame([["Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."]]).toDF("text")
results = pipeline.fit(data).transform(data)

results.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
       .selectExpr("key_phrase.result",
                   "key_phrase.metadata.entity",
                   "key_phrase.metadata.DocumentSimilarity",
                   "key_phrase.metadata.MMRScore").show(truncate=False)

+--------+-------------------------+------------------+-----------------+
|result  |entity                   |DocumentSimilarity|MMRScore         |
+--------+-------------------------+------------------+-----------------+
|Diabetes|Disease_Syndrome_Disorder|0.66827321499841  |0.400963944931921|
+--------+-------------------------+------------------+-----------------+
from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

key_phrase_extractor = finance.ChunkKeyPhraseExtraction\
    .pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentence", "ner_chunk"])\
    .setOutputCol("ner_chunk_key_phrases")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    key_phrase_extractor])

text = ["""In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)")."""]

data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

result.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
      .selectExpr("key_phrase.result",
                  "key_phrase.metadata.entity",
                  "key_phrase.metadata.DocumentSimilarity",
                  "key_phrase.metadata.MMRScore").show(truncate=False)

+------------------------------+-------+------------------+-------------------+
|result                        |entity |DocumentSimilarity|MMRScore           |
+------------------------------+-------+------------------+-------------------+
|Policy Compliance             |PRODUCT|0.6446724461374882|0.38680348305268175|
|Spell Security Private Limited|ORG    |0.6282153013401193|0.3769291957818915 |
+------------------------------+-------+------------------+-------------------+
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias","en","legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

key_phrase_extractor = legal.ChunkKeyPhraseExtraction\
    .pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentence", "ner_chunk"])\
    .setOutputCol("ner_chunk_key_phrases")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    key_phrase_extractor])

text = ["""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
"""]

data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

result.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
      .selectExpr("key_phrase.result",
                  "key_phrase.metadata.entity",
                  "key_phrase.metadata.DocumentSimilarity",
                  "key_phrase.metadata.MMRScore").show(truncate=False)

+--------------+------+------------------+-------------------+
|result        |entity|DocumentSimilarity|MMRScore           |
+--------------+------+------------------+-------------------+
|Buyer Entities|ALIAS |0.5680936022739617|0.34085617490878395|
+--------------+------+------------------+-------------------+

import spark.implicits._

val documenter = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentencer = new SentenceDetector()
  .setInputCols(Array("document")) 
  .setOutputCol("sentences") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("document")) 
  .setOutputCol("tokens") 

val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical","en","clinical/models") 
  .setInputCols(Array("document","tokens")) 
  .setOutputCol("embeddings") 

val ner_tagger = MedicalNerModel.pretrained("ner_jsl_slim","en","clinical/models") 
  .setInputCols(Array("sentences","tokens","embeddings")) 
  .setOutputCol("ner_tags") 
 
val ner_converter = new NerConverter()
  .setInputCols("sentences","tokens","ner_tags") 
  .setOutputCol("ner_chunks") 

val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()
  .setTopN(1) 
  .setDocumentLevelProcessing(false) 
  .setDivergence(0.4) 
  .setInputCols(Array("sentences","ner_chunks")) 
  .setOutputCol("ner_chunk_key_phrases") 

val pipeline = new Pipeline().setStages(Array( 
  documenter, 
  sentencer, 
  tokenizer, 
  embeddings, 
  ner_tagger, 
  ner_converter, 
  key_phrase_extractor)) 

val text ="""Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."""
val data = Seq(text).toDF("text")

val results = pipeline.fit(data).transform(data)

+--------+-------------------------+------------------+-----------------+
|result  |entity                   |DocumentSimilarity|MMRScore         |
+--------+-------------------------+------------------+-----------------+
|Diabetes|Disease_Syndrome_Disorder|0.66827321499841  |0.400963944931921|
+--------+-------------------------+------------------+-----------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
  .setInputCols(Array("document") ) 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence")) 
  .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embeddings") 

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
  .setInputCols(Array("sentence","token","embeddings")) 
  .setOutputCol("ner") 

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 

val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained() 
  .setTopN(1) 
  .setDocumentLevelProcessing(false) 
  .setDivergence(0.4) 
  .setInputCols(Array("sentence","ner_chunk")) 
  .setOutputCol("ner_chunk_key_phrases")

val nlpPipeline = new Pipeline().setStages(Array( 
  documentAssembler, 
  sentenceDetector, 
  tokenizer, 
  embeddings, 
  ner_model, n
  er_converter, 
  key_phrase_extractor) ) 

val text ="""In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)."""
val data = Seq(text).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+------------------------------+-------+------------------+-------------------+
|result                        |entity |DocumentSimilarity|MMRScore           |
+------------------------------+-------+------------------+-------------------+
|Policy Compliance             |PRODUCT|0.6446724461374882|0.38680348305268175|
|Spell Security Private Limited|ORG    |0.6282153013401193|0.3769291957818915 |
+------------------------------+-------+------------------+-------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
  .setInputCols(Array("document") ) 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence")) 
  .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embeddings") 

val ner_model = LegalNerModel.pretrained("legner_orgs_prods_alias","en","legal/models")
  .setInputCols(Array("sentence","token","embeddings")) 
  .setOutputCol("ner") 

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 

val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained() 
  .setTopN(1) 
  .setDocumentLevelProcessing(false) 
  .setDivergence(0.4) 
  .setInputCols(Array("sentence","ner_chunk")) 
  .setOutputCol("ner_chunk_key_phrases")

val nlpPipeline = new Pipeline().setStages(Array( 
  documentAssembler, 
  sentenceDetector, 
  tokenizer, 
  embeddings, 
  ner_model, n
  er_converter, 
  key_phrase_extractor) ) 

val text ="""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""
val data = Seq(text).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+--------------+------+------------------+-------------------+
|result        |entity|DocumentSimilarity|MMRScore           |
+--------------+------+------------------+-------------------+
|Buyer Entities|ALIAS |0.5680936022739617|0.34085617490878395|
+--------------+------+------------------+-------------------+

ChunkMapper

We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.

This is the AnnotatorModel of the ChunkMapper, which can be used to access pretrained models with the .pretrained() or .load() methods. To train a new model, check the documentation of the ChunkMapperApproach annotator.

The annotator also allows using fuzzy matching, which can take into consideration parts of the tokens tha can map even when word order is different, char ngrams that can map even when thre are typos, and using fuzzy distance metric (Jaccard, Levenshtein, etc.).

Parametres:

  • setRels (List[str]): Relations that we are going to use to map the chunk

  • setLowerCase (Boolean): Set if we want to map the chunks in lower case or not (Default: True)

  • setAllowMultiTokenChunk (Boolean): Whether to skip relations with multitokens (Default: True)

  • setMultivaluesRelations (Boolean): Whether to decide to return all values in a relation together or separately (Default: False)

Example usage and more details can be found on Spark NLP Workshop repository accessible in GitHub, for example the notebook Healthcare Chunk Mapping.

Input Annotator Types: CHUNK

Output Annotator Type: LABEL_DEPENDENCY

Python API: ChunkMapperModel Scala API: ChunkMapperModel Notebook: ChunkMapperModelNotebook
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentences"])\
  .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
  .pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentences", "tokens"])\
  .setOutputCol("embeddings")

ner_tagger = medical.NerModel()\
  .pretrained("ner_posology", "en", "clinical/models")\
  .setInputCols("sentences", "tokens", "embeddings")\
  .setOutputCol("ner_tags")

ner_converter = medical.NerConverterInternal()\
  .setInputCols(["sentences", "tokens", "ner_tags"])\
  .setOutputCol("ner_chunks")\
  .setWhiteList(["DRUG"])

chunkToDoc = nlp.Chunk2Doc()\
  .setInputCols("ner_chunks")\
  .setOutputCol("ner_chunks_doc")

sbert_embedder = nlp.BertSentenceEmbeddings\
  .pretrained("sbiobert_base_cased_mli", "en","clinical/models")\
  .setInputCols(["ner_chunks_doc"])\
  .setOutputCol("sbert_embeddings")\
  .setCaseSensitive(False)

rxnorm_resolver = medical.SentenceEntityResolverModel\
  .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\
  .setInputCols(["sbert_embeddings"])\
  .setOutputCol("rxnorm_code")\
  .setDistanceFunction("EUCLIDEAN")\

resolver2chunk = medical.Resolution2Chunk()\
  .setInputCols(["rxnorm_code"]) \
  .setOutputCol("rxnorm_chunk")\

chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper", "en", "clinical/models")\
  .setInputCols(["rxnorm_chunk"])\
  .setOutputCol("rxnorm_drug_brandname_mapper")\
  .setRels(["rxnorm_brandname"])

pipeline = nlp.Pipeline(
    stages = [
        documenter,
        sentencer,
        tokenizer,
        words_embedder,
        ner_tagger,
        ner_converter,
        chunkToDoc,
        sbert_embedder,
        rxnorm_resolver,
        resolver2chunk,
        chunkerMapper
        ])


data = spark.createDataFrame([["The doctor prescribed Sinequan 150 MG for depression and Zonalon 50 mg for managing skin itching"]]).toDF("text")

result= pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunks.result,
                                     result.rxnorm_code.result)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("ner_chunks"),
                          F.expr("cols['1']").alias("rxnorm_code")).show(15, truncate=100)

+----------+-----------+----------------------------+
|ner_chunks|rxnorm_code|rxnorm_drug_brandname_mapper|
+----------+-----------+----------------------------+
|  Sinequan|     224915|         Sinequan (Sinequan)|
|   Zonalon|       9801|           Zonalon (Zonalon)|
+----------+-----------+----------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

tokenizer = nlp.Tokenizer()\
  .setInputCols("document")\
  .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["document", "token"]) \
  .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\
  .setInputCols(["document", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_chunk")

CM = finance.ChunkMapperModel.pretrained('finmapper_nasdaq_ticker_stock_screener', 'en', 'finance/models')\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  tokenizer, 
  embeddings,
  ner_model, 
  ner_converter, 
  CM])
                                 
text = ["""There are some serious purchases and sales of AMZN stock today."""]

data = spark.createDataFrame([text]).toDF("text")

result = pipeline.fit(data).transform(data)

+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result|result                                                                                                                                                             |
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[AMZN]|[AMZN, Amazon.com Inc. Common Stock, $98.12, 2.85, 2.991%, 9.98556270184E11, United States, 1997, 85412563, Consumer Discretionary, Catalog/Specialty Distribution]|
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

tokenizer = nlp.Tokenizer()\
  .setInputCols("document")\
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \
  .setInputCols(['document', 'token']) \
  .setOutputCol('embeddings')

ner_model = nlp.NerDLModel.pretrained("onto_100", "en") \
  .setInputCols(["document", "token", "embeddings"]) \
  .setOutputCol("ner")
 
ner_converter = nlp.NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_chunk")\
  .setWhiteList(["CARDINAL"])

CM = legal.ChunkMapperModel().pretrained("legmapper_edgar_irs", "en", "legal/models")\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  tokenizer, 
  embeddings,
  ner_model, 
  ner_converter, 
  CM])

text = ["""873474341 is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services"""]

data = spark.createDataFrame([text]).toDF("text")

result= pipeline.fit(data).transform(data)

+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result     |result                                                                                                                                                               |
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[873474341]|[Masterworks 096, LLC, RETAIL-RETAIL STORES, NEC [5990], 5990, 873474341, 1231, NY, DE, 225 LIBERTY STREET, NEW YORK, NY, 10281, 2035185172, , , 2022-01-10, 1894064]|
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documenter = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentencer = new SentenceDetector()
 .setInputCols("document")
 .setOutputCol("sentences") 

val tokenizer = new Tokenizer()
 .setInputCols("sentences")
 .setOutputCol("tokens") 

val words_embedder = WordEmbeddingsModel
 .pretrained("embeddings_clinical","en","clinical/models") 
 .setInputCols(Array("sentences","tokens")) 
 .setOutputCol("embeddings") 

val ner_tagger = MedicalNerModel
 .pretrained("ner_posology","en","clinical/models") 
 .setInputCols(Array("sentences","tokens","embeddings"))
 .setOutputCol("ner_tags") 

val ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentences","tokens","ner_tags")) 
 .setOutputCol("ner_chunks") 
 .setWhiteList("DRUG") 

val chunkToDoc = new Chunk2Doc()
 .setInputCols("ner_chunks") 
 .setOutputCol("ner_chunks_doc") 

val sbert_embedder = BertSentenceEmbeddings
 .pretrained("sbiobert_base_cased_mli","en","clinical/models") 
 .setInputCols("ner_chunks_doc")
 .setOutputCol("sbert_embeddings") 
 .setCaseSensitive(false) 

val rxnorm_resolver = SentenceEntityResolverModel
 .pretrained("sbiobertresolve_rxnorm_augmented","en","clinical/models") 
 .setInputCols("sbert_embeddings")
 .setOutputCol("rxnorm_code") 
 .setDistanceFunction("EUCLIDEAN") 

val resolver2chunk = new Resolution2Chunk()
 .setInputCols("rxnorm_code")
 .setOutputCol("rxnorm_chunk") 

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper","en","clinical/models")
 .setInputCols("rxnorm_chunk")
 .setOutputCol("rxnorm_drug_brandname_mapper") 
 .setRels(Array("rxnorm_brandname")) 

val pipeline = new Pipeline().setStages(Array(
 documenter, 
 sentencer, 
 tokenizer, 
 words_embedder, 
 ner_tagger, 
 ner_converter, 
 chunkToDoc, 
 sbert_embedder, 
 rxnorm_resolver, 
 resolver2chunk,
  chunkerMapper )) 

val text ="""The doctor prescribed Sinequan 150 MG for depression and Zonalon 50 mg for managing skin itching"""
val data = Seq(text).toDF("text")

val result= mapper_pipeline.fit(data).transform(data)

+----------+-----------+----------------------------+
|ner_chunks|rxnorm_code|rxnorm_drug_brandname_mapper|
+----------+-----------+----------------------------+
|  Sinequan|     224915|         Sinequan (Sinequan)|
|   Zonalon|       9801|           Zonalon (Zonalon)|
+----------+-----------+----------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val tokenizer = new Tokenizer()
 .setInputCols("document") 
 .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("document","token")) 
 .setOutputCol("embeddings") 

val ner_model = FinanceNerModel.pretrained("finner_ticker","en","finance/models")
 .setInputCols(Array("document","token","embeddings")) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("document","token","ner")) 
 .setOutputCol("ner_chunk") 

val CM = ChunkMapperModel.pretrained("finmapper_nasdaq_ticker_stock_screener","en","finance/models")
 .setInputCols("ner_chunk")
 .setOutputCol("mappings") 

val pipeline = new Pipeline().setStages(Array( 
  document_assembler, 
  tokenizer, 
  embeddings, 
  ner_model, 
  ner_converter, 
  CM) ) 
 
val text ="""There are some serious purchases and sales of AMZN stock today."""
val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result|result                                                                                                                                                             |
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[AMZN]|[AMZN, Amazon.com Inc. Common Stock, $98.12, 2.85, 2.991%, 9.98556270184E11, United States, 1997, 85412563, Consumer Discretionary, Catalog/Specialty Distribution]|
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val tokenizer = new Tokenizer()
 .setInputCols("document") 
 .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("glove_100d")
 .setInputCols(Array("document","token")) 
 .setOutputCol("embeddings") 

val ner_model = NerDLModel.pretrained("onto_100","en")
 .setInputCols(Array("document","token","embeddings")) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("document","token","ner")) 
 .setOutputCol("ner_chunk") 
 .setWhiteList(Array("CARDINAL")) 

val CM = ChunkMapperModel.pretrained("legmapper_edgar_irs","en","legal/models") 
.setInputCols("ner_chunk")
.setOutputCol("mappings") 

val pipeline = new Pipeline().setStages(Array( 
  document_assembler, 
  tokenizer, 
  embeddings, 
  ner_model, 
  ner_converter, 
  CM) ) 

val text ="""873474341 is an American multinational corporation that is engaged in the design,development,manufacturing,and worldwide marketing and sales of footwear,apparel,equipment,accessories,and services"""
val data = Seq(text).toDF("text")

val result= pipeline.fit(data).transform(data)

+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result     |result                                                                                                                                                               |
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[873474341]|[Masterworks 096, LLC, RETAIL-RETAIL STORES, NEC [5990], 5990, 873474341, 1231, NY, DE, 225 LIBERTY STREET, NEW YORK, NY, 10281, 2035185172, , , 2022-01-10, 1894064]|
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.

This is the AnnotatorApproach of the ChunkMapper, which can be used to train ChunkMapper models by giving a custom mapping dictionary. To use pretriained models, check the documentation of the ChunkMapperModel annotator.

The annotator also allows using fuzzy matching, which can take into consideration parts of the tokens tha can map even when word order is different, char ngrams that can map even when thre are typos, and using fuzzy distance metric (Jaccard, Levenshtein, etc.).

Example usage and more details can be found on Spark NLP Workshop repository accessible in GitHub, for example the notebook Healthcare Chunk Mapping.

Input Annotator Types: CHUNK

Output Annotator Type: LABEL_DEPENDENCY

Python API: ChunkMapperApproach Scala API: ChunkMapperApproach Notebook: ChunkMapperApproachModelNotebook
Show Example

from johnsnowlabs import nlp, medical

# First, create a dictionay in JSON format following this schema:
import json
data_set= {
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }
      ]
    }
  ]
}

with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)


# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

#NER model to detect drug in the text
clinical_ner =  medical.NerModel.pretrained("ner_posology_small","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner")\
    .setLabelCasing("upper")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DRUG"])

chunkerMapper =  medical.ChunkMapperApproach()\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setDictionary("/content/sample_drug.json")\
    .setRels(["action"]) #or treatment

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    chunkerMapper])

text = ["The patient was given 1 unit of metformin daily."]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

model.stages[-1].write().save("models/drug_mapper")
from johnsnowlabs import nlp, finance

# First, create a dictionay in JSON format following this schema:
import json

data_set= {
  "mappings": [
    {
      "key": "Rayton Solar Inc.",
      "relations": [
        {
          "key": "name",
          "values" : ['Rayton Solar Inc.']
        },
        {
          "key": "sic",
          "values" : ['SEMICONDUCTORS & RELATED DEVICES [3674]']
        }]
    }]
}

with open('sample_finance.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")

finance_ner = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
  .setInputCols(["sentence","token","ner"])\
  .setOutputCol("ner_chunk")\
  .setWhiteList(["ORG"]) # Return only ORG entities

chunkerMapper =  finance.ChunkMapperApproach()\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")\
  .setDictionary("/content/sample_finance.json")\
  .setRels(all_rels)

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  sentence_detector,
  tokenizer,
  word_embeddings,
  finance_ner,
  ner_converter,
  chunkerMapper])

text = ["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. "]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

model.stages[-1].write().save("models/finance_mapper")
from johnsnowlabs import nlp, legal

# First, create a dictionay in JSON format following this schema:
import json

data_set= {
  "mappings": [
    {
      "key": "Rayton Solar Inc.",
      "relations": [
        {
          "key": "name",
          "values" : ['Rayton Solar Inc.']
        },
        {
          "key": "sic",
          "values" : ['SEMICONDUCTORS & RELATED DEVICES [3674]']
        }]
    }]
}

with open('sample_legal.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
  .setInputCols(["sentence","token","ner"])\
  .setOutputCol("ner_chunk")\
  .setWhiteList(["ORG"]) # Return only ORG entities

chunkerMapper =  legal.ChunkMapperApproach()\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")\
  .setDictionary("/content/sample_legal.json")\
  .setRels(all_rels)

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  sentence_detector,
  tokenizer,
  word_embeddings,
  legal_ner,
  ner_converter,
  chunkerMapper])

text = ["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. "]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

model.stages[-1].write().save("models/legal_mapper")

import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
 .setInputCols("document")
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") //NER model to detect drug in the text 

val clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 
 .setLabelCasing("upper") 

val ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 
 .setWhiteList(Array("DRUG")) 

val chunkerMapper = new ChunkMapperApproach()
 .setInputCols("ner_chunk") 
 .setOutputCol("mappings") 
 .setDictionary("/content/sample_drug.json") 
 .setRels(Array("action") ) //or treatment 

val pipeline = new Pipeline()
 .setStages(Array(
  document_assembler, 
  sentence_detector, 
  tokenizer, 
  word_embeddings, 
  clinical_ner, 
  ner_converter, 
  chunkerMapper) ) 
val text = new Array("The patient was given 1 unit of metformin daily.") 

val test_data = seq(Array(text)) .toDF("text") 
val model = pipeline.fit(test_data) 

res= model.transform(test_data) 
model.stagesArray(-1) .write() .save("models/drug_mapper")
import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
 .setInputCols("document")
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("sentence","token") ) 
 .setOutputCol("embeddings") 

val finance_ner = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
 .setInputCols(Array("sentence","token","embeddings") ) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("sentence","token","ner") ) 
 .setOutputCol("ner_chunk") 
 .setWhiteList(Array("ORG") ) // Return only ORG entities 

val chunkerMapper = new ChunkMapperApproach()
 .setInputCols("ner_chunk")
 .setOutputCol("mappings") 
 .setDictionary("/content/sample_json") 
 .setRels(all_rels) 

val pipeline = new Pipeline()
 .setStages(Array( 
  document_assembler, 
  sentence_detector, 
  tokenizer, 
  word_embeddings, 
  finance_ner, 
  ner_converter, 
  chunkerMapper) ) 

val text = new Array("AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. ") 

val test_data = seq(Array(text)).toDF("text") 

val model = pipeline.fit(test_data) 
res= model.transform(test_data) 

model.stagesArray(-1) .write() .save("models/finance_mapper")
import spark.implicits._
 
val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
 .setInputCols("document")
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 

val legal_ner = LegalNerModel.pretrained("legner_org_per_role_date","en","legal/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 
 .setWhiteList("ORG") // Return only ORG entities 

val chunkerMapper = new ChunkMapperApproach()
 .setInputCols("ner_chunk")
 .setOutputCol("mappings") 
 .setDictionary("/content/sample_json") 
 .setRels(all_rels) 

val pipeline = new Pipeline()
 .setStages(Array( 
  document_assembler, 
  sentence_detector, 
  tokenizer, 
  word_embeddings, 
  legal_ner, 
  ner_converter, 
  chunkerMapper) ) 

val text = new Array("AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. ") 
val test_data = seq(Array(text) ) .toDF("text") 
val model = pipeline.fit(test_data) 

res= model.transform(test_data) 
model.stagesArray(-1) .write() .save("models/legal_mapper")

ChunkMapperFilterer

Model

ChunkMapperFilterer is an annotator to be used after ChunkMapper that allows to filter chunks based on the results of the mapping, whether it was successful or failed.

Parametres:

  • ReturnCriteria (String): Has two possible values: “success” or “fail”. If “fail” (default), returns the chunks that are not in the label dependencies; if “success”, returns the labels that were successfully mapped by the ChunkMapperModel annotator.

Example usage and more details can be found on Spark NLP Workshop repository accessible in GitHub, for example the notebook Healthcare Chunk Mapping.

Input Annotator Types: CHUNK, LABEL_DEPENDENCY

Output Annotator Type: CHUNK

Python API: ChunkMapperFilterer Scala API: ChunkMapperFilterer Notebook: ChunkMapperFiltererNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
      .setInputCols("sentence", "token", "ner")\
      .setOutputCol("chunk")

chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
      .setInputCols(["chunk"])\
      .setOutputCol("RxNorm_Mapper")\
      .setRels(["rxnorm_code"])

chunk_mapper_filterer = medical.ChunkMapperFilterer() \
      .setInputCols(["chunk", "RxNorm_Mapper"]) \
      .setOutputCol("chunks_fail") \
      .setReturnCriteria("fail")

mapper_pipeline = nlp.Pipeline(
      stages = [
          document_assembler,
          sentence_detector,
          tokenizer,
          word_embeddings,
          ner_model,
          ner_converter,
          chunkerMapper,
          chunkerMapper,
          chunk_mapper_filterer
      ])

samples = [["The patient was given Adapin 10 MG, coumadn 5 mg"],
           ["The patient was given Avandia 4 mg, Tegretol, zitiga"] ]
           
data = spark.createDataFrame(samples).toDF("text")

result = mapper_pipeline.fit(data).transform(data)

result.selectExpr("chunk.result as chunk", 
                  "RxNorm_Mapper.result as RxNorm_Mapper", 
                  "chunks_fail.result as chunks_fail").show(truncate = False)

+--------------------------------+----------------------+--------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |
+--------------------------------+----------------------+--------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |
+--------------------------------+----------------------+--------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
 
val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols("sentence", "token")
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")
    .setInputCols("sentence", "token", "embeddings")
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols("sentence", "token", "ner")
    .setOutputCol("chunk")

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")
    .setInputCols("chunk")
    .setOutputCol("RxNorm_Mapper")
    .setRels(Array("rxnorm_code"))

val chunk_mapper_filterer = new ChunkMapperFilterer()
    .setInputCols("chunk", "RxNorm_Mapper")
    .setOutputCol("chunks_fail")
    .setReturnCriteria("fail")

val mapper_pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    chunkerMapper,
    chunk_mapper_filterer
    ))


val data = Seq("The patient was given Adapin 10 MG, coumadn 5 mg",
"The patient was given Avandia 4 mg, Tegretol, zitiga").toDF("text")

val result = mapper_pipeline.fit(data).transform(data)

+--------------------------------+----------------------+--------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |
+--------------------------------+----------------------+--------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |
+--------------------------------+----------------------+--------------+

ChunkMerge

Approach

Merges two chunk columns coming from two annotators(NER, ContextualParser or any other annotator producing chunks). The merger of the two chunk columns is made by selecting one chunk from one of the columns according to certain criteria. The decision on which chunk to select is made according to the chunk indices in the source document. (chunks with longer lengths and highest information will be kept from each source) Labels can be changed by setReplaceDictResource.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.
  • mergeOverlapping: (Boolean) Sets whether to merge overlapping matched chunks. Default True.
  • falsePositivesResource: Sets file with false positive pairs
  • replaceDictResource: Sets replace dictionary pairs for NER labels
  • blackList: (String List) If defined, list of entities to ignore. The rest will be processed.
  • whiteList: (String List) If defined, list of entities to accept.
  • selectionStrategy: (String) Sets Whether to select annotations sequentially based on annotation order Sequential or using any other available strategy; currently only Sequential and DiverseLonger are available. Default Sequential.
  • orderingFeatures: (String List) The ordering features to use for overlapping entities. Possible values are ChunkBegin, ChunkLength, ChunkPrecedence, ChunkConfidence.
  • defaultConfidence: (Float) Sets when ChunkConfidence ordering feature is included and a given annotation does not have any confidence. The value of this param will be used as a confidence score for annotations without a confidence score.
  • chunkPrecedence: (String List) Sets what is the precedence order when a chunk labeled by two models.
  • chunkPrecedenceValuePrioritization: (String List) Sets when ChunkPrecedence ordering feature is used. This param contains an Array of comma-separated values representing the desired order of prioritization for the values in the metadata fields included from chunkPrecedence.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputcols().

Input Annotator Types: CHUNK, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkMergeApproach Scala API: ChunkMergeApproach
Show Example

from johnsnowlabs import nlp, medical

# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# 1- ner_clinical model
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("clinical_ner")

clinical_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "clinical_ner"]) \
    .setOutputCol("clinical_ner_chunk")

# 2- posology ner model
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("posology_ner")

posology_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "posology_ner"]) \
    .setOutputCol("posology_ner_chunk")

# 3- generate a text matcher annotator that extracts female related entities
entities = ['she', 'her', 'girl', 'woman', 'women', 'womanish', 'womanlike', 'womanly', 'madam', 'madame', 'senora', 'lady', 'miss', 'girlfriend', 'wife', 'bride', 'misses', 'mrs.', 'female']
with open ('female_entities.txt', 'w') as f:
    for i in entities:
        f.write(i+'\n')

# Find female entities using TextMatcher
female_entity_extractor = nlp.TextMatcher() \
    .setInputCols(["sentence",'token'])\
    .setOutputCol("female_entities")\
    .setEntities("female_entities.txt")\
    .setCaseSensitive(False)\
    .setEntityValue('female_entity')

# Chunk Merge annotator is used to merge columns
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("posology_ner_chunk", 'clinical_ner_chunk', "female_entities")\
    .setOutputCol('merged_ner_chunk')

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    clinical_ner_converter,
    posology_ner,
    posology_ner_converter,
    female_entity_extractor,
    chunk_merger])

sample_text = """The lady was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."""


data = spark.createDataFrame([[sample_text]]).toDF("text")
model = nlpPipeline.fit(data).transform(data)

# Show results
model.selectExpr("explode(merged_ner_chunk) as a") \
  .selectExpr("a.begin","a.end","a.result as chunk","a.metadata.entity as entity") \
  .show(10, False)
+-----+---+-----------------------------+-------------+
|begin|end|chunk                        |entity       |
+-----+---+-----------------------------+-------------+
|4    |7  |lady                         |female_entity|
|47   |57 |amoxicillin                  |DRUG         |
|63   |91 |a respiratory tract infection|PROBLEM      |
|95   |97 |She                          |female_entity|
|106  |114|metformin                    |DRUG         |
|118  |126|glipizide                    |TREATMENT    |
|134  |146|dapagliflozin                |TREATMENT    |
|152  |155|T2DM                         |PROBLEM      |
|161  |172|atorvastatin                 |DRUG         |
|178  |188|gemfibrozil                  |TREATMENT    |
+-----+---+-----------------------------+-------------+
from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert_embeddings")

fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter =  finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"

ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "bert_embeddings"]) \
    .setOutputCol("ner_finner") 
    #.setLabelCasing("upper")

ner_converter_finner = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") \
    .setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER

chunk_merge =  finance.ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge])

data = spark.createDataFrame([["Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"]]).toDF("text")

# Show results
result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result, 
                                     result.deid_merged_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|founder              |ROLE     |
|CEO                  |ROLE     |
|Amazon               |PARTY    |
+---------------------+---------+
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"})

ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_signers") 
    #.setLabelCasing("upper")

ner_converter_signers = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_signers"]) \
    .setOutputCol("ner_signer_chunk")

chunk_merge = legal.ChunkMergeApproach()\
    .setInputCols("ner_signer_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge])


data = spark.createDataFrame([["ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.

2THEMART.COM, INC.:  I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99    Date: 6/11/99 "]]).toDF("text")

# Show results
result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result, 
                                     result.deid_merged_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+-----------------------+--------------+
|chunk                  |ner_label     |
+-----------------------+--------------+
|ENTIRE AGREEMENT       |DOC           |
|INC                    |PARTY         |
|J. Magliarditi         |SIGNING_PERSON|
|Bajaj                  |SIGNING_PERSON|
|Dominic J. Magliarditi |SIGNING_PERSON|
|Sanjay Bajaj           |SIGNING_PERSON|
|President              |SIGNING_TITLE |
|VP Business Development|SIGNING_TITLE |
+-----------------------+--------------+

import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP 
val documentAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 
 
// Sentence Detector annotator,processes various sentences per line 
val sentenceDetector = new SentenceDetector()
 .setInputCols("document")
 .setOutputCol("sentence") 
 
// Tokenizer splits words in a relevant format for NLP 
val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 
 
// Clinical word embeddings trained on PubMED dataset 
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 
 
// 1- ner_clinical model 
val clinical_ner = MedicalNerModel.pretrained("ner_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("clinical_ner") 

val clinical_ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","clinical_ner")) 
 .setOutputCol("clinical_ner_chunk") 
 
// 2- posology ner model 
val posology_ner = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("posology_ner") 

val posology_ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","posology_ner")) 
 .setOutputCol("posology_ner_chunk") 
 
// 3- generate a text matcher annotator that extracts female related entities 
val entities = new Array("she","her","girl","woman","women","womanish","womanlike","womanly","madam","madame","senora","lady","miss","girlfriend","wife","bride","misses","mrs.","female")

with open ('female_entities.txt', 'w') as f:
    for i in entities:
        f.write(i+'\n')
 
// Find female entities using TextMatcher 
val female_entity_extractor = new TextMatcher()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("female_entities") 
 .setEntities("female_entities.txt") 
 .setCaseSensitive(false) 
 .setEntityValue("female_entity") 
 
// Chunk Merge annotator is used to merge columns 
val chunk_merger = new ChunkMergeApproach()
 .setInputCols(Array("posology_ner_chunk","clinical_ner_chunk","female_entities"))
 .setOutputCol("merged_ner_chunk") 

val nlpPipeline = new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    clinical_ner, 
    clinical_ner_converter, 
    posology_ner, 
    posology_ner_converter, 
    female_entity_extractor, 
    chunk_merger)) 

val text ="""The lady was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."""
val data = Seq(text).toDF("text")

val model = nlpPipeline.fit(data).transform(data)

+-----+---+-----------------------------+-------------+
|begin|end|chunk                        |entity       |
+-----+---+-----------------------------+-------------+
|4    |7  |lady                         |female_entity|
|47   |57 |amoxicillin                  |DRUG         |
|63   |91 |a respiratory tract infection|PROBLEM      |
|95   |97 |She                          |female_entity|
|106  |114|metformin                    |DRUG         |
|118  |126|glipizide                    |TREATMENT    |
|134  |146|dapagliflozin                |TREATMENT    |
|152  |155|T2DM                         |PROBLEM      |
|161  |172|atorvastatin                 |DRUG         |
|178  |188|gemfibrozil                  |TREATMENT    |
+-----+---+-----------------------------+-------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCol("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCol("sentence")
    .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val bert_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("bert_embeddings")

val fin_ner = FinanceNerModel.pretrained('finner_deid', "en", "finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

val ner_converter =  new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"

val ner_finner = FinanceNerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(Array("sentence", "token", "bert_embeddings"))
    .setOutputCol("ner_finner") 
    #.setLabelCasing("upper")

val ner_converter_finner = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner_finner"))
    .setOutputCol("ner_finner_chunk")
    .setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER

val chunk_merge =  new ChunkMergeApproach()
    .setInputCols(Array("ner_finner_chunk", "ner_chunk"))
    .setOutputCol("deid_merged_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge))

val data = Seq(("Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon")).toDF("text")

# Show results
result = nlpPipeline.fit(data).transform(data)

+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|founder              |ROLE     |
|CEO                  |ROLE     |
|Amazon               |PARTY    |
+---------------------+---------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCol("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCol("sentence")
    .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val legal_ner = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"})

val ner_signers = LegalNerModel.pretrained("legner_signers", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner_signers") 
    #.setLabelCasing("upper")

val ner_converter_signers = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner_signers"))
    .setOutputCol("ner_signer_chunk")

val chunk_merge = new ChunkMergeApproach()
    .