Enterprise NLP Annotators

 

A Spark NLP Enterprise license includes access to unique annotators. At the Spark NLP Workshop you can see different types of annotators in action.

By clicking on any annotator, you will see different sections:

  • The Approach, or class to train models.
  • The Model, to infer using pretrained models.
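
The split between an Approach and a Model follows the usual Spark ML estimator/transformer pattern: an Approach is fitted on data and returns a Model, and the Model only performs inference. The snippet below is a minimal pure-Python sketch of that idea. `ToyLabelApproach` and `ToyLabelModel` are hypothetical names invented for illustration and are not Spark NLP classes.

```python
class ToyLabelApproach:
    """Hypothetical trainable stage: learns a word -> label lookup from examples."""

    def fit(self, labeled_words):
        # "Training" here is just building a lowercase lookup table.
        lookup = {word.lower(): label for word, label in labeled_words}
        return ToyLabelModel(lookup)


class ToyLabelModel:
    """Hypothetical fitted stage: holds learned state and only performs inference."""

    def __init__(self, lookup):
        self.lookup = lookup

    def transform(self, words):
        # Unknown words get the conventional "O" (outside) label.
        return [self.lookup.get(word.lower(), "O") for word in words]


# Train once with the Approach, then reuse the resulting Model for inference.
model = ToyLabelApproach().fit([("naproxen", "DRUG"), ("daily", "FREQUENCY")])
labels = model.transform(["Naproxen", "daily", "pain"])
print(labels)  # ['DRUG', 'FREQUENCY', 'O']
```

A pretrained Model corresponds to skipping the `fit` step and loading the learned state directly.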

Also, for most of the annotators, you will find examples for the different enterprise libraries:

  • Healthcare NLP
  • Finance NLP
  • Legal NLP

Check out the Spark NLP Annotators page for more information on how to read this page.

Available Annotators

Annotators Description
AnnotationMerger Merge annotations from different pipeline steps that have the same annotation type into a unified annotation.
AssertionChunkConverter Uses both the begin and end indices of the tokens as input to add more robust metadata to the chunk column, which improves the reliability of the indices and avoids loss of data.
AssertionDL A deep-learning-based approach used to extract Assertion Status from extracted entities and text.
AssertionFilterer Filters entities coming from ASSERTION type annotations and returns the CHUNKS.
AssertionLogReg Logistic Regression is used to extract Assertion Status from extracted entities and text.
AverageEmbeddings Computes the mean of vector embeddings for two sentences of equal size, producing a unified representation.
BertForSequenceClassification Can load Bert Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.
BertForTokenClassifier Can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) for Named-Entity-Recognition (NER) tasks.
BertSentenceChunkEmbeddings This annotator combines sentence and NER chunk embeddings to enhance resolution codes, leveraging contextual information in the embeddings for more precise results. It takes sentence context and NER chunks as input and produces embeddings for each chunk, facilitating input for the resolution model.
Chunk2Token A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).
ChunkConverter This annotator merges NER-detected entities with RegexMatcher-based rules for unified processing in the pipeline.
ChunkEntityResolver Returns a normalized entity for a particular trained ontology / curated dataset (e.g. clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc).
ChunkFilterer Filters entities coming from CHUNK annotations.
ChunkKeyPhraseExtraction Uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text.
ChunkMapper We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.
ChunkMapperFilterer Annotator to be used after ChunkMapper that filters chunks based on the results of the mapping, whether it was successful or failed.
ChunkMerge Merges entities coming from different CHUNK annotations.
ChunkSentenceSplitter Splits documents into chunks according to separators given as CHUNK columns. It is useful when you need to apply different models or analyses to different sections of your document.
ContextualParser Extracts entities from a document based on user-defined rules.
DateNormalizer This annotator transforms date mentions to a common standard format: YYYY/MM/DD. It is useful when using data from different sources, sometimes from different countries that use different formats to represent dates.
DeIdentification Deidentifies Input Annotations of types DOCUMENT, TOKEN and CHUNK, by either masking or obfuscating the given CHUNKS.
DistilBertForSequenceClassification Can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.
Doc2ChunkInternal Converts DOCUMENT, TOKEN typed annotations into CHUNK type with the contents of a chunkCol.
DocMapper Uses the text representation of document annotations to map clinical codes to other codes or relevant information.
DocumentFiltererByClassifier This annotator sorts documents based on classifier results. It uses white and black lists, allowing or blocking specific outcomes. It can be case-sensitive or case-insensitive for broader matching. This tool efficiently organizes documents based on classifier outcomes.
DocumentHashCoder This annotator swaps dates in a document column with hash codes from another column, creating a new column with shifted day information. The subsequent DeIdentification annotator anonymizes the document, incorporating the altered dates.
DocumentLogRegClassifier Classifies documents with a Logistic Regression algorithm.
DocumentMLClassifier Classifies documents with a Logistic Regression algorithm.
DrugNormalizer Annotator which normalizes raw text from documents, e.g. scraped web pages or xml documents.
EntityChunkEmbeddings Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector representation of related entity chunks.
EntityRulerInternal This annotator matches exact strings or regex patterns provided in a file against a Document and assigns them a named entity.
FeaturesAssembler Collects features from different columns.
FewShotClassifier This annotator specifically targets few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data.
Flattener Flattener annotator in Spark NLP converts annotation results into a simplified DataFrame format for easier analysis and interpretation.
GenericClassifier Creates a generic single-label classifier which uses pre-generated Tensorflow graphs.
GenericLogRegClassifier Is a derivative of GenericClassifier which implements a multinomial logistic regression.
GenericSVMClassifier Creates a generic single-label classifier which uses pre-generated Tensorflow graphs.
InternalDocumentSplitter This annotator splits large documents into small documents.
IOBTagger Merges token tags and NER labels from chunks in the specified format.
NameChunkObfuscator This annotator transforms a dataset with an Input Annotation of type CHUNK into its obfuscated version by obfuscating the given CHUNKS.
NerChunker Extracts phrases that fit into a known pattern using the NER tags.
NerConverterInternal Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label.
NerDisambiguator Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB).
NerModel This Named Entity recognition annotator is a generic NER model based on Neural Networks.
NerQuestionGenerator This annotator takes an NER chunk (obtained by, e.g., NerConverterInternal) and generates questions based on two entity types, a pronoun and a strategy.
QuestionAnswering GPT-based model for answering questions given a context.
RegexMatcherInternal RegexMatcherInternal matches predefined regex patterns with entities in text, allowing for flexible entity recognition based on user-defined rules.
ReIdentification Reidentifies obfuscated entities by DeIdentification.
RelationExtraction Extracts and classifies instances of relations between named entities.
RelationExtractionDL Extracts and classifies instances of relations between named entities.
RENerChunksFilter Filters and outputs combinations of relations between extracted entities, for further processing.
Replacer This annotator replaces entities in the original text with the ones extracted by the annotators NameChunkObfuscatorApproach or DateNormalizer.
Resolution2Chunk This annotator is responsible for converting the annotations generated by entity resolver models (typically labeled as ENTITY) into a format compatible with subsequent stages of the pipeline, such as the ChunkMapperModel.
ResolverMerger This annotator merges the output columns of the sentence entity resolver and chunk mapper models.
Router This annotator splits the output of an annotator by a selected metadata field and the value of that field.
SentenceEntityResolver Returns the normalized entity for a particular trained ontology / curated dataset (e.g. clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc) based on sentence embeddings.
Summarizer Helps to quickly summarize complex medical information.
TextGenerator Uses the basic BioGPT model to perform various tasks related to medical text abstraction.
TextMatcherInternal This annotator matches exact phrases provided in a file against a Document.
TFGraphBuilder Creates Tensorflow graphs.
WindowedSentenceModel This annotator helps you merge the previous and following sentences of a given piece of text, so that you add the context surrounding them.
ZeroShotNerModel Performs zero-shot named entity recognition using RoBertaForQuestionAnswering. It identifies entities across diverse data without domain-specific fine-tuning.
ZeroShotRelationExtractionModel This annotator implements zero-shot binary relation extraction by utilizing BERT transformer models trained on the NLI (Natural Language Inference) task.

AnnotationMerger

Model

Merge annotations from different pipeline steps that have the same annotation type into a unified annotation. Possible annotations that can be merged include:

  • document (e.g., output of DocumentAssembler annotator)
  • token (e.g., output of Tokenizer annotator)
  • word_embeddings (e.g., output of WordEmbeddingsModel annotator)
  • sentence_embeddings (e.g., output of BertSentenceEmbeddings annotator)
  • category (e.g., output of RelationExtractionModel annotator)
  • date (e.g., output of DateMatcher annotator)
  • sentiment (e.g., output of SentimentDLModel annotator)
  • pos (e.g., output of PerceptronModel annotator)
  • chunk (e.g., output of NerConverter annotator)
  • named_entity (e.g., output of NerDLModel annotator)
  • regex (e.g., output of RegexTokenizer annotator)
  • dependency (e.g., output of DependencyParserModel annotator)
  • language (e.g., output of LanguageDetectorDL annotator)
  • keyword (e.g., output of YakeModel annotator)

Parameters:

  • inputType: The type of the annotations that you want to merge. Possible values are the annotation types listed above (for example, category or chunk).

Input Annotator Types: ANY

Output Annotator Type: ANY
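
To make the merging behavior concrete, here is a minimal pure-Python sketch of the idea: annotations from several input columns are concatenated, in input-column order, into one output column. It does not use Spark NLP. The `Annotation` class and the `merge_annotations` function are hypothetical stand-ins, and the type check that raises an error is an assumption made for illustration rather than documented library behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Simplified, hypothetical stand-in for an annotation record."""
    annotator_type: str
    result: str
    metadata: dict = field(default_factory=dict)


def merge_annotations(columns, input_type):
    """Concatenate annotations from several columns, keeping column order.

    Raises ValueError if an annotation's type differs from input_type
    (an assumption of this sketch, not confirmed library behavior).
    """
    merged = []
    for column in columns:
        for annotation in column:
            if annotation.annotator_type != input_type:
                raise ValueError(
                    f"expected {input_type!r}, got {annotation.annotator_type!r}"
                )
            merged.append(annotation)
    return merged


# Two columns of the same annotation type, as produced by two relation models.
ade_relations = [Annotation("category", "1"), Annotation("category", "1")]
pos_relations = [
    Annotation("category", "DOSAGE-DRUG"),
    Annotation("category", "DRUG-DURATION"),
]

all_relations = merge_annotations([ade_relations, pos_relations], input_type="category")
print([a.result for a in all_relations])  # ['1', '1', 'DOSAGE-DRUG', 'DRUG-DURATION']
```

The ordering mirrors the example results on this page, where the merged column lists the results of the first input column followed by those of the second.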

Python API: AnnotationMerger Scala API: AnnotationMerger Notebook: AnnotationMerger

from johnsnowlabs import nlp, medical

# Create the pipeline with two RE models
documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

pos_ner_tagger = medical.NerModel()\
    .pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_pos")

pos_ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_pos"])\
    .setOutputCol("pos_ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

pos_reModel = medical.RelationExtractionModel()\
    .pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "pos_ner_chunks", "dependencies"])\
    .setOutputCol("pos_relations")\
    .setMaxSyntacticDistance(4)

ade_ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ade_ner_tags")  

ade_ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ade_ner_tags"])\
    .setOutputCol("ade_ner_chunks")

ade_reModel = medical.RelationExtractionModel()\
    .pretrained("re_ade_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ade_ner_chunks", "dependencies"])\
    .setOutputCol("ade_relations")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade", "ade-drug"])

annotation_merger = medical.AnnotationMerger()\
    .setInputCols("ade_relations", "pos_relations")\
    .setInputType("category")\
    .setOutputCol("all_relations")

merger_pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    pos_ner_tagger,
    pos_ner_chunker,
    dependency_parser,
    pos_reModel,
    ade_ner_tagger,
    ade_ner_chunker,
    ade_reModel,
    annotation_merger
])

# Show example result
text = """
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands.. 
"""
data = spark.createDataFrame([[text]]).toDF("text")

result = merger_pipeline.fit(data).transform(data)
result.selectExpr("pos_relations.result as PosologyRelation", 
                  "ade_relations.result as AdeRelation", 
                  "all_relations.result as MergedRelation").show(truncate=False)

+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|PosologyRelation                                         |AdeRelation|MergedRelation                                                 |
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|[DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|[1, 1]     |[1, 1, DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|
+---------------------------------------------------------+-----------+---------------------------------------------------------------+

from johnsnowlabs import nlp, finance

# Create the pipeline with two RE models
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = finance.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_dates")

ner_converter_date = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_dates"])\
    .setOutputCol("ner_chunk_date")

ner_model_org = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_orgs")

ner_converter_org = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_orgs"])\
    .setOutputCol("ner_chunk_org")

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols('ner_chunk_org', "ner_chunk_date")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE"])\
    .setMaxSyntacticDistance(10)

reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_acq")\
    .setPredictionThreshold(0.1)

reDL_alias = finance.RelationExtractionDLModel().pretrained('finre_org_prod_alias', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_alias")\
    .setPredictionThreshold(0.1)

annotation_merger = finance.AnnotationMerger()\
    .setInputCols("relations_acq", "relations_alias")\
    .setOutputCol("relations")\
    .setInputType("category")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_date,
        ner_converter_date,
        ner_model_org,
        ner_converter_org,
        chunk_merger,
        pos,
        dependency_parser,
        re_filter,
        reDL,
        reDL_alias,
        annotation_merger])


# Show example result
text ="""Definite-lived intangible assets acquired with Cadence’s fiscal 2021 acquisitions were as follows:
 
Acquisition Date Fair Value
Weighted Average Amortization Period
 
(In thousands)
 (in years)
Existing technology
$
59,100 
13.7 years
Agreements and relationships
28,900 
13.7 years
Tradenames, trademarks and patents
4,600 
14.3 years
Total acquired intangibles with definite lives
$
92,600 
13.7 years
2020 Acquisitions
In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation ("AWR") and Integrand Software, Inc. ("Integrand"). These acquisitions enhanced Cadence’s technology portfolio to address growing radio frequency design activity, driven by expanding use of 5G communications.
The aggregate cash consideration for these acquisitions was $195.6 million, after taking into account cash acquired of $1.5 million. The total purchase consideration was allocated to the assets acquired and liabilities assumed based on their respective estimated fair values on the acquisition dates. Cadence will also make payments to certain employees, subject to continued employment and other performance-based conditions, through the first quarter of fiscal 2023.
 With its acquisitions of AWR and Integrand, Cadence recorded $101.3 million of definite-lived intangible assets with a weighted average amortization period of approximately nine years. The definite-lived intangible assets related primarily to existing technology and customer agreements and relationships. Cadence also recorded $119.4 million of goodwill and $25.1 million of net liabilities, consisting primarily of deferred tax liabilities, assumed deferred revenue and trade accounts receivable. The recorded goodwill was primarily related to the acquired assembled workforce and expected synergies from combining operations of the acquired companies with Cadence. None of the goodwill related to the acquisitions of AWR and Integrand is deductible for tax purposes.
Cadence completed one additional acquisition during fiscal 2020 that was not material to the consolidated financial statements. 
Pro Forma Financial Information
Cadence has not presented pro forma financial information for any of the businesses it acquired during fiscal 2021 and fiscal 2020 because the results of operations for these businesses are not material to Cadence’s consolidated financial statements.
Acquisition-Related Transaction Costs
Transaction costs associated with acquisitions, which consist of professional fees and administrative costs, were not material during fiscal 2021, 2020 or 2019 and were expensed as incurred in Cadence’s consolidated income statements.
NOTE 7. GOODWILL AND ACQUIRED INTANGIBLES
Goodwill
The changes in the carrying amount of goodwill during fiscal 2021 and 2020 were as follows:
 
Gross CarryingAmount
 
(In thousands)
Balance as of December 28, 2019
$
661,856 
Goodwill resulting from acquisitions
120,564 
Effect of foreign currency translation
(333)
Balance as of January 2, 2021
782,087 
Goodwill resulting from acquisitions
154,362 
Effect of foreign currency translation
(8,091)
Balance as of January 1, 2022
$
928,358 
Cadence completed its annual goodwill impairment test during the third quarter of fiscal 2021 and determined that the fair value of Cadence’s single reporting unit exceeded the carrying amount of its net assets and that no impairment existed.
65"""

data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)


# Show the results 
result.selectExpr("relations_acq.result as AcqRelation", 
                  "relations_alias.result as AliasRelation", 
                  "relations.result as MergedRelation").show(truncate=False)

+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|AcqRelation                                                                                     |AliasRelation                                                                           |MergedRelation                                                                                                                                                                          |
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other]|[has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

from johnsnowlabs import nlp, legal

# Create the pipeline with two RE models
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = legal.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model_date = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_parties")

ner_converter_date = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_parties"])\
    .setOutputCol("ner_chunk_parties")

ner_model_org = legal.NerModel.pretrained("legner_whereas_md", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_whereas")

ner_converter_org = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_whereas"])\
    .setOutputCol("ner_chunk_whereas")

chunk_merger = legal.ChunkMergeApproach()\
    .setInputCols('ner_chunk_whereas', "ner_chunk_parties")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = legal.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setMaxSyntacticDistance(10)

reDL = legal.RelationExtractionDLModel().pretrained("legre_contract_doc_parties_md", "en", "legal/models")\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_parties")\
    .setPredictionThreshold(0.1)

reDL_alias = legal.RelationExtractionDLModel().pretrained("legre_whereas", "en", "legal/models")\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_whereas")\
    .setPredictionThreshold(0.1)

annotation_merger = legal.AnnotationMerger()\
    .setInputCols("relations_parties", "relations_whereas")\
    .setOutputCol("relations")\
    .setInputType("category")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_date,
        ner_converter_date,
        ner_model_org,
        ner_converter_org,
        chunk_merger,
        pos,
        dependency_parser,
        re_filter,
        reDL,
        reDL_alias,
        annotation_merger])


# Show example result
text = """
WHEREAS, the Company Entities own certain Copyrights and Know-How which may be used in the Arizona Field, and in connection with the transactions contemplated by the Stock Purchase Agreement, Arizona desires to obtain a license from the Company Entities to use such Intellectual Property on the terms and subject to the conditions set forth herein.
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

# Show the results 
result.selectExpr("relations_parties.result as PartiesRelation", 
                  "relations_whereas.result as WhereasRelation", 
                  "relations.result as MergedRelation").show(truncate=False)

+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|PartiesRelation              |WhereasRelation                       |MergedRelation                                                     |
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|[signed_by, other, signed_by]|[has_subject, has_subject, has_object]|[signed_by, other, signed_by, has_subject, has_subject, has_object]|
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+

import spark.implicits._

// Create the pipeline with two RE models
val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentences"))
  .setOutputCol("tokens")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("embeddings")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("pos_tags")

val pos_ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ner_pos")

val pos_ner_chunker = new NerConverterInternal()
  .setInputCols(Array("sentences", "tokens", "ner_pos"))
  .setOutputCol("pos_ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentences", "pos_tags", "tokens"))
  .setOutputCol("dependencies")

val pos_reModel = RelationExtractionModel.pretrained("posology_re")
  .setInputCols(Array("embeddings", "pos_tags", "pos_ner_chunks", "dependencies"))
  .setOutputCol("pos_relations")
  .setMaxSyntacticDistance(4)

val ade_ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ade_ner_tags")

val ade_ner_chunker = new NerConverterInternal()
  .setInputCols(Array("sentences", "tokens", "ade_ner_tags"))
  .setOutputCol("ade_ner_chunks")

val ade_reModel = RelationExtractionModel.pretrained("re_ade_clinical", "en", "clinical/models")
  .setInputCols(Array("embeddings", "pos_tags", "ade_ner_chunks", "dependencies"))
  .setOutputCol("ade_relations")
  .setMaxSyntacticDistance(10)
  .setRelationPairs(Array("drug-ade", "ade-drug"))

val annotation_merger = new AnnotationMerger()
  .setInputCols(Array("ade_relations", "pos_relations"))
  .setInputType("category")
  .setOutputCol("all_relations")

val merger_pipeline = new Pipeline().setStages(Array(
  documenter,
  sentencer,
  tokenizer,
  words_embedder,
  pos_tagger,
  pos_ner_tagger,
  pos_ner_chunker,
  dependency_parser,
  pos_reModel,
  ade_ner_tagger,
  ade_ner_chunker,
  ade_reModel,
  annotation_merger
))


// Show example result

val text =
  """
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands..
"""

val data = Seq(text).toDF("text")
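val result = merger_pipeline.fit(data).transform(data)

// Show the results
result.selectExpr("pos_relations.result as PosologyRelation",
                  "ade_relations.result as AdeRelation",
                  "all_relations.result as MergedRelation").show(false)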

+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|PosologyRelation                                         |AdeRelation|MergedRelation                                                 |
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|[DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|[1, 1]     |[1, 1, DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|
+---------------------------------------------------------+-----------+---------------------------------------------------------------+

import spark.implicits._

// Create the pipeline with two RE models
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model_date = FinanceNerModel.pretrained("finner_sec_dates", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_dates")

val ner_converter_date = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_dates"))
  .setOutputCol("ner_chunk_date")

val ner_model_org = FinanceNerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_orgs")

val ner_converter_org = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_orgs"))
  .setOutputCol("ner_chunk_org")

val chunk_merger = new ChunkMergeApproach()
  .setInputCols(Array("ner_chunk_org", "ner_chunk_date"))
  .setOutputCol("ner_chunk")

val pos = PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos", "token"))
  .setOutputCol("dependencies")

val re_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunk")
  .setRelationPairs(Array("ORG-ORG", "ORG-DATE"))
  .setMaxSyntacticDistance(10)

val reDL = RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_acq")
  .setPredictionThreshold(0.1)

val reDL_alias = RelationExtractionDLModel.pretrained("finre_org_prod_alias", "en", "finance/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_alias")
  .setPredictionThreshold(0.1)

val annotation_merger = new AnnotationMerger()
  .setInputCols("relations_acq", "relations_alias")
  .setOutputCol("relations")
  .setInputType("category")

val nlpPipeline = new Pipeline().setStages(Array(
  document_assembler,
  text_splitter,
  tokenizer,
  embeddings,
  ner_model_date,
  ner_converter_date,
  ner_model_org,
  ner_converter_org,
  chunk_merger,
  pos,
  dependency_parser,
  re_filter,
  reDL,
  reDL_alias,
  annotation_merger
))

// Show example result
val text = """
Definite-lived intangible assets acquired with Cadence’s fiscal 2021 acquisitions were as follows:
 
Acquisition Date Fair Value
Weighted Average Amortization Period
 
(In thousands)
 (in years)
Existing technology
$
59,100 
13.7 years
Agreements and relationships
28,900 
13.7 years
Tradenames, trademarks and patents
4,600 
14.3 years
Total acquired intangibles with definite lives
$
92,600 
13.7 years
2020 Acquisitions
In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation ("AWR") and Integrand Software, Inc. ("Integrand"). These acquisitions enhanced Cadence’s technology portfolio to address growing radio frequency design activity, driven by expanding use of 5G communications.
The aggregate cash consideration for these acquisitions was $195.6 million, after taking into account cash acquired of $1.5 million. The total purchase consideration was allocated to the assets acquired and liabilities assumed based on their respective estimated fair values on the acquisition dates. Cadence will also make payments to certain employees, subject to continued employment and other performance-based conditions, through the first quarter of fiscal 2023.
 With its acquisitions of AWR and Integrand, Cadence recorded $101.3 million of definite-lived intangible assets with a weighted average amortization period of approximately nine years. The definite-lived intangible assets related primarily to existing technology and customer agreements and relationships. Cadence also recorded $119.4 million of goodwill and $25.1 million of net liabilities, consisting primarily of deferred tax liabilities, assumed deferred revenue and trade accounts receivable. The recorded goodwill was primarily related to the acquired assembled workforce and expected synergies from combining operations of the acquired companies with Cadence. None of the goodwill related to the acquisitions of AWR and Integrand is deductible for tax purposes.
Cadence completed one additional acquisition during fiscal 2020 that was not material to the consolidated financial statements. 
Pro Forma Financial Information
Cadence has not presented pro forma financial information for any of the businesses it acquired during fiscal 2021 and fiscal 2020 because the results of operations for these businesses are not material to Cadence’s consolidated financial statements.
Acquisition-Related Transaction Costs
Transaction costs associated with acquisitions, which consist of professional fees and administrative costs, were not material during fiscal 2021, 2020 or 2019 and were expensed as incurred in Cadence’s consolidated income statements.
NOTE 7. GOODWILL AND ACQUIRED INTANGIBLES
Goodwill
The changes in the carrying amount of goodwill during fiscal 2021 and 2020 were as follows:
 
Gross CarryingAmount
 
(In thousands)
Balance as of December 28, 2019
$
661,856 
Goodwill resulting from acquisitions
120,564 
Effect of foreign currency translation
(333)
Balance as of January 2, 2021
782,087 
Goodwill resulting from acquisitions
154,362 
Effect of foreign currency translation
(8,091)
Balance as of January 1, 2022
$
928,358 
Cadence completed its annual goodwill impairment test during the third quarter of fiscal 2021 and determined that the fair value of Cadence’s single reporting unit exceeded the carrying amount of its net assets and that no impairment existed.
65
"""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|AcqRelation                                                                                     |AliasRelation                                                                           |MergedRelation                                                                                                                                                                          |
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other]|[has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

// Create the pipeline with two RE models
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model_date = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_parties")

val ner_converter_date = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_parties"))
  .setOutputCol("ner_chunk_parties")

val ner_model_org = LegalNerModel.pretrained("legner_whereas_md", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_whereas")

val ner_converter_org = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_whereas"))
  .setOutputCol("ner_chunk_whereas")

val chunk_merger = new ChunkMergeApproach()
  .setInputCols(Array("ner_chunk_whereas", "ner_chunk_parties"))
  .setOutputCol("ner_chunk")

val pos = PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos", "token"))
  .setOutputCol("dependencies")

val re_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunk")
  .setMaxSyntacticDistance(10)

val reDL = RelationExtractionDLModel.pretrained("legre_contract_doc_parties_md", "en", "legal/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_parties")
  .setPredictionThreshold(0.1)

val reDL_alias = RelationExtractionDLModel.pretrained("legre_whereas", "en", "legal/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_whereas")
  .setPredictionThreshold(0.1)

val annotation_merger = new AnnotationMerger()
  .setInputCols("relations_parties", "relations_whereas")
  .setOutputCol("relations")
  .setInputType("category")

val nlpPipeline = new Pipeline().setStages(Array(
  document_assembler,
  text_splitter,
  tokenizer,
  embeddings,
  ner_model_date,
  ner_converter_date,
  ner_model_org,
  ner_converter_org,
  chunk_merger,
  pos,
  dependency_parser,
  re_filter,
  reDL,
  reDL_alias,
  annotation_merger
))

// Show example result

val text = """WHEREAS, the Company Entities own certain Copyrights and Know-How which may be used in the Arizona Field, and in connection with the transactions contemplated by the Stock Purchase Agreement, Arizona desires to obtain a license from the Company Entities to use such Intellectual Property on the terms and subject to the conditions set forth herein.
"""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|PartiesRelation              |WhereasRelation                       |MergedRelation                                                     |
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|[signed_by, other, signed_by]|[has_subject, has_subject, has_object]|[signed_by, other, signed_by, has_subject, has_subject, has_object]|
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+

AssertionChunkConverter

Model

This annotator creates a CHUNK column with metadata useful for training an Assertion Status Detection model (see AssertionDL).

Building the chunk column from token indices alone can sometimes fail, and the affected rows are then lost as training data for assertion status models.

The AssertionChunkConverter annotator uses both the begin and end indices of the tokens as input to add more robust metadata to the chunk column in a way that improves the reliability of the indices and avoids loss of data.

Parameters:

  • chunkBeginCol: (Str) The column containing the start index of the chunk.

  • chunkEndCol: (Str) The column containing the end index of the chunk.

  • chunkTextCol: (Str) The column containing the text chunk.

  • outputTokenBeginCol: (Str) The column containing the selected token start index.

  • outputTokenEndCol: (Str) The column containing the selected token end index.

NOTE: Chunk begin and end indices in the assertion status model training dataframe can be populated using the new version of the ALAB module.
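
The idea of resolving character offsets to token indices can be pictured with a small pure-Python sketch. It does not call the library: the regex tokenizer, the function names, and the choice to treat the chunk end offset as exclusive are all assumptions made for illustration, and the real annotator may resolve indices differently.

```python
import re

# Hypothetical stand-in for a tokenizer: words (keeping apostrophes) or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize_with_offsets(text):
    """Return (token, begin, end) triples, where end is the offset of the token's last character."""
    return [(m.group(), m.start(), m.end() - 1) for m in TOKEN_PATTERN.finditer(text)]

def char_span_to_token_span(text, char_begin, char_end):
    """Map a character span to (token_begin, token_end) token indices.

    Assumption: char_end is exclusive, so the end token is the last token
    that finishes before char_end.
    """
    tokens = tokenize_with_offsets(text)
    # First token whose character range contains char_begin.
    token_begin = next(i for i, (_, b, e) in enumerate(tokens) if b <= char_begin <= e)
    # Last token that ends strictly before char_end.
    token_end = max(i for i, (_, _, e) in enumerate(tokens) if e < char_end)
    return token_begin, token_end

text = "After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs"
tokens = tokenize_with_offsets(text)
token_begin, token_end = char_span_to_token_span(text, 31, 34)
print(token_begin, token_end, tokens[token_begin][0], tokens[token_end][0])  # 5 5 PCP PCP
```

Under this sketch's convention, an end offset that points at the last character of the chunk (an inclusive end) would select the preceding token as the end token, so it is worth checking which convention your offsets follow before training.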

Input Annotator Types: TOKEN

Output Annotator Type: CHUNK

Python API: AssertionChunkConverter Scala API: AssertionChunkConverter Notebook: AssertionChunkConverterNotebook

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("tokens")

converter = medical.AssertionChunkConverter()\
  .setInputCols("tokens")\
  .setChunkTextCol("target")\
  .setChunkBeginCol("char_begin")\
  .setChunkEndCol("char_end")\
  .setOutputTokenBeginCol("token_begin")\
  .setOutputTokenEndCol("token_end")\
  .setOutputCol("chunk")


pipeline = nlp.Pipeline().setStages(
    [document_assembler, 
     sentenceDetector, 
     tokenizer, 
     converter]
)

data = spark.createDataFrame([
    ["An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.","Minnie", 57, 64,],
    ["After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation","PCP",31,34,],
]).toDF("text", "target", "char_begin", "char_end")

results = pipeline.fit(data).transform(data)

results.selectExpr(
    "target",
    "char_begin",
    "char_end",
    "token_begin",
    "token_end",
    "tokens[token_begin].result",
    "tokens[token_end].result",
    "target",
    "chunk",
).show(truncate=False)

+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk                                         |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57        |64      |10         |10       |Minnie                    |Minnie                  |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP   |31        |34      |5          |5        |PCP                       |PCP                     |PCP   |[{chunk, 31, 33, PCP, {sentence -> 0}, []}]   |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("tokens")

converter = finance.AssertionChunkConverter()\
  .setInputCols("tokens")\
  .setChunkTextCol("target")\
  .setChunkBeginCol("char_begin")\
  .setChunkEndCol("char_end")\
  .setOutputTokenBeginCol("token_begin")\
  .setOutputTokenEndCol("token_end")\
  .setOutputCol("chunk")


pipeline = nlp.Pipeline().setStages(
    [document_assembler, 
     sentenceDetector, 
     tokenizer, 
     converter]
)

data = spark.createDataFrame([
    ["Tom Martin worked as Cadence's CTO until 2010","Cadence's CTO",21,33],
    ["Mrs. Charles was before Managing Director at a big consultancy company","Managing Director",24,40],
]).toDF("text", "target", "char_begin", "char_end")

results = pipeline.fit(data).transform(data)

results.selectExpr(
    "target",
    "char_begin",
    "char_end",
    "token_begin",
    "token_end",
    "tokens[token_begin].result",
    "tokens[token_end].result",
    "target",
    "chunk",
).show(truncate=False)

+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
|target           |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target           |chunk                                                    |
+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
|Cadence's CTO    |21        |33      |4          |4        |Cadence's                 |Cadence's               |Cadence's CTO    |[{chunk, 21, 29, Cadence's CTO, {sentence -> 0}, []}]    |
|Managing Director|24        |40      |5          |5        |Managing                  |Managing                |Managing Director|[{chunk, 24, 31, Managing Director, {sentence -> 0}, []}]|
+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("tokens")

converter = legal.AssertionChunkConverter()\
  .setInputCols("tokens")\
  .setChunkTextCol("target")\
  .setChunkBeginCol("char_begin")\
  .setChunkEndCol("char_end")\
  .setOutputTokenBeginCol("token_begin")\
  .setOutputTokenEndCol("token_end")\
  .setOutputCol("chunk")


pipeline = nlp.Pipeline().setStages(
    [document_assembler,
     sentenceDetector,
     tokenizer, 
     converter]
)

data = spark.createDataFrame([
    ["This Agreement may be executed by different parties hereto","parties",44,50,],
    ["The Administrative Agent will determine the Dollar Equivalent amount","Agent",19,23,],
]).toDF("text", "target", "char_begin", "char_end")

results = pipeline.fit(data).transform(data)

results.selectExpr(
    "target",
    "char_begin",
    "char_end",
    "token_begin",
    "token_end",
    "tokens[token_begin].result",
    "tokens[token_end].result",
    "target",
    "chunk",
).show(truncate=False)

+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|target |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target |chunk                                          |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|parties|44        |50      |7          |6        |parties                   |different               |parties|[{chunk, 44, 42, parties, {sentence -> 0}, []}]|
|Agent  |19        |23      |2          |1        |Agent                     |Administrative          |Agent  |[{chunk, 19, 17, Agent, {sentence -> 0}, []}]  |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("tokens")

val converter = new AssertionChunkConverter()
  .setInputCols("tokens")
  .setOutputCol("chunk")
  .setChunkTextCol("target")
  .setChunkBeginCol("char_begin")
  .setChunkEndCol("char_end")
  .setOutputTokenBeginCol("token_begin")
  .setOutputTokenEndCol("token_end")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  sentenceDetector,
  tokenizer,
  converter
))

val data = Seq(
    ("An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.", "Minnie", 57, 64),
    ("After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation", "PCP", 31, 34)
).toDF("text", "target", "char_begin", "char_end")

val results = pipeline.fit(data).transform(data)

+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk                                         |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57        |64      |10         |10       |Minnie                    |Minnie                  |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP   |31        |34      |5          |5        |PCP                       |PCP                     |PCP   |[{chunk, 31, 33, PCP, {sentence -> 0}, []}]   |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("tokens")

val converter = new AssertionChunkConverter()
  .setInputCols("tokens")
  .setOutputCol("chunk")
  .setChunkTextCol("target")
  .setChunkBeginCol("char_begin")
  .setChunkEndCol("char_end")
  .setOutputTokenBeginCol("token_begin")
  .setOutputTokenEndCol("token_end")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  sentenceDetector,
  tokenizer,
  converter
))

val data = Seq(
    ("Tom Martin worked as Cadence's CTO until 2010", "Cadence's CTO", 21, 33),
    ("Mrs. Charles was before Managing Director at a big consultancy company", "Managing Director", 24, 40)
).toDF("text", "target", "char_begin", "char_end")

val results = pipeline.fit(data).transform(data)

+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
|target           |char_begin|char_end|token_begin|token_end|begin|end |begin_result|end_result|result             |
+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
|Cadence's CTO    |21        |33      |4          |4        |[21] |[29]|Cadence's   |Cadence's |[Cadence's CTO]    |
|Managing Director|24        |40      |5          |5        |[24] |[31]|Managing    |Managing  |[Managing Director]|
+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("tokens")

val converter = new AssertionChunkConverter()
  .setInputCols("tokens")
  .setOutputCol("chunk")
  .setChunkTextCol("target")
  .setChunkBeginCol("char_begin")
  .setChunkEndCol("char_end")
  .setOutputTokenBeginCol("token_begin")
  .setOutputTokenEndCol("token_end")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  sentenceDetector,
  tokenizer,
  converter
))

val data = Seq(
    ("This Agreement may be executed by different parties hereto", "parties", 44, 50),
    ("The Administrative Agent will determine the Dollar Equivalent amount", "Agent", 19, 23)
).toDF("text", "target", "char_begin", "char_end")
 
val results = pipeline.fit(data).transform(data)

+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|target |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target |chunk                                          |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|parties|44        |50      |7          |6        |parties                   |different               |parties|[{chunk, 44, 42, parties, {sentence -> 0}, []}]|
|Agent  |19        |23      |2          |1        |Agent                     |Administrative          |Agent  |[{chunk, 19, 17, Agent, {sentence -> 0}, []}]  |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+

AssertionDL

AssertionDL is a deep-learning-based approach used to extract Assertion Status from extracted entities and text. AssertionDLModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained from, e.g., a DocumentAssembler, NerConverter and WordEmbeddingsModel. The result is an assertion status annotation for each recognized entity. Possible values include “present”, “absent”, “hypothetical”, “conditional”, “associated_with_other_person”, etc.

Parameters:

  • inputCols: Gets current column names of input annotations.

  • outputCol: Gets output column name of annotations.

  • ScopeWindow: Sets the scope of the window of the assertion expression.

  • EntityAssertionCaseSensitive: Sets the case sensitivity of entities and assertion labels.

  • DoExceptionHandling: If set to True, the annotator tries to process the data as usual; if exception-causing data (e.g. a corrupted record or document) is passed to the annotator, an exception warning containing the exception message is emitted.
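
As a rough illustration of what a scope window means, the sketch below trims a tokenized sentence to a fixed number of tokens on each side of a chunk. It is plain Python, not the model's code: the function name, the `(left, right)` tuple, and the use of a negative value to mean "no limit on that side" are assumptions made for illustration.

```python
def tokens_in_scope(tokens, chunk_begin, chunk_end, scope_window=(-1, -1)):
    """Return the tokens kept around a chunk spanning tokens[chunk_begin..chunk_end] (inclusive).

    scope_window is (left, right): how many tokens to keep before and after the chunk.
    Assumption: a negative value means "no limit" on that side.
    """
    left, right = scope_window
    start = 0 if left < 0 else max(0, chunk_begin - left)
    stop = len(tokens) if right < 0 else min(len(tokens), chunk_end + 1 + right)
    return tokens[start:stop]

tokens = "Patient shows no stomach pain".split()
# "stomach pain" covers token indices 3..4
print(tokens_in_scope(tokens, 3, 4, scope_window=(1, 0)))  # ['no', 'stomach', 'pain']
print(tokens_in_scope(tokens, 3, 4))                       # the whole sentence
```

In this toy example a window of `(0, 0)` would keep only `['stomach', 'pain']` and drop the cue word "no", which hints at why the window size can matter for the predicted status.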

For pretrained models please see the Models Hub for available models.

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionDLModel Scala API: AssertionDLModel Notebook: AssertionDLModelNotebook

from johnsnowlabs import nlp, medical
# Define pipeline stages to extract NER chunks first
documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

nerModel = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

nerConverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")

# Then a pretrained AssertionDLModel is used to extract the assertion status
clinicalAssertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
  .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
  .setOutputCol("assertion")

assertionPipeline = nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion
])

data = spark.createDataFrame([
  ["Patient with severe fever and sore throat"],
  ["Patient shows no stomach pain"],
  ["She was maintained on an epidural and PCA for pain control."]]).toDF("text")


# Show results
result = assertionPipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as chunk_result", "assertion.result as assertion_result").show(3, truncate=False)

+--------------------------------+--------------------------------+
|chunk_result                    |assertion_result                |
+--------------------------------+--------------------------------+
|[severe fever, sore throat]     |[present, present]              |
|[stomach pain]                  |[absent]                        |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector =  nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer =  nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =  nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
    
pipeline = nlp.Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion
    ])

data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")


# Show results
result = pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata, result.assertion.result)).alias("cols"))\
      .select(F.expr("cols['1']['sentence']").alias("sent_id"),
              F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['2']").alias("assertion")).show(truncate=False)

+-------+------------+---------+----------+
|sent_id|chunk       |ner_label|assertion |
+-------+------------+---------+----------+
|0      |McAfee LLC  |ORG      |COMPETITOR|
|0      |Broadcom Inc|ORG      |COMPETITOR|
+-------+------------+---------+----------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings_ner")

ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings_ner"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DOC", "EFFDATE", "PARTY"])

embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings_ass")

assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings_ass"]) \
    .setOutputCol("assertion")

nlpPipeline = nlp.Pipeline(stages=[
            document_assembler, 
            sentence_detector,
            tokenizer,
            embeddings_ner,
            ner_model,
            ner_converter,
            embeddings_ass,
            assertion
            ])

data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")


# Show results
result = nlpPipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,  
                                     result.ner_chunk.begin, 
                                     result.ner_chunk.end, 
                                     result.ner_chunk.metadata, 
                                     result.assertion.result)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['4']").alias("assertion")).show(truncate=False)

+-------------------------------+-----+---+---------+---------+
|chunk                          |begin|end|ner_label|assertion|
+-------------------------------+-----+---+---------+---------+
|Intellectual Property Agreement|11   |41 |DOC      |PRESENT  |
|Amazon Inc                     |51   |60 |PARTY    |PRESENT  |
|Atlantic Inc                   |67   |78 |PARTY    |PRESENT  |
+-------------------------------+-----+---+---------+---------+
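
The `begin` and `end` columns in the table above appear to be character offsets into the input text, with `end` inclusive. The following plain-Python check (no Spark involved) tests that reading against the rows shown. The `rows` list is the table copied by hand, and `slice_chunk` is a hypothetical helper name used only for this illustration.

```python
# Plain-Python check: treat begin/end from the table as character offsets
# into the input text, with an inclusive end.
text = "This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."

# (chunk, begin, end) copied by hand from the output table.
rows = [
    ("Intellectual Property Agreement", 11, 41),
    ("Amazon Inc", 51, 60),
    ("Atlantic Inc", 67, 78),
]


def slice_chunk(source, begin, end):
    """Return the substring for an inclusive [begin, end] character span."""
    return source[begin:end + 1]


# Recover each chunk from its offsets and compare with the reported chunk text.
recovered = [slice_chunk(text, begin, end) for _, begin, end in rows]
matches = [got == chunk for got, (chunk, _, _) in zip(recovered, rows)]
```

If every entry of `matches` is true, slicing with `end + 1` is the right way to map these annotations back onto the original string.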

import spark.implicits._
// Define pipeline stages to extract NER chunks first

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val nerModel = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val nerConverter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

// Then a pretrained AssertionDLModel is used to extract the assertion status
val clinicalAssertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings"))
    .setOutputCol("assertion")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion
))

val data = Seq(
  "Patient with severe fever and sore throat",
  "Patient shows no stomach pain",
  "She was maintained on an epidural and PCA for pain control.").toDF("text")
  

// Show results
val result = assertionPipeline.fit(data).transform(data)

+--------------------------------+--------------------------------+
|chunk_result                    |assertion_result                |
+--------------------------------+--------------------------------+
|[severe fever, sore throat]     |[present, present]              |
|[stomach pain]                  |[absent]                        |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector =  new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer =  new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings =  BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val assertion = AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings"))
    .setOutputCol("assertion")
    
val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion
    ))

val data = Seq("Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.").toDF("text")


// Show results
val result = pipeline.fit(data).transform(data)

+-------+------------+---------+----------+
|sent_id|chunk       |ner_label|assertion |
+-------+------------+---------+----------+
|0      |McAfee LLC  |ORG      |COMPETITOR|
|0      |Broadcom Inc|ORG      |COMPETITOR|
+-------+------------+---------+----------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings_ner = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ner")

val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings_ner"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("DOC", "EFFDATE", "PARTY"))

val embeddings_ass = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ass")

val assertion = AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings_ass"))
    .setOutputCol("assertion")
    
val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings_ner,
    ner_model,
    ner_converter,
    embeddings_ass,
    assertion
    ))

val data = Seq("This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc.").toDF("text")


// Show results
val result = pipeline.fit(data).transform(data)

+-------------------------------+-----+---+---------+---------+
|chunk                          |begin|end|ner_label|assertion|
+-------------------------------+-----+---+---------+---------+
|Intellectual Property Agreement|11   |41 |DOC      |PRESENT  |
|Amazon Inc                     |51   |60 |PARTY    |PRESENT  |
|Atlantic Inc                   |67   |78 |PARTY    |PRESENT  |
+-------------------------------+-----+---+---------+---------+

Trains AssertionDL, a deep-learning-based approach for extracting Assertion Status from extracted entities and text. Contains all the methods for training an AssertionDLModel. For pretrained models, please use AssertionDLModel and see the Models Hub for available models.

Parameters:

  • inputCols: Gets current column names of input annotations.

  • outputCol: Gets output column name of annotations.

  • ScopeWindow: Sets the scope of the window of the assertion expression.

  • StartCol: Set a column that contains the token number for the start of the target.

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION
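
The training dataset is expected to carry the token positions of each target in the columns named by `setStartCol` and `setEndCol`. As a rough sketch, such positions could be derived as shown here. It assumes simple whitespace tokenization, case-insensitive matching, and an inclusive end index; the Spark NLP `Tokenizer` may split punctuation differently, so the indices should be checked against the real tokens. The helper name `find_token_span` is hypothetical.

```python
from typing import Optional, Tuple


def find_token_span(text: str, target: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices of the first occurrence of target.

    Assumptions (not taken from the library): whitespace tokenization,
    case-insensitive matching, inclusive end index. Returns None if absent.
    """
    tokens = text.lower().split()
    target_tokens = target.lower().split()
    n = len(target_tokens)
    if n == 0:
        return None
    # Slide a window of n tokens over the text and compare with the target.
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target_tokens:
            return i, i + n - 1
    return None


span = find_token_span("Patient shows no stomach pain", "stomach pain")
```

The resulting pair can be written to the `start` and `end` columns of the training DataFrame before fitting the pipeline.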

Python API: AssertionDLApproach Scala API: AssertionDLApproach Notebook: AssertionDLApproachNotebook

from johnsnowlabs import nlp, medical

# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk") \
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(True)

token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Define AssertionDLApproach with parameters and start training
assertionStatus = medical.AssertionDLApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setBatchSize(128) \
    .setDropout(0.012) \
    .setLearningRate(0.015) \
    .setEpochs(1) \
    .setStartCol("start") \
    .setEndCol("end") \
    .setMaxSentLen(250)

trainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    assertionStatus
])

assertionResults = trainingPipeline.fit(data).transform(data).cache()
from johnsnowlabs import nlp, finance

# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk")

token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Define AssertionDLApproach with parameters and start training
assertionStatus = finance.AssertionDLApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setBatchSize(128) \
    .setDropout(0.012) \
    .setLearningRate(0.015) \
    .setEpochs(1) \
    .setStartCol("start") \
    .setEndCol("end") \
    .setMaxSentLen(250)

trainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    assertionStatus
])

assertionResults = trainingPipeline.fit(data).transform(data).cache()
from johnsnowlabs import nlp, legal

# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("document")

chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("doc_chunk")

token = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

# Define AssertionDLApproach with parameters and start training
assertionStatus = legal.AssertionDLApproach()\
    .setLabelCol("assertion_label")\
    .setInputCols(["document", "doc_chunk", "embeddings"])\
    .setOutputCol("assertion")\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setEpochs(2)\
    .setStartCol("tkn_start")\
    .setEndCol("tkn_end")\
    .setMaxSentLen(1200)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
    .setScopeWindow(scope_window)
    #.setValidationSplit(0.2)\    
    #.setDropout(0.1)\    

trainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    roberta_embeddings,
    assertionStatus
])

assertionResults = trainingPipeline.fit(data).transform(data).cache()

import spark.implicits._

// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val chunk = new Doc2Chunk()
  .setInputCols(Array("document"))
  .setOutputCol("chunk")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "embeddings"))
  .setOutputCol("assertion")
  .setBatchSize(128)
  .setDropout(0.012)
  .setLearningRate(0.015)
  .setEpochs(1)
  .setStartCol("start")
  .setEndCol("end")
  .setMaxSentLen(250)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  embeddings,
  assertionStatus
))

val assertionResults = trainingPipeline.fit(data).transform(data).cache()
import spark.implicits._
// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val chunk = new Doc2Chunk()
  .setInputCols(Array("document"))
  .setOutputCol("chunk")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "embeddings"))
  .setOutputCol("assertion")
  .setBatchSize(128)
  .setDropout(0.012)
  .setLearningRate(0.015)
  .setEpochs(1)
  .setStartCol("start")
  .setEndCol("end")
  .setMaxSentLen(250)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  embeddings,
  assertionStatus
))

val assertionResults = trainingPipeline.fit(data).transform(data).cache()
import spark.implicits._

val document = new DocumentAssembler()
    .setInputCol("sentence")
    .setOutputCol("document")

val chunk = new Doc2Chunk()
    .setInputCols(Array("document"))
    .setOutputCol("doc_chunk")
    .setChunkCol("chunk")
    .setStartCol("tkn_start")
    .setStartColByTokenIndex(true)
    .setFailOnMissing(false)
    .setLowerCase(false)

val token = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") 
    .setInputCols(Array("document", "token")) 
    .setOutputCol("embeddings") 
    .setMaxSentenceLength(512)

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
    .setLabelCol("assertion_label")
    .setInputCols(Array("document", "doc_chunk", "embeddings"))
    .setOutputCol("assertion")
    .setBatchSize(128)
    .setLearningRate(0.001)
    .setEpochs(2)
    .setStartCol("tkn_start")
    .setEndCol("tkn_end")
    .setMaxSentLen(1200)
    .setEnableOutputLogs(true)
    .setOutputLogsPath("training_logs/")
    .setGraphFolder(graph_folder)
    .setGraphFile(s"$graph_folder/assertion_graph.pb")
    // ReadAs is assumed to be com.johnsnowlabs.nlp.util.io.ReadAs
    .setTestDataset(path = "test_data.parquet", readAs = ReadAs.SPARK, options = Map("format" -> "parquet"))
    .setScopeWindow(scope_window)
    //.setValidationSplit(0.2)
    //.setDropout(0.1)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  roberta_embeddings,
  assertionStatus
))

val assertionResults = trainingPipeline.fit(data).transform(data).cache()

AssertionFilterer

Model

Filters entities coming from ASSERTION type annotations and returns the CHUNKS. Filters can be set via a white list on the extracted chunk, the assertion or a regular expression. White list for assertion is enabled by default. To use chunk white list, criteria has to be set to "isin". For regex, criteria has to be set to "regex".

Parameters:

  • whiteList: (list) If defined, list of entities to process. The rest will be ignored.

  • CaseSensitive: (bool) Determines whether the definitions of the white listed entities are case sensitive.

  • regex: (list) List of dash-separated pairs of named entities.

  • criteria: (str) Tag representing the criterion used to filter the chunks. Possible values are assertion (filter by the assertion), isIn (filter by the chunk) and regex (filter using a regex).

  • entitiesConfidence: (Str) Entity pairs to remove based on the confidence level.

Input Annotator Types: DOCUMENT, CHUNK, ASSERTION

Output Annotator Type: CHUNK
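
To make the white-list behaviour concrete, here is a plain-Python sketch of filtering chunks by their assertion label. It only mirrors the description above (a white list applied to the assertion, with optional case sensitivity). It is not the library's implementation, and the function name `filter_by_assertion` is hypothetical.

```python
from typing import List, Sequence


def filter_by_assertion(
    chunks: Sequence[str],
    assertions: Sequence[str],
    white_list: Sequence[str],
    case_sensitive: bool = False,
) -> List[str]:
    """Keep the chunks whose assertion label appears in white_list."""
    if len(chunks) != len(assertions):
        raise ValueError("chunks and assertions must be aligned one-to-one")
    # Normalise labels only when matching is case-insensitive.
    norm = (lambda s: s) if case_sensitive else (lambda s: s.lower())
    allowed = {norm(label) for label in white_list}
    return [c for c, a in zip(chunks, assertions) if norm(a) in allowed]


chunks = ["a headache", "a head CT", "anxious", "Alopecia", "pain"]
assertions = ["Present", "Hypothetical", "Possible", "Present", "Absent"]
kept = filter_by_assertion(chunks, assertions, white_list=["present"])
```

With case-insensitive matching, the lower-case entry `"present"` still selects the chunks labelled `Present`. With `case_sensitive=True` the same white list would select nothing.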

Python API: AssertionFilterer Scala API: AssertionFilterer Notebook: AssertionFiltererNotebook

from johnsnowlabs import nlp, medical 

# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    #.setIncludeAllConfidenceScores(False)

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["PROBLEM", "TEST","TREATMENT"])

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Present"])
# or .setBlackList(["absent"])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

data = spark.createDataFrame([["Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain."]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)

# Show results:

result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+------------------------------------------------+--------------------------------------------------+
|ner_chunk                                       |assertion                                         |
+------------------------------------------------+--------------------------------------------------+
|[a headache, a head CT, anxious, Alopecia, pain]|[Present, Hypothetical, Possible, Present, Absent]|
+------------------------------------------------+--------------------------------------------------+

result.select("assertion_filtered.result").show(3, truncate=False)
+----------------------+
|result                |
+----------------------+
|[a headache, Alopecia]|
+----------------------+
from johnsnowlabs import nlp, finance 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector =  nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer =  nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =  nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = finance.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Competitor"])

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion,
    assertion_filterer
    ])

data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")

# Show results

result = pipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+--------------------------+------------------------+
|ner_chunk                 |assertion               |
+--------------------------+------------------------+
|[McAfee LLC, Broadcom Inc]|[COMPETITOR, COMPETITOR]|
+--------------------------+------------------------+

result.select("assertion_filtered.result").show(3, truncate=False)
+--------------------------+
|result                    |
+--------------------------+
|[McAfee LLC, Broadcom Inc]|
+--------------------------+
from johnsnowlabs import nlp, legal 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings_ner")

ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings_ner"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DOC", "EFFDATE", "PARTY"])

embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings_ass")

assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings_ass"]) \
    .setOutputCol("assertion")

assertion_filterer = legal.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Present"])


nlpPipeline = nlp.Pipeline(stages=[
            document_assembler,
            sentence_detector,
            tokenizer,
            embeddings_ner,
            ner_model,
            ner_converter,
            embeddings_ass,
            assertion,
            assertion_filterer
            ])

data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")

# Show results

result = nlpPipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+-----------------------------------------------------------+---------------------------+
|ner_chunk                                                  |assertion                  |
+-----------------------------------------------------------+---------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|[PRESENT, PRESENT, PRESENT]|
+-----------------------------------------------------------+---------------------------+

result.select("assertion_filtered.result").show(3, truncate=False)
+-----------------------------------------------------------+
|result                                                     |
+-----------------------------------------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|
+-----------------------------------------------------------+

import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentenceDetector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// Clinical word embeddings trained on PubMED dataset
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") 
    .setInputCols(Array("sentence", "token", "embeddings")) 
    .setOutputCol("ner")
    //.setIncludeAllConfidenceScores(false)

val ner_converter = new NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("PROBLEM", "TEST","TREATMENT"))

val clinical_assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") 
    .setInputCols(Array("sentence", "ner_chunk", "embeddings")) 
    .setOutputCol("assertion")

val assertion_filterer = new AssertionFilterer()
    .setInputCols("sentence","ner_chunk","assertion")
    .setOutputCol("assertion_filtered")
    .setCaseSensitive(false)
    .setWhiteList(Array("Present"))
//or .setBlackList(Array("absent"))

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
))


val text ="""Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain."""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

// Show results:

+------------------------------------------------+--------------------------------------------------+
|ner_chunk                                       |assertion                                         |
+------------------------------------------------+--------------------------------------------------+
|[a headache, a head CT, anxious, Alopecia, pain]|[Present, Hypothetical, Possible, Present, Absent]|
+------------------------------------------------+--------------------------------------------------+

+----------------------+
|result                |
+----------------------+
|[a headache, Alopecia]|
+----------------------+
import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentence_detector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// BERT embeddings for financial text (bert_embeddings_sec_bert_base)
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models") 
    .setInputCols(Array("sentence", "token", "embeddings")) 
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")

val assertion = AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models") 
    .setInputCols(Array("sentence", "ner_chunk", "embeddings")) 
    .setOutputCol("assertion")

val assertion_filterer = new AssertionFilterer()
    .setInputCols("sentence","ner_chunk","assertion")
    .setOutputCol("assertion_filtered")
    .setCaseSensitive(false)
    .setWhiteList(Array("Competitor"))


val nlpPipeline = new Pipeline().setStages(Array(
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        assertion,
        assertion_filterer
))

val text ="""Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

// Show results:
+--------------------------+------------------------+
|ner_chunk                 |assertion               |
+--------------------------+------------------------+
|[McAfee LLC, Broadcom Inc]|[COMPETITOR, COMPETITOR]|
+--------------------------+------------------------+

+--------------------------+
|result                    |
+--------------------------+
|[McAfee LLC, Broadcom Inc]|
+--------------------------+
import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// RoBERTa embeddings for legal text (roberta_embeddings_legal_roberta_base)
val embeddings_ner = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ner")

val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings_ner")) 
    .setOutputCol("ner")
    //.setIncludeAllConfidenceScores(false)

val ner_converter = new NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("DOC", "EFFDATE", "PARTY"))

val embeddings_ass = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") 
    .setInputCols(Array("sentence", "token")) 
    .setOutputCol("embeddings_ass")

val assertion = AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings_ass")) 
    .setOutputCol("assertion")

val assertion_filterer = new AssertionFilterer()
    .setInputCols("sentence","ner_chunk","assertion")
    .setOutputCol("assertion_filtered")
    .setCaseSensitive(false)
    .setWhiteList(Array("Present"))


val nlpPipeline = new Pipeline().setStages(Array(
      document_assembler,
      sentence_detector,
      tokenizer,
      embeddings_ner,
      ner_model,
      ner_converter,
      embeddings_ass,
      assertion,
      assertion_filterer
))


val text ="""This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

// Show results:
+-----------------------------------------------------------+---------------------------+
|ner_chunk                                                  |assertion                  |
+-----------------------------------------------------------+---------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|[PRESENT, PRESENT, PRESENT]|
+-----------------------------------------------------------+---------------------------+

+-----------------------------------------------------------+
|result                                                     |
+-----------------------------------------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|
+-----------------------------------------------------------+

AssertionLogReg

Model

This is the main class of the AssertionLogReg family. Logistic Regression is used to extract Assertion Status from extracted entities and text. AssertionLogRegModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained by e.g. a DocumentAssembler, NerConverter and WordEmbeddingsModel. The result is an assertion status annotation for each recognized entity. Possible values are "Negated", "Affirmed" and "Historical".

Unlike the DL Model, this class does not extend AnnotatorModel. Instead, it extends RawAnnotator, which is why the main point of interest is the transform() method.

At the moment there are no pretrained models available for this class. Please refer to AssertionLogRegApproach to train your own model.

Parameters:

  • setAfter(Int): Length of the context after the target (Default: 13)

  • setBefore(Int): Length of the context before the target (Default: 11)

  • setEndCol(String): Column that contains the token number for the end of the target

  • setStartCol(String): Column that contains the token number for the start of the target

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionLogRegModel Scala API: AssertionLogRegModel Notebook: AssertionLogRegModelNotebook
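
To make the before, after, startCol and endCol parameters more concrete, the following is a minimal sketch of how a context window around a target could be cut out of a token list. The helper `context_window` is hypothetical and is not part of the library, and treating the end index as inclusive is an assumption made for illustration only.

```python
from typing import List, Tuple


def context_window(tokens: List[str], start: int, end: int,
                   before: int = 11, after: int = 13
                   ) -> Tuple[List[str], List[str], List[str]]:
    """Split tokens into (left context, target, right context).

    `start` and `end` play the role of the values read from startCol and
    endCol. Assumption: both are token indices and `end` is inclusive.
    """
    if not 0 <= start <= end < len(tokens):
        raise ValueError("start/end must be valid, ordered token indices")
    left = tokens[max(0, start - before):start]   # at most `before` tokens
    target = tokens[start:end + 1]
    right = tokens[end + 1:end + 1 + after]       # at most `after` tokens
    return left, target, right


tokens = "The patient denies any chest pain or shortness of breath".split()
left, target, right = context_window(tokens, start=4, end=5, before=3, after=2)
print(left)    # ['patient', 'denies', 'any']
print(target)  # ['chest', 'pain']
print(right)   # ['or', 'shortness']
```

With the defaults (11 tokens before, 13 after), short sentences are simply covered in full, because the slices stop at the sentence boundaries.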

Approach

AssertionLogRegApproach trains a classification model that uses the Logistic Regression algorithm. It is used to extract Assertion Status from extracted entities and text. It contains all the methods for training an AssertionLogRegModel, together with trainWithChunk and trainWithStartEnd.

Parameters:

  • label : Column with label per each token

  • maxIter: This specifies the maximum number of iterations to be performed in the model’s training, default: 26

  • regParam : This specifies the regularization parameter. Regularization helps to control the complexity of the model, aiding in preventing the issue of overfitting.

  • eNetParam : Elastic net parameter

  • beforeParam : Length of the context before the target

  • afterParam : Length of the context after the target

  • startCol : Column that contains the token number for the start of the target

  • endCol : Column that contains the token number for the end of the target

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionLogRegApproach Scala API: AssertionLogRegApproach Notebook: AssertionLogRegApproachNotebook
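
As a rough illustration of how regParam and eNetParam interact, the sketch below computes an elastic net penalty in the form used by Spark ML's logistic regression. That AssertionLogRegApproach follows exactly this convention is an assumption, and the function name `elastic_net_penalty` is hypothetical.

```python
from typing import Sequence


def elastic_net_penalty(weights: Sequence[float], reg_param: float,
                        e_net_param: float) -> float:
    """Penalty added to the training loss (Spark ML style, assumed here).

    e_net_param = 0.0 gives a pure L2 (ridge) penalty,
    e_net_param = 1.0 gives a pure L1 (lasso) penalty.
    """
    if not 0.0 <= e_net_param <= 1.0:
        raise ValueError("e_net_param must lie in [0, 1]")
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return reg_param * (e_net_param * l1 + (1.0 - e_net_param) / 2.0 * l2)


weights = [1.0, -2.0]  # made-up model weights
print(elastic_net_penalty(weights, reg_param=0.01, e_net_param=0.0))  # pure L2
print(elastic_net_penalty(weights, reg_param=0.01, e_net_param=1.0))  # pure L1
```

A larger regParam shrinks the weights more strongly, while eNetParam only decides how that shrinkage is split between the L1 and L2 terms.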

from johnsnowlabs import nlp, medical

# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = medical.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

# The training dataset is assumed to contain the columns "text", "target", "label", "start" and "end".
assertionPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    tokenizer,
    glove,
    chunkAssembler,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import nlp, finance

# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = finance.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

# The training dataset is assumed to contain the columns "text", "target", "label", "start" and "end".
assertionPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    tokenizer,
    glove,
    chunkAssembler,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import nlp, legal

# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = legal.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

# The training dataset is assumed to contain the columns "text", "target", "label", "start" and "end".
assertionPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    tokenizer,
    glove,
    chunkAssembler,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)

import spark.implicits._

// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val glove = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")
  .setCaseSensitive(false)

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "word_embeddings"))
  .setOutputCol("assertion")
  .setReg(0.01)
  .setBefore(11)
  .setAfter(13)
  .setStartCol("start")
  .setEndCol("end")

// The training dataset is assumed to contain the columns "text", "target", "label", "start" and "end".
val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  glove,
  chunkAssembler,
  assertion
))

val assertionModel = assertionPipeline.fit(dataset)

AverageEmbeddings

Model

AverageEmbeddings computes the mean of vector embeddings for two sentences of equal size, producing a unified representation.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the output column that is generated. Only one column can be specified.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().

Input Annotator Types: SENTENCE_EMBEDDINGS, SENTENCE_EMBEDDINGS, CHUNK

Output Annotator Type: EMBEDDINGS

Python API: AverageEmbeddings Scala API: AverageEmbeddings Notebook: AverageEmbeddingsNotebook
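
The averaging itself can be sketched in plain Python as an element-wise mean of two equal-length vectors. The function name `average_embeddings` is hypothetical; the input numbers are the first three values of the two sentence embeddings shown in the result table of the example on this page.

```python
from typing import List, Sequence


def average_embeddings(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise mean of two embedding vectors of equal size."""
    if len(a) != len(b):
        raise ValueError("both embeddings must have the same dimension")
    return [(x + y) / 2.0 for x, y in zip(a, b)]


# First three values of each sentence embedding from the example output.
sent_biobert_clinical_base_cased = [-0.07857181, -0.061015874, -0.020198729]
sbiobert_base_cased_mli = [0.7279085, 0.3109715, -0.38454503]

averaged = average_embeddings(sent_biobert_clinical_base_cased,
                              sbiobert_base_cased_mli)
print(averaged)
```

Up to floating-point rounding, the printed values agree with the start of the embeddings column in that table (0.32466835, 0.12497781, -0.20237188).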

from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

doc2Chunk = nlp.Doc2Chunk() \
    .setInputCols("sentence") \
    .setOutputCol("chunk") \
    .setIsArray(True)

sbiobert_base_cased_mli = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols("sentence")\
    .setOutputCol("sbiobert_base_cased_mli")

sent_biobert_clinical_base_cased = nlp.BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en") \
    .setInputCols("sentence") \
    .setOutputCol("sent_biobert_clinical_base_cased")

avg_embeddings = medical.AverageEmbeddings()\
    .setInputCols(["sent_biobert_clinical_base_cased","sbiobert_base_cased_mli","chunk"])\
    .setOutputCol("embeddings")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        doc2Chunk,
        sbiobert_base_cased_mli,
        sent_biobert_clinical_base_cased,
        avg_embeddings
    ])

data = spark.createDataFrame([[" The patient was prescribed 1 capsule of Advil for 5 days "]]).toDF("text")

result = pipeline.fit(data).transform(data)

result_df = result.select(F.explode(F.arrays_zip(result.sentence.result,
                                                 result.chunk.metadata,
                                                 result.chunk.result,
                                                 result.embeddings.embeddings,
                                                 result.sent_biobert_clinical_base_cased.embeddings,
                                                 result.sbiobert_base_cased_mli.embeddings,)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("sentence"),
                          F.expr("cols['1']").alias("sentence_metadata"),
                          F.expr("cols['2']").alias("chunk"),
                          F.expr("cols['3']").alias("embeddings"),
                          F.expr("cols['4']").alias("sent_biobert_clinical_base_cased"),
                          F.expr("cols['5']").alias("sbiobert_base_cased_mli"))

result_df.show(50, truncate=1000)

## Result

+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                          sentence|          sentence_metadata|                                             chunk|                                        embeddings|                  sent_biobert_clinical_base_cased|                           sbiobert_base_cased_mli|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|{sentence -> 0, chunk -> 0}|The patient was prescribed 1 capsule of Advil f...|[0.32466835, 0.12497781, -0.20237188, 0.3716198...|[-0.07857181, -0.061015874, -0.020198729, 0.177...|[0.7279085, 0.3109715, -0.38454503, 0.5657965, ...|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val doc2Chunk = new Doc2Chunk()
  .setInputCols("sentence")
  .setOutputCol("chunk")
  .setIsArray(true)

val sbiobert_base_cased_mli = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
  .setInputCols("sentence")
  .setOutputCol("sbiobert_base_cased_mli")

val sent_biobert_clinical_base_cased = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased","en")
  .setInputCols("sentence")
  .setOutputCol("sent_biobert_clinical_base_cased")

val avg_embeddings = new AverageEmbeddings()
  .setInputCols(Array("sent_biobert_clinical_base_cased","sbiobert_base_cased_mli","chunk"))
  .setOutputCol("embeddings") 

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector, 
    doc2Chunk, 
    sbiobert_base_cased_mli, 
    sent_biobert_clinical_base_cased, 
    avg_embeddings)) 

val data = Seq(" The patient was prescribed 1 capsule of Advil for 5 days").toDF("text")

val result = pipeline.fit(data).transform(data)

// Show results
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                          sentence|          sentence_metadata|                                             chunk|                                        embeddings|                  sent_biobert_clinical_base_cased|                           sbiobert_base_cased_mli|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|{sentence -> 0, chunk -> 0}|The patient was prescribed 1 capsule of Advil f...|[0.32466835, 0.12497781, -0.20237188, 0.3716198...|[-0.07857181, -0.061015874, -0.020198729, 0.177...|[0.7279085, 0.3109715, -0.38454503, 0.5657965, ...|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+

BertForSequenceClassification

Model

BertForSequenceClassification can load Bert Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

Parameters:

  • batchSize: Size of every batch (default: 8).

  • coalesceSentences: Instead of 1 class per sentence (if inputCols is “sentence”), output 1 class per document by averaging probabilities in all sentences (default: False).

  • maxSentenceLength: Max sentence length to process (default: 128).

  • caseSensitive: Whether to ignore case in tokens for embeddings matching (default: True).

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: BertForSequenceClassification Scala API: BertForSequenceClassification
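
The effect of coalesceSentences can be pictured with a small sketch: per-sentence class probabilities are averaged and a single label is returned for the whole document. The helper `coalesce_sentences` and the probabilities below are made up for illustration and are not taken from the library.

```python
from typing import List, Sequence, Tuple


def coalesce_sentences(sentence_probs: Sequence[Sequence[float]],
                       labels: Sequence[str]) -> Tuple[str, List[float]]:
    """Average class probabilities over sentences; return (label, mean probs)."""
    if not sentence_probs:
        raise ValueError("need at least one sentence")
    n = len(sentence_probs)
    mean_probs = [sum(p[i] for p in sentence_probs) / n
                  for i in range(len(labels))]
    best = max(range(len(labels)), key=lambda i: mean_probs[i])
    return labels[best], mean_probs


labels = ["False", "True"]
# Hypothetical probabilities for a two-sentence document.
per_sentence = [[0.9, 0.1], [0.4, 0.6]]

doc_label, doc_probs = coalesce_sentences(per_sentence, labels)
print(doc_label)  # False
```

Taken one sentence at a time, the second sentence would be labeled "True"; after averaging, the document as a whole receives the single label "False".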

from johnsnowlabs import nlp, medical
 
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("classes")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])


text =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
     ["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]

data = spark.createDataFrame(text).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("text", "classes.result").show(2,truncate=100)

| text                                                                                           | result |
|------------------------------------------------------------------------------------------------|-------|
| Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair... | [False] |
| Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep... | [False] |

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("classes")

val pipeline =  new Pipeline().setStages(Array(
    document_assembler, 
    tokenizer, 
    sequenceClassifier))

val text = Seq(
  "Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH.",
  "Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."
)

// One row per input text
val data = text.toDF("text")

val result = pipeline.fit(data).transform(data)

| text                                                                                           | result |
|------------------------------------------------------------------------------------------------|-------|
| Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair... | [False] |
| Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep... | [False] |

BertForTokenClassifier

Model

BertForTokenClassifier can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) for Named-Entity-Recognition (NER) tasks.

Parameters:

  • CaseSensitive (Boolean): Whether to lowercase tokens or not (Default: False).

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: BertForTokenClassifier Scala API: BertForTokenClassifier
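
The idea of a token classification head, a linear layer applied to each token's hidden state, can be sketched as follows. All names, weights, hidden states and the IOB-style labels are made up for illustration; a real model learns the weights and uses much larger vectors.

```python
from typing import List, Sequence


def classify_tokens(hidden_states: Sequence[Sequence[float]],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float],
                    labels: Sequence[str]) -> List[str]:
    """Apply one linear layer per token and pick the highest-scoring label."""
    predictions = []
    for h in hidden_states:
        # One logit per label: dot(weight_row, h) + bias
        logits = [sum(w_i * h_i for w_i, h_i in zip(row, h)) + b
                  for row, b in zip(weights, bias)]
        best = max(range(len(labels)), key=lambda i: logits[i])
        predictions.append(labels[best])
    return predictions


labels = ["O", "B-PROBLEM", "I-PROBLEM"]              # hypothetical label set
hidden_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # toy 2-dim token vectors
weights = [[-1.0, -1.0], [2.0, 0.0], [0.0, 2.0]]      # one row per label
bias = [0.0, 0.0, 0.0]

print(classify_tokens(hidden_states, weights, bias, labels))
# ['B-PROBLEM', 'I-PROBLEM', 'O']
```

In a pipeline, a converter such as NerConverterInternal then merges consecutive token labels into chunks.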

from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F
 
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

tokenClassifier = medical.BertForTokenClassification.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")\
    .setInputCols("token", "sentence")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline =  nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter
    ])


text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation .
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""

data = spark.createDataFrame([[text]]).toDF("text")

res = pipeline.fit(data).transform(data)

res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \
   .select(F.expr("cols['3']['sentence']").alias("sentence_id"),
           F.expr("cols['0']").alias("chunk"),
           F.expr("cols['2']").alias("end"),
           F.expr("cols['3']['entity']").alias("ner_label"))\
   .filter("ner_label!='O'")\
   .show(truncate=False)

## Result
+-----------+-----------------------------+---+---------+
|sentence_id|chunk                        |end|ner_label|
+-----------+-----------------------------+---+---------+
|0          |gestational diabetes mellitus|67 |PROBLEM  |
|0          |type two diabetes mellitus   |153|PROBLEM  |
|0          |T2DM                         |160|PROBLEM  |
|0          |HTG-induced pancreatitis     |209|PROBLEM  |
|0          |an acute hepatitis           |280|PROBLEM  |
|0          |obesity                      |294|PROBLEM  |
|0          |a body mass index            |317|TEST     |
|0          |BMI                          |323|TEST     |
|0          |polyuria                     |387|PROBLEM  |
|0          |polydipsia                   |400|PROBLEM  |
|0          |poor appetite                |416|PROBLEM  |
|0          |vomiting                     |431|PROBLEM  |
|1          |amoxicillin                  |521|TREATMENT|
|1          |a respiratory tract infection|555|PROBLEM  |
|2          |metformin                    |578|TREATMENT|
|2          |glipizide                    |590|TREATMENT|
|2          |dapagliflozin                |610|TREATMENT|
|2          |T2DM                         |619|PROBLEM  |
|2          |atorvastatin                 |636|TREATMENT|
|2          |gemfibrozil                  |652|TREATMENT|
+-----------+-----------------------------+---+---------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("ner")
    .setCaseSensitive(true)

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val pipeline =  new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector, 
    tokenizer, 
    tokenClassifier, 
    ner_converter))

val text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation .
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

// Result

+-----------+-----------------------------+---+---------+
|sentence_id|chunk                        |end|ner_label|
+-----------+-----------------------------+---+---------+
|0          |gestational diabetes mellitus|67 |PROBLEM  |
|0          |type two diabetes mellitus   |153|PROBLEM  |
|0          |T2DM                         |160|PROBLEM  |
|0          |HTG-induced pancreatitis     |209|PROBLEM  |
|0          |an acute hepatitis           |280|PROBLEM  |
|0          |obesity                      |294|PROBLEM  |
|0          |a body mass index            |317|TEST     |
|0          |BMI                          |323|TEST     |
|0          |polyuria                     |387|PROBLEM  |
|0          |polydipsia                   |400|PROBLEM  |
|0          |poor appetite                |416|PROBLEM  |
|0          |vomiting                     |431|PROBLEM  |
|1          |amoxicillin                  |521|TREATMENT|
|1          |a respiratory tract infection|555|PROBLEM  |
|2          |metformin                    |578|TREATMENT|
|2          |glipizide                    |590|TREATMENT|
|2          |dapagliflozin                |610|TREATMENT|
|2          |T2DM                         |619|PROBLEM  |
|2          |atorvastatin                 |636|TREATMENT|
|2          |gemfibrozil                  |652|TREATMENT|
+-----------+-----------------------------+---+---------+

BertSentenceChunkEmbeddings

Model

This annotator aggregates sentence embeddings with NER chunk embeddings to obtain more specific and accurate resolution codes. It averages the sentence and chunk embeddings to add contextual information to the embedding values. The input to this annotator is the context (sentence) and the NER chunks, and the output is an embedding for each chunk that can be fed to the resolver model.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

  • chunkWeight: Relative weight of chunk embeddings in comparison to sentence embeddings. The value should be between 0 and 1. The default is 0.5, which means the chunk and sentence embeddings are given equal weight.

  • maxSentenceLength: Maximum sentence length to process. The default is 128.

  • caseSensitive: Determines whether the definitions of the white listed entities are case sensitive.
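
As a rough illustration of what chunkWeight controls, the sketch below blends a chunk vector and a sentence vector with a weighted average. The formula (chunk_weight * chunk + (1 - chunk_weight) * sentence) is an assumption made for illustration, and blend_embeddings is a hypothetical helper that is not part of the library.

```python
from typing import List


def blend_embeddings(chunk_vec: List[float],
                     sentence_vec: List[float],
                     chunk_weight: float = 0.5) -> List[float]:
    """Hypothetical helper: weighted average of a chunk and a sentence embedding.

    ASSUMPTION: the blend is chunk_weight * chunk + (1 - chunk_weight) * sentence.
    """
    if not 0.0 <= chunk_weight <= 1.0:
        raise ValueError("chunk_weight must be between 0 and 1")
    if len(chunk_vec) != len(sentence_vec):
        raise ValueError("embeddings must have the same dimension")
    return [chunk_weight * c + (1.0 - chunk_weight) * s
            for c, s in zip(chunk_vec, sentence_vec)]


# Toy vectors, not real model output
chunk = [1.0, 0.0, 2.0]
sentence = [0.0, 2.0, 2.0]

equal = blend_embeddings(chunk, sentence)            # default: equal weight
chunk_only = blend_embeddings(chunk, sentence, 1.0)  # sentence context ignored
print(equal)
print(chunk_only)
```

Under this assumption, a weight closer to 1 keeps the result near the chunk embedding, while a weight closer to 0 lets the sentence context dominate.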

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().

For more information and examples of the BertSentenceChunkEmbeddings annotator, you can check the Spark NLP Workshop, in particular the notebook 24.1.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb.

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: BertSentenceChunkEmbeddings Scala API: BertSentenceChunkEmbeddings Notebook: BertSentenceChunkEmbeddingsNotebook

from johnsnowlabs import nlp, medical
# Define the pipeline

document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["document", "token"])\
      .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \
      .setInputCols(["document", "token", "word_embeddings"]) \
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['ABBR'])

sentence_chunk_embeddings = medical.BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
      .setInputCols(["document", "ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setChunkWeight(0.5)\
      .setCaseSensitive(True)
    
resolver_pipeline = nlp.Pipeline(
    stages = [
      document_assembler,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      sentence_chunk_embeddings
])


sample_text = [
"""The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991.""",
"""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. 
Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]

from pyspark.sql.types import StringType, IntegerType

df = spark.createDataFrame(sample_text, StringType()).toDF('text')
result = resolver_pipeline.fit(df).transform(df)

result.selectExpr("explode(sentence_embeddings) AS s")\
      .selectExpr("s.result", "slice(s.embeddings, 1, 5) AS averageEmbedding")\
      .show(truncate=False)

+------+--------------------------------------------------------------+
|result|averageEmbedding                                              |
+------+--------------------------------------------------------------+
|IR    |[0.11792798, 0.36022937, -1.0620842, 0.87576616, 0.5389829]   |
|CBC   |[-0.07262431, -0.671684, 0.009878114, 0.76053196, 0.4687413]  |
|AB    |[-0.2781681, -0.43619046, -0.20924012, 0.84943366, 0.40831584]|
|VDRL  |[-0.07109344, -0.20644212, 0.0367461, 0.43459156, 0.3684616]  |
|HIV   |[-0.1740405, -0.4599509, -0.041505605, 0.61368394, 0.66777927]|
+------+--------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("tokens")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
      .setInputCols(Array("document", "tokens"))
      .setOutputCol("word_embeddings")

val nerModel = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models")
      .setInputCols(Array("document", "tokens", "word_embeddings"))
      .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
      .setInputCols("document", "tokens", "ner")
      .setOutputCol("ner_chunk")
      .setWhiteList(Array("ABBR"))

val sentenceChunkEmbeddings = BertSentenceChunkEmbeddings.pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")
      .setInputCols(Array("document", "ner_chunk"))
      .setOutputCol("sentence_embeddings")
      .setChunkWeight(0.5)
      .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      wordEmbeddings,
      nerModel,
      nerConverter,
      sentenceChunkEmbeddings))

val sampleText = Seq(
  "The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991.",
  """Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. 
Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet.""")

// The DocumentAssembler reads the "text" column, one row per document
val data = sampleText.toDF("text")
val result = pipeline.fit(data).transform(data)

+------+--------------------------------------------------------------+
|result|averageEmbedding                                              |
+------+--------------------------------------------------------------+
|IR    |[0.11792798, 0.36022937, -1.0620842, 0.87576616, 0.5389829]   |
|CBC   |[-0.07262431, -0.671684, 0.009878114, 0.76053196, 0.4687413]  |
|AB    |[-0.2781681, -0.43619046, -0.20924012, 0.84943366, 0.40831584]|
|VDRL  |[-0.07109344, -0.20644212, 0.0367461, 0.43459156, 0.3684616]  |
|HIV   |[-0.1740405, -0.4599509, -0.041505605, 0.61368394, 0.66777927]|
+------+--------------------------------------------------------------+

Chunk2Token

Model

A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).

When the input is empty, an empty array is returned.

This annotator is especially convenient when using NGramGenerator annotations as inputs to WordEmbeddingsModels.
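
To make the conversion concrete, here is a minimal pure-Python sketch of the idea: n-gram chunks are built from whitespace tokens and then relabeled as tokens. The Annotation class, bigram_chunks, and chunk_to_token are hypothetical stand-ins written for illustration. They are not the Spark NLP implementation, and the real Tokenizer splits text differently from str.split().

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """Simplified, hypothetical stand-in for a Spark NLP annotation."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def bigram_chunks(text: str, delimiter: str = " ") -> List[Annotation]:
    """Whitespace-tokenize `text` and build 2-gram CHUNK annotations (inclusive offsets)."""
    tokens, pos = [], 0
    for word in text.split():
        begin = text.index(word, pos)
        end = begin + len(word) - 1
        tokens.append((begin, end, word))
        pos = end + 1
    return [
        Annotation("chunk", a[0], b[1], a[2] + delimiter + b[2],
                   {"sentence": "0", "chunk": str(i)})
        for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
    ]


def chunk_to_token(chunks: List[Annotation]) -> List[Annotation]:
    """Relabel CHUNK annotations as TOKEN; an empty input yields an empty list."""
    return [replace(c, annotator_type="token") for c in chunks]


ngrams = bigram_chunks("A 63-year-old man presents to the hospital ...", delimiter="_")
tokens = chunk_to_token(ngrams)
for t in tokens[:3]:
    print(t.annotator_type, t.begin, t.end, t.result)
# token 0 12 A_63-year-old
# token 2 16 63-year-old_man
# token 14 25 man_presents
```

Only the annotation type changes in this sketch. The offsets, text, and metadata of each chunk are carried over unchanged, which is what lets a downstream stage that expects TOKEN input consume the n-grams.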

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().

Input Annotator Types: CHUNK

Output Annotator Type: TOKEN

Python API: Chunk2Token Scala API: Chunk2Token Notebook: Chunk2TokenNotebook

from johnsnowlabs import nlp, medical 
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ngrammer = nlp.NGramGenerator() \
    .setN(2) \
    .setEnableCumulative(False) \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") \
    .setDelimiter("_")

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = medical.Chunk2Token()\
    .setInputCols(["ngrams"])\
    .setOutputCol("ngram_tokens")

trainingPipeline = nlp.Pipeline(stages=[
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token])

data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
result = trainingPipeline.fit(data).transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, False)

+----------------------------------------------------------------+
|col                                                             |
+----------------------------------------------------------------+
|{token, 0, 12, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
|{token, 2, 16, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 14, 25, man_presents, {sentence -> 0, chunk -> 2}, []}  |
|{token, 18, 28, presents_to, {sentence -> 0, chunk -> 3}, []}   |
|{token, 27, 32, to_the, {sentence -> 0, chunk -> 4}, []}        |
+----------------------------------------------------------------+

from johnsnowlabs import nlp, finance

# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ngrammer = nlp.NGramGenerator() \
    .setN(2) \
    .setEnableCumulative(False) \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") 

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = finance.Chunk2Token()\
    .setInputCols(["ngrams"])\
    .setOutputCol("ngram_tokens")

trainingPipeline = nlp.Pipeline(stages=[
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token])

data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")

result = trainingPipeline.fit(data).transform(data)
result.selectExpr("explode(ngram_tokens)").show(5, False)

+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|{token, 0, 14, Our competitors, {sentence -> 0, chunk -> 0}, []}    |
|{token, 4, 22, competitors include, {sentence -> 0, chunk -> 1}, []}|
|{token, 16, 26, include the, {sentence -> 0, chunk -> 2}, []}       |
|{token, 24, 36, the following, {sentence -> 0, chunk -> 3}, []}     |
|{token, 28, 39, following by, {sentence -> 0, chunk -> 4}, []}      |
+--------------------------------------------------------------------+

from johnsnowlabs import nlp, legal
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ngrammer = nlp.NGramGenerator() \
    .setN(2) \
    .setEnableCumulative(False) \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") 

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = legal.Chunk2Token()\
    .setInputCols(["ngrams"])\
    .setOutputCol("ngram_tokens")

trainingPipeline = nlp.Pipeline(stages=[
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token])

data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")
result = trainingPipeline.fit(data).transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, False)

+-----------------------------------------------------------------------+
|col                                                                    |
+-----------------------------------------------------------------------+
|{token, 0, 6, This is, {sentence -> 0, chunk -> 0}, []}                |
|{token, 5, 9, is an, {sentence -> 0, chunk -> 1}, []}                  |
|{token, 8, 22, an Intellectual, {sentence -> 0, chunk -> 2}, []}       |
|{token, 11, 31, Intellectual Property, {sentence -> 0, chunk -> 3}, []}|
|{token, 24, 41, Property Agreement, {sentence -> 0, chunk -> 4}, []}   |
+-----------------------------------------------------------------------+

import spark.implicits._

// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val ngrammer = new NGramGenerator()
    .setN(2)
    .setEnableCumulative(false)
    .setInputCols("token")
    .setOutputCol("ngrams")
    .setDelimiter("_")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
    .setInputCols("ngrams")
    .setOutputCol("ngram_tokens")

val trainingPipeline = new Pipeline().setStages(Array(
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token))

val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")

val result = trainingPipeline.fit(data).transform(data)

+----------------------------------------------------------------+
|col                                                             |
+----------------------------------------------------------------+
|{token, 0, 12, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
|{token, 2, 16, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 14, 25, man_presents, {sentence -> 0, chunk -> 2}, []}  |
|{token, 18, 28, presents_to, {sentence -> 0, chunk -> 3}, []}   |
|{token, 27, 32, to_the, {sentence -> 0, chunk -> 4}, []}        |
+----------------------------------------------------------------+

import spark.implicits._

// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val ngrammer = new NGramGenerator()
    .setN(2)
    .setEnableCumulative(false)
    .setInputCols("token")
    .setOutputCol("ngrams")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
    .setInputCols("ngrams")
    .setOutputCol("ngram_tokens")

val trainingPipeline = new Pipeline().setStages(Array(
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token))

val data = Seq(("Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.")).toDF("text")

val result = trainingPipeline.fit(data).transform(data)

+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|{token, 0, 14, Our competitors, {sentence -> 0, chunk -> 0}, []}    |
|{token, 4, 22, competitors include, {sentence -> 0, chunk -> 1}, []}|
|{token, 16, 26, include the, {sentence -> 0, chunk -> 2}, []}       |
|{token, 24, 36, the following, {sentence -> 0, chunk -> 3}, []}     |
|{token, 28, 39, following by, {sentence -> 0, chunk -> 4}, []}      |
+--------------------------------------------------------------------+

import spark.implicits._

// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val ngrammer = new NGramGenerator()
    .setN(2)
    .setEnableCumulative(false)
    .setInputCols("token")
    .setOutputCol("ngrams")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
    .setInputCols("ngrams")
    .setOutputCol("ngram_tokens")

val trainingPipeline = new Pipeline().setStages(Array(
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token))

val data = Seq(("This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc.")).toDF("text")

val result = trainingPipeline.fit(data).transform(data)

+-----------------------------------------------------------------------+
|col                                                                    |
+-----------------------------------------------------------------------+
|{token, 0, 6, This is, {sentence -> 0, chunk -> 0}, []}                |
|{token, 5, 9, is an, {sentence -> 0, chunk -> 1}, []}                  |
|{token, 8, 22, an Intellectual, {sentence -> 0, chunk -> 2}, []}       |
|{token, 11, 31, Intellectual Property, {sentence -> 0, chunk -> 3}, []}|
|{token, 24, 41, Property Agreement, {sentence -> 0, chunk -> 4}, []}   |
+-----------------------------------------------------------------------+

ChunkConverter

Model

Converts chunks produced by RegexMatcher into chunks that carry an entity label in their metadata.

This annotator is useful when you want to merge entities identified by NER models with the rule-based matches produced by the RegexMatcher annotator. In the following steps of the pipeline, all the identified entities can then be treated in a unified way.
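
The idea can be sketched in plain Python, outside of Spark: rule-based matches receive an entity label in their metadata, which lets them sit in the same list as NER chunks. Everything here is a hypothetical illustration. The rule is a simplified variant written for this sketch, and regex_chunks and merge_chunks are made-up helpers, not library calls.

```python
import re
from typing import Dict, List

# Hypothetical, simplified rule: runs of upper-case words ending in a colon
RULES = [(re.compile(r"\b[A-Z]+(?:\s+[A-Z]+)*:"), "SECTION_HEADER")]


def regex_chunks(text: str) -> List[Dict]:
    """Rule-based matches, each carrying its label in metadata['entity']."""
    return [
        {"begin": m.start(), "end": m.end() - 1, "result": m.group(),
         "metadata": {"entity": label}}
        for pattern, label in RULES
        for m in pattern.finditer(text)
    ]


def merge_chunks(*chunk_lists: List[Dict]) -> List[Dict]:
    """Combine chunk lists and order them by position in the text."""
    merged = [c for chunks in chunk_lists for c in chunks]
    return sorted(merged, key=lambda c: c["begin"])


text = "PROCEDURE: Excisional biopsy. EBL: 10 cc."

# Stand-in for NER output; a real pipeline would get this from a NER model
ner_chunks = [{"begin": 11, "end": 27, "result": "Excisional biopsy",
               "metadata": {"entity": "TEST"}}]

merged = merge_chunks(regex_chunks(text), ner_chunks)
for c in merged:
    print(c["result"], "->", c["metadata"]["entity"])
# PROCEDURE: -> SECTION_HEADER
# Excisional biopsy -> TEST
# EBL: -> SECTION_HEADER
```

Because both sources expose the label under the same metadata key, later stages can filter or group on a single field regardless of where a chunk came from.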

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkConverter Scala API: ChunkConverter Notebook: ChunkConverterNotebook

from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F

# Creating the pipeline
# Raw string so that \b and \s reach the rules file as regex escapes
rules = r'''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

sample_text = """
POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy.
PROCEDURE:  Excisional biopsy of right cervical lymph node.
ANESTHESIA:  General endotracheal anesthesia.
Specimen:  Right cervical lymph node.
EBL: 10 cc.
COMPLICATIONS:  None.
FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination.
FLUIDS:  Please see anesthesia report.
URINE OUTPUT:  None recorded during the case.
INDICATIONS FOR PROCEDURE:  This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient, an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node.
PROCEDURE IN DETAIL:  The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again, noted on palpation there was an enlarged level 2 cervical lymph node.A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified, and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision, and sterile bandage was applied. The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed.
"""

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models") \
    .setInputCols("sentence","token","embeddings") \
    .setOutputCol("ner")

ner_converter= medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

regex_matcher = nlp.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='/content/regex_rules.txt', delimiter=',')

chunkConverter = medical.ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

merger= medical.ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    regex_matcher,
    chunkConverter,
    merger
])

data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result, 
                                     result.merged_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)

+----------------------------------------------+--------------+
|                                         chunk| merged_entity|
+----------------------------------------------+--------------+
|                      POSTOPERATIVE DIAGNOSIS:|SECTION_HEADER|
|                      Cervical lymphadenopathy|       PROBLEM|
|                                    PROCEDURE:|SECTION_HEADER|
|Excisional biopsy of right cervical lymph node|          TEST|
|                                   ANESTHESIA:|SECTION_HEADER|
|               General endotracheal anesthesia|     TREATMENT|
|                     Right cervical lymph node|       PROBLEM|
|                                          EBL:|SECTION_HEADER|
|                                COMPLICATIONS:|SECTION_HEADER|
|                                     FINDINGS:|SECTION_HEADER|
|                   Enlarged level 2 lymph node|       PROBLEM|
|                        pathologic examination|          TEST|
|                                       FLUIDS:|SECTION_HEADER|
|                                 URINE OUTPUT:|SECTION_HEADER|
|                    INDICATIONS FOR PROCEDURE:|SECTION_HEADER|
+----------------------------------------------+--------------+

from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F

# Creating the pipeline
# Raw string so that \b and \s reach the rules file as regex escapes
rules = r'''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

sample_text="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. """

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")
    # .setWhiteList(["ORG"]) # Return only ORG entities

regex_matcher = nlp.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='/content/regex_rules.txt', delimiter=',')

chunkConverter = finance.ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

merger= finance.ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    regex_matcher,
    chunkConverter,
    merger
])

data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result, 
                                     result.merged_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)

+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+

from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F

# Creating the pipeline
# Raw string so that \b and \s reach the rules file as regex escapes
rules = r'''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

sample_text="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. """

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")
    # .setWhiteList(["ORG"]) # Return only ORG entities

regex_matcher = nlp.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='/content/regex_rules.txt', delimiter=',')

chunkConverter = legal.ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

merger= legal.ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    regex_matcher,
    chunkConverter,
    merger
])

data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result, 
                                     result.merged_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)

+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+

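import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

// Write the regex rule to the file that RegexMatcher reads below
val rules = """\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER"""
Files.write(Paths.get("/content/regex_rules.txt"), rules.getBytes(StandardCharsets.UTF_8))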


import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
    .setInputCols(Array("document")) 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence")) 
    .setOutputCol("token") 

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("embeddings") 

val ner_model = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models")
    .setInputCols("sentence","token","embeddings") 
    .setOutputCol("ner") 

val ner_converter= new NerConverterInternal() 
    .setInputCols(Array("sentence","token","ner")) 
    .setOutputCol("ner_chunk") 

val regex_matcher = new RegexMatcher()
    .setInputCols("document") 
    .setStrategy("MATCH_ALL") 
    .setOutputCol("regex_matches") 
    .setExternalRules(path="/content/regex_rules.txt",delimiter=",") 

val chunkConverter = new ChunkConverter()
    .setInputCols("regex_matches") 
    .setOutputCol("regex_chunk") 

val merger= new ChunkMergeApproach() 
    .setInputCols(Array("regex_chunk","ner_chunk")) 
    .setOutputCol("merged_chunks") 
    .setMergeOverlapping(true) 
    .setChunkPrecedence("field") 

val pipeline= new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings,
    ner_model, 
    ner_converter, 
    regex_matcher, 
    chunkConverter, 
    merger )) 

val data = Seq(("""POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy. PROCEDURE: Excisional biopsy of right cervical lymph node. ANESTHESIA: General endotracheal anesthesia. Specimen: Right cervical lymph node. EBL: 10 cc. COMPLICATIONS: None. FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination. FLUIDS: Please see anesthesia report. URINE OUTPUT: None recorded during the case. INDICATIONS FOR PROCEDURE: This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient,an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node. PROCEDURE IN DETAIL: The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again,noted on palpation there was an enlarged level 2 cervical lymph node.A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified,and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision,and sterile bandage was applied. 
The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed.""")).toDF("text")

val result = pipeline.fit(data).transform(data)

+----------------------------------------------+--------------+
|                                         chunk| merged_entity|
+----------------------------------------------+--------------+
|                      POSTOPERATIVE DIAGNOSIS:|SECTION_HEADER|
|                      Cervical lymphadenopathy|       PROBLEM|
|                                    PROCEDURE:|SECTION_HEADER|
|Excisional biopsy of right cervical lymph node|          TEST|
|                                   ANESTHESIA:|SECTION_HEADER|
|               General endotracheal anesthesia|     TREATMENT|
|                     Right cervical lymph node|       PROBLEM|
|                                          EBL:|SECTION_HEADER|
|                                COMPLICATIONS:|SECTION_HEADER|
|                                     FINDINGS:|SECTION_HEADER|
|                   Enlarged level 2 lymph node|       PROBLEM|
|                        pathologic examination|          TEST|
|                                       FLUIDS:|SECTION_HEADER|
|                                 URINE OUTPUT:|SECTION_HEADER|
|                    INDICATIONS FOR PROCEDURE:|SECTION_HEADER|
+----------------------------------------------+--------------+
// val rules = """\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER""" 
// with open("regex_rules.txt","w") as f: 
//    f.write(rules) 

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols(Array("document")) 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence")) 
    .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence","token")) 
    .setOutputCol("embeddings") 

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
    .setInputCols(Array("sentence","token","embeddings")) 
    .setOutputCol("ner") 

val ner_converter= new NerConverterInternal()  
    .setInputCols(Array("sentence","token","ner")) 
    .setOutputCol("ner_chunk") 
    // .setWhiteList(Array("ORG")) // Uncomment to return only ORG entities

val regex_matcher = new RegexMatcher()
    .setInputCols("document") 
    .setStrategy("MATCH_ALL") 
    .setOutputCol("regex_matches") 
    .setExternalRules(path="/content/regex_rules.txt",delimiter=",") 

val chunkConverter = new ChunkConverter()
    .setInputCols("regex_matches") 
    .setOutputCol("regex_chunk") 
 
val merger= new ChunkMergeApproach() 
    .setInputCols(Array("regex_chunk","ner_chunk")) 
    .setOutputCol("merged_chunks") 
    .setMergeOverlapping(true) 
    .setChunkPrecedence("field") 

val pipeline= new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    ner_model, 
    ner_converter, 
    regex_matcher, 
    chunkConverter, 
    merger )) 
    
val data = Seq(("AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price.")).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+
// val rules = """[A-Z]+[\s+[A-Z]+]*,SECTION_HEADER """ 
// with open("regex_rules.txt","w") as f: 
//    f.write(rules) 

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols(Array("document")) 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence")) 
    .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence","token")) 
    .setOutputCol("embeddings") 

val ner_model = LegalNerModel.pretrained("legner_org_per_role_date", "en", "legal/models")
    .setInputCols(Array("sentence","token","embeddings")) 
    .setOutputCol("ner") 

val ner_converter= new NerConverterInternal()  
    .setInputCols(Array("sentence","token","ner")) 
    .setOutputCol("ner_chunk") 
    // .setWhiteList(Array("ORG")) // Uncomment to return only ORG entities

val regex_matcher = new RegexMatcher()
    .setInputCols("document") 
    .setStrategy("MATCH_ALL") 
    .setOutputCol("regex_matches") 
    .setExternalRules(path="/content/regex_rules.txt",delimiter=",") 

val chunkConverter = new ChunkConverter()
    .setInputCols("regex_matches") 
    .setOutputCol("regex_chunk") 
 
val merger= new ChunkMergeApproach() 
    .setInputCols(Array("regex_chunk","ner_chunk")) 
    .setOutputCol("merged_chunks") 
    .setMergeOverlapping(true) 
    .setChunkPrecedence("field") 

val pipeline= new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    ner_model, 
    ner_converter, 
    regex_matcher, 
    chunkConverter, 
    merger )) 
    
val data = Seq(("AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price.")).toDF("text")

val result = pipeline.fit(data).transform(data)
                          
+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+

ChunkEntityResolver

The ChunkEntityResolverModel encompasses the functionality to produce a normalized entity from a specialized ontology or curated dataset (such as ICD-10, RxNorm, SNOMED, etc.). This model includes comprehensive parameters and methods essential for its training. It operates by transforming a dataset that incorporates two Input Annotations: TOKEN and WORD_EMBEDDINGS, sourced from tools like ChunkTokenizer and ChunkEmbeddings Annotators. Ultimately, it generates the normalized entity relevant to the specified trained ontology or curated dataset, ensuring accurate entity resolution within the given context.
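
As a rough intuition for what "resolution" means here, the sketch below maps a chunk embedding to the closest entry of a tiny code table using cosine similarity. It is a conceptual illustration in plain Python, not the annotator's actual algorithm: the codes (C001, ...), the three-dimensional vectors, and the resolve helper are all made up for the example.

```python
import math

# Hypothetical toy "ontology": code -> (description, embedding).
# Real resolvers work with high-dimensional embeddings and large vocabularies.
ONTOLOGY = {
    "C001": ("hypercholesterolemia", [0.9, 0.1, 0.0]),
    "C002": ("increased platelets", [0.1, 0.8, 0.2]),
    "C003": ("hypoviscosity", [0.0, 0.2, 0.9]),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def resolve(chunk_embedding, k=2):
    """Return the top-k (code, description, similarity), most similar first."""
    scored = [
        (code, description, cosine(chunk_embedding, embedding))
        for code, (description, embedding) in ONTOLOGY.items()
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:k]


# A made-up embedding for a chunk such as "platelets".
candidates = resolve([0.2, 0.9, 0.1])
best_code, best_text, best_score = candidates[0]
print(best_code, best_text, round(best_score, 3))
```

The list of runner-up candidates plays a role loosely comparable to the all_k_results and all_k_resolutions metadata fields shown in the example output further down.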

For available pretrained models please see the Models Hub.

Important Note: This annotator has been deprecated.

Input Annotator Types: TOKEN, WORD_EMBEDDINGS

Output Annotator Type: ENTITY

Scala API: ChunkEntityResolverModel
Show Example

from johnsnowlabs import nlp, medical
# Using pretrained models for SNOMED
# First the prior steps of the pipeline are defined.
# Output of types TOKEN and WORD_EMBEDDINGS are needed.
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")

docAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("word_embeddings")

icdo_ner = medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("icdo_ner")

icdo_chunk = nlp.NerConverter()\
    .setInputCols(["sentence","token","icdo_ner"])\
    .setOutputCol("icdo_chunk")\
    .setWhiteList(["Cancer"])

icdo_chunk_embeddings = nlp.ChunkEmbeddings() \
    .setInputCols(["icdo_chunk", "word_embeddings"]) \
    .setOutputCol("icdo_chunk_embeddings")

icdo_chunk_resolver = medical.ChunkEntityResolverModel\
    .pretrained("chunkresolve_icdo_clinical", "en", "clinical/models") \
    .setInputCols(["token","icdo_chunk_embeddings"]) \
    .setOutputCol("tm_icdo_code")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

ner_chunk_tokenizer = nlp.ChunkTokenizer() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("ner_token")
    
ner_chunk_embeddings = nlp.ChunkEmbeddings() \
    .setInputCols(["ner_chunk", "word_embeddings"]) \
    .setOutputCol("ner_chunk_embeddings")

# Definition of the SNOMED Resolution
ner_snomed_resolver = medical.ChunkEntityResolverModel\
    .pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models") \
    .setInputCols(["ner_token","ner_chunk_embeddings"])\
    .setOutputCol("snomed_result")

pipelineFull = nlp.Pipeline().setStages([
      docAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,

      clinical_ner,
      ner_converter,
      ner_chunk_embeddings,
      ner_chunk_tokenizer,
      ner_snomed_resolver,

      icdo_ner,
      icdo_chunk,
      icdo_chunk_embeddings,
      icdo_chunk_resolver
])
pipelineModelFull = pipelineFull.fit(data)
result = pipelineModelFull.transform(data).cache()

# Show results
result.selectExpr("explode(snomed_result)") \
  .selectExpr(
    "col.metadata.target_text",
    "col.metadata.resolved_text",
    "col.metadata.confidence",
    "col.metadata.all_k_results",
    "col.metadata.all_k_resolutions") \
  .filter("confidence > 0.2").show(5)
+--------------------+--------------------+----------+--------------------+--------------------+
|         target_text|       resolved_text|confidence|       all_k_results|   all_k_resolutions|
+--------------------+--------------------+----------+--------------------+--------------------+
|hypercholesterolemia|Hypercholesterolemia|    0.2524|13644009:::267432...|Hypercholesterole...|
|                 CBC|             Neocyte|    0.4980|259680000:::11573...|Neocyte:::Blood g...|
|                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
|           platelets| Increased platelets|    0.5267|6631009:::2596800...|Increased platele...|
|                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
+--------------------+--------------------+----------+--------------------+--------------------+

import spark.implicits._
// Using pretrained models for SNOMED
// First the prior steps of the pipeline are defined.
// Output of types TOKEN and WORD_EMBEDDINGS are needed.
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")

val docAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("word_embeddings")

val icdo_ner = MedicalNerModel.pretrained("ner_bionlp", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "word_embeddings"))
    .setOutputCol("icdo_ner")

val icdo_chunk = new NerConverter()
    .setInputCols(Array("sentence","token","icdo_ner"))
    .setOutputCol("icdo_chunk")
    .setWhiteList("Cancer")

val icdo_chunk_embeddings = new ChunkEmbeddings()
    .setInputCols(Array("icdo_chunk", "word_embeddings"))
    .setOutputCol("icdo_chunk_embeddings")

val icdo_chunk_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical", "en", "clinical/models")
    .setInputCols(Array("token","icdo_chunk_embeddings"))
    .setOutputCol("tm_icdo_code")

val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "word_embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val ner_chunk_tokenizer = new ChunkTokenizer()
    .setInputCols("ner_chunk")
    .setOutputCol("ner_token")
  
val ner_chunk_embeddings = new ChunkEmbeddings()
    .setInputCols(Array("ner_chunk", "word_embeddings"))
    .setOutputCol("ner_chunk_embeddings")

// Definition of the SNOMED Resolution
val ner_snomed_resolver = ChunkEntityResolverModel
    .pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")
    .setInputCols(Array("ner_token","ner_chunk_embeddings"))
    .setOutputCol("snomed_result")

val pipelineFull = new Pipeline().setStages(Array(
    docAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,

    clinical_ner,
    ner_converter,
    ner_chunk_embeddings,
    ner_chunk_tokenizer,
    ner_snomed_resolver,

    icdo_ner,
    icdo_chunk,
    icdo_chunk_embeddings,
    icdo_chunk_resolver
))
val pipelineModelFull = pipelineFull.fit(data)
val result = pipelineModelFull.transform(data).cache()

// Show results
//
// result.selectExpr("explode(snomed_result)")
//   .selectExpr(
//     "col.metadata.target_text",
//     "col.metadata.resolved_text",
//     "col.metadata.confidence",
//     "col.metadata.all_k_results",
//     "col.metadata.all_k_resolutions")
//   .filter($"confidence" > 0.2).show(5)
// +--------------------+--------------------+----------+--------------------+--------------------+
// |         target_text|       resolved_text|confidence|       all_k_results|   all_k_resolutions|
// +--------------------+--------------------+----------+--------------------+--------------------+
// |hypercholesterolemia|Hypercholesterolemia|    0.2524|13644009:::267432...|Hypercholesterole...|
// |                 CBC|             Neocyte|    0.4980|259680000:::11573...|Neocyte:::Blood g...|
// |                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
// |           platelets| Increased platelets|    0.5267|6631009:::2596800...|Increased platele...|
// |                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
// +--------------------+--------------------+----------+--------------------+--------------------+
//

Contains all the parameters and methods to train a ChunkEntityResolverModel. It transforms a dataset with two Input Annotations of types TOKEN and WORD_EMBEDDINGS, coming from e.g. the ChunkTokenizer and ChunkEmbeddings annotators, and returns the normalized entity for a particular trained ontology or curated dataset (e.g. ICD-10, RxNorm, SNOMED).

To use pretrained models please use ChunkEntityResolverModel and see the Models Hub for available models.

Input Annotator Types: TOKEN, WORD_EMBEDDINGS

Output Annotator Type: ENTITY

Scala API: ChunkEntityResolverApproach
Show Example

from johnsnowlabs import nlp, medical
# Training a SNOMED model
# Define the pre-processing pipeline for the training data. The training data needs columns for the normalized text and its labels.
document = nlp.DocumentAssembler() \
    .setInputCol("normalized_text") \
    .setOutputCol("document")

chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk")

token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel\
    .pretrained("embeddings_healthcare_100d", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

chunkEmb = nlp.ChunkEmbeddings() \
    .setInputCols(["chunk", "embeddings"]) \
    .setOutputCol("chunk_embeddings")

snomedTrainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    chunkEmb
])

snomedTrainingModel = snomedTrainingPipeline.fit(data)

snomedData = snomedTrainingModel.transform(data).cache()

# Then the Resolver can be trained with
snomedExtractor = medical.ChunkEntityResolverApproach() \
    .setInputCols(["token", "chunk_embeddings"]) \
    .setOutputCol("recognized") \
    .setNeighbours(1000) \
    .setAlternatives(25) \
    .setNormalizedCol("normalized_text") \
    .setLabelCol("label") \
    .setEnableWmd(True).setEnableTfidf(True).setEnableJaccard(True) \
    .setEnableSorensenDice(True).setEnableJaroWinkler(True).setEnableLevenshtein(True) \
    .setDistanceWeights([1, 2, 2, 1, 1, 1]) \
    .setAllDistancesMetadata(True) \
    .setPoolingStrategy("MAX") \
    .setThreshold(1e32)
model = snomedExtractor.fit(snomedData)

import spark.implicits._
// Training a SNOMED model
// Define the pre-processing pipeline for the training data. The training data needs columns for the normalized text and its labels.
val document = new DocumentAssembler()
    .setInputCol("normalized_text")
    .setOutputCol("document")

val chunk = new Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("chunk")

val token = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel
    .pretrained("embeddings_healthcare_100d", "en", "clinical/models")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val chunkEmb = new ChunkEmbeddings()
    .setInputCols(Array("chunk", "embeddings"))
    .setOutputCol("chunk_embeddings")

val snomedTrainingPipeline = new Pipeline().setStages(Array(
    document,
    chunk,
    token,
    embeddings,
    chunkEmb
))

val snomedTrainingModel = snomedTrainingPipeline.fit(data)

val snomedData = snomedTrainingModel.transform(data).cache()

// Then the Resolver can be trained with
val snomedExtractor = new ChunkEntityResolverApproach()
    .setInputCols(Array("token", "chunk_embeddings"))
    .setOutputCol("recognized")
    .setNeighbours(1000)
    .setAlternatives(25)
    .setNormalizedCol("normalized_text")
    .setLabelCol("label")
    .setEnableWmd(true).setEnableTfidf(true).setEnableJaccard(true)
    .setEnableSorensenDice(true).setEnableJaroWinkler(true).setEnableLevenshtein(true)
    .setDistanceWeights(Array(1, 2, 2, 1, 1, 1))
    .setAllDistancesMetadata(true)
    .setPoolingStrategy("MAX")
    .setThreshold(1e32)
val model = snomedExtractor.fit(snomedData)
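
The training example above enables several string distances (Jaccard, Sorensen-Dice, Levenshtein, and others) and weights them with setDistanceWeights. How the annotator combines them internally is not described on this page. The sketch below assumes a simple weighted average and implements only three of the distances in plain Python, purely to illustrate the idea of ranking candidates by a weighted mix of distances. All function names and the weights are illustrative.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| computed over whitespace-separated token sets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def sorensen_dice_distance(a, b):
    """1 - 2|A & B| / (|A| + |B|) computed over token sets."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - 2 * len(sa & sb) / (len(sa) + len(sb))


def levenshtein(a, b):
    """Classic edit distance using a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def combined_distance(a, b, weights=(2.0, 1.0, 1.0)):
    """Weighted average of the three distances (assumed combination rule)."""
    distances = (
        jaccard_distance(a, b),
        sorensen_dice_distance(a, b),
        normalized_levenshtein(a, b),
    )
    return sum(w * d for w, d in zip(weights, distances)) / sum(weights)


candidates = ["chest pain", "stomach pain", "gastroenteritis"]
query = "stomach pains"
ranked = sorted(candidates, key=lambda c: combined_distance(query, c))
print(ranked)
```

Raising one weight makes the corresponding distance dominate the ranking, which is the general effect one would expect from adjusting the weights passed to the trainer.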

ChunkFilterer

Model

Filters entities coming from CHUNK annotations. Filters can be set via a white list of terms or a regular expression. White list criteria is enabled by default. To use regex, criteria has to be set to regex.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

  • criteria: Tag representing the criteria used to filter the chunks. Possible values are: isIn (filter by the chunk) and regex (filter using a regex).

  • whiteList: If defined, list of entities to process. The rest will be ignored.

  • blackList: If defined, list of entities to ignore. The rest will be processed.

  • regex: If defined, list of regex to process the chunks (Default: []).

  • filterEntity: If equal to “entity”, use the ner label to filter. If set to “result”, use the result attribute of the annotation to filter.

  • entitiesConfidence: Path to csv with pairs (entity,confidenceThreshold). Filter the chunks with entities which have confidence lower than the confidence threshold.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
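
To make the two criteria concrete, here is a minimal plain-Python sketch of white list, black list, and regex filtering over chunk records. It is not the annotator's implementation: the filter_chunks helper, the dictionary layout, and the matching semantics (exact string equality for lists, re.search for regexes) are assumptions chosen for illustration.

```python
import re


def filter_chunks(chunks, criteria="isin", white_list=(), black_list=(), regex=()):
    """Keep chunks by white list / black list membership or by regex match."""
    kept = []
    for chunk in chunks:
        text = chunk["result"]
        if criteria == "isin":
            # A non-empty white list keeps only listed terms.
            if white_list and text not in white_list:
                continue
            # Black-listed terms are always dropped.
            if text in black_list:
                continue
            kept.append(chunk)
        elif criteria == "regex":
            # Keep the chunk if any pattern matches somewhere in its text.
            if any(re.search(pattern, text) for pattern in regex):
                kept.append(chunk)
        else:
            raise ValueError(f"unknown criteria: {criteria}")
    return kept


chunks = [
    {"begin": 11, "end": 17, "result": "history"},
    {"begin": 22, "end": 36, "result": "gastroenteritis"},
    {"begin": 42, "end": 53, "result": "stomach pain"},
]

by_list = filter_chunks(chunks, criteria="isin", white_list=["gastroenteritis"])
by_regex = filter_chunks(chunks, criteria="regex", regex=[r"pain$"])
print([c["result"] for c in by_list])
print([c["result"] for c in by_regex])
```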

Input Annotator Types: DOCUMENT,CHUNK

Output Annotator Type: CHUNK

Python API: ChunkFilterer
Scala API: ChunkFilterer
Notebook: ChunkFiltererNotebook
Show Example

from johnsnowlabs import nlp, medical

# Filtering POS tags
# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = medical.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["gastroenteritis"])

pipeline = nlp.Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

data = spark.createDataFrame([["Has a past history of gastroenteritis and stomach pain, however patient ..."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+---------------------------------------------------------------------------------+
|col                                                                              |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
+---------------------------------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

# Filtering POS tags
# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only terms with "rate" remain.
chunkerFilter = finance.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["rate"])

pipeline = nlp.Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

data = spark.createDataFrame([["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
from johnsnowlabs import nlp, legal

# Filtering POS tags
# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only terms with "rate" remain.
chunkerFilter = legal.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["rate"])

pipeline = nlp.Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

data = spark.createDataFrame([["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

// Filtering POS tags
// First pipeline stages to extract the POS tags are defined

import spark.implicits._

val docAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val posTagger = PerceptronModel.pretrained()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("pos") 

val chunker = new Chunker()
 .setInputCols(Array("pos","sentence")) 
 .setOutputCol("chunk") 
 .setRegexParsers(Array("(<NN>)+")) 

val chunkerFilter = new ChunkFilterer()
 .setInputCols(Array("sentence","chunk")) 
 .setOutputCol("filtered") 
 .setCriteria("isin") 
 .setWhiteList(Array("gastroenteritis"))

val pipeline = new Pipeline().setStages(Array(
  docAssembler, 
  sentenceDetector, 
  tokenizer, 
  posTagger, 
  chunker, 
  chunkerFilter)) 

val text ="""Has a past history of gastroenteritis and stomach pain, however patient ..."""

val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)

// result.selectExpr("explode(chunk)").show(truncate=false)
+---------------------------------------------------------------------------------+
|col                                                                              |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
+---------------------------------------------------------------------------------+

// result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
import spark.implicits._

val docAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val posTagger = PerceptronModel.pretrained()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("pos") 

val chunker = new Chunker()
 .setInputCols(Array("pos","sentence")) 
 .setOutputCol("chunk") 
 .setRegexParsers(Array("(<NN>)+")) 

val chunkerFilter = new ChunkFilterer()
 .setInputCols(Array("sentence","chunk")) 
 .setOutputCol("filtered") 
 .setCriteria("isin") 
 .setWhiteList(Array("rate"))

val pipeline = new Pipeline().setStages(Array(
  docAssembler, 
  sentenceDetector, 
  tokenizer, 
  posTagger, 
  chunker, 
  chunkerFilter)) 

val text ="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."""

val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)

// result.selectExpr("explode(chunk)").show(truncate=false)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

// result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
import spark.implicits._

val docAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val posTagger = PerceptronModel.pretrained()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("pos") 

val chunker = new Chunker()
 .setInputCols(Array("pos","sentence")) 
 .setOutputCol("chunk") 
 .setRegexParsers(Array("(<NN>)+")) 

val chunkerFilter = new ChunkFilterer()
 .setInputCols(Array("sentence","chunk")) 
 .setOutputCol("filtered") 
 .setCriteria("isin") 
 .setWhiteList(Array("rate"))

val pipeline = new Pipeline().setStages(Array(
  docAssembler, 
  sentenceDetector, 
  tokenizer, 
  posTagger, 
  chunker, 
  chunkerFilter)) 

val text ="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."""

val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)

// result.selectExpr("explode(chunk)").show(truncate=false)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

// result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

ChunkKeyPhraseExtraction

Model

Chunk KeyPhrase Extraction uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text. The input to the model consists of chunk annotations and sentence or document annotations. The model compares the chunks against the corresponding sentences/documents and selects the chunks that are most representative of the broader text context (i.e. the document or the sentence they belong to). The key phrase candidates (i.e. the input chunks) can be generated in various ways, e.g. by NGramGenerator, TextMatcher or NerConverter. The model operates either at sentence level (selecting the most descriptive chunks from the sentence they belong to) or at document level. In the latter case, the key phrases are selected to represent all the input document annotations.

Parameters:

  • setConcatenateSentences(value: Boolean): Concatenate the input sentence/document annotations before computing their embedding. Default value is ‘true’.

  • setDivergence(value: Float): Set the level of divergence of the extracted key phrases.

  • setDocumentLevelProcessing(value: Boolean): Extract key phrases from the whole document (true) or from the particular sentences which the chunks refer to (false). Default value is ‘true’.

  • setDropPunctuation(value: Boolean): Remove punctuation marks from input chunks.

  • setSelectMostDifferent(value: Boolean): Let the model return the top N key phrases which are the most different from each other.

  • setTopN(value: Int): Set the number of key phrases to extract.

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at the Models Hub.
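
The metadata keys shown in the examples for this annotator (DocumentSimilarity, MMRScore) suggest that candidates are ranked with a Maximal Marginal Relevance style score. The pure-Python sketch below illustrates that idea under stated assumptions; it is not the library's implementation. The helper names (cosine, mmr_select) and the exact formula are illustrative assumptions: relevance to the document weighted by (1 - divergence), minus divergence times the highest similarity to an already selected phrase. The sample outputs are at least consistent with the first term, since 0.668 × (1 − 0.4) ≈ 0.401.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(doc_vec, chunk_vecs, top_n, divergence):
    """Pick up to top_n chunk indices with an MMR-style score (assumed formula).

    score = (1 - divergence) * sim(chunk, document)
            - divergence * max(sim(chunk, already selected chunk))
    Returns a list of (chunk_index, score) in selection order.
    """
    doc_sim = [cosine(doc_vec, v) for v in chunk_vecs]
    selected = []
    remaining = list(range(len(chunk_vecs)))

    def score(i):
        # Redundancy is 0.0 while nothing has been selected yet.
        redundancy = max(
            (cosine(chunk_vecs[i], chunk_vecs[j]) for j, _ in selected),
            default=0.0,
        )
        return (1.0 - divergence) * doc_sim[i] - divergence * redundancy

    while remaining and len(selected) < top_n:
        best = max(remaining, key=score)
        selected.append((best, score(best)))
        remaining.remove(best)
    return selected


# Toy 2-dimensional "embeddings" (hypothetical values for illustration only).
document = [1.0, 0.0]
chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(mmr_select(document, chunks, top_n=2, divergence=0.0))
print(mmr_select(document, chunks, top_n=2, divergence=0.9))
```

With a divergence of 0 the ranking follows document similarity alone; higher values favor phrases that differ from those already chosen, which is the trade-off setDivergence and setSelectMostDifferent appear to control.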

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkKeyPhraseExtraction Scala API: ChunkKeyPhraseExtraction Notebook: ChunkKeyPhraseExtractionNotebook

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens")

embeddings = nlp.WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = medical.NerModel().pretrained("ner_jsl_slim", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = nlp.NerConverter()\
    .setInputCols("sentences", "tokens", "ner_tags")\
    .setOutputCol("ner_chunks")

key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ner_chunks"])\
    .setOutputCol("ner_chunk_key_phrases")

pipeline = nlp.Pipeline(stages=[
    documenter, 
    sentencer, 
    tokenizer, 
    embeddings, 
    ner_tagger, 
    ner_converter,
    key_phrase_extractor])

data = spark.createDataFrame([["Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."]]).toDF("text")
results = pipeline.fit(data).transform(data)

results.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
       .selectExpr("key_phrase.result",
                   "key_phrase.metadata.entity",
                   "key_phrase.metadata.DocumentSimilarity",
                   "key_phrase.metadata.MMRScore").show(truncate=False)

+--------+-------------------------+------------------+-----------------+
|result  |entity                   |DocumentSimilarity|MMRScore         |
+--------+-------------------------+------------------+-----------------+
|Diabetes|Disease_Syndrome_Disorder|0.66827321499841  |0.400963944931921|
+--------+-------------------------+------------------+-----------------+
from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

key_phrase_extractor = finance.ChunkKeyPhraseExtraction\
    .pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentence", "ner_chunk"])\
    .setOutputCol("ner_chunk_key_phrases")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    key_phrase_extractor])

text = ["""In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)."""]

data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

result.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
      .selectExpr("key_phrase.result",
                  "key_phrase.metadata.entity",
                  "key_phrase.metadata.DocumentSimilarity",
                  "key_phrase.metadata.MMRScore").show(truncate=False)

+------------------------------+-------+------------------+-------------------+
|result                        |entity |DocumentSimilarity|MMRScore           |
+------------------------------+-------+------------------+-------------------+
|Policy Compliance             |PRODUCT|0.6446724461374882|0.38680348305268175|
|Spell Security Private Limited|ORG    |0.6282153013401193|0.3769291957818915 |
+------------------------------+-------+------------------+-------------------+
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias","en","legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

key_phrase_extractor = legal.ChunkKeyPhraseExtraction\
    .pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentence", "ner_chunk"])\
    .setOutputCol("ner_chunk_key_phrases")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    key_phrase_extractor])

text = ["""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
"""]

data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

result.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
      .selectExpr("key_phrase.result",
                  "key_phrase.metadata.entity",
                  "key_phrase.metadata.DocumentSimilarity",
                  "key_phrase.metadata.MMRScore").show(truncate=False)

+--------------+------+------------------+-------------------+
|result        |entity|DocumentSimilarity|MMRScore           |
+--------------+------+------------------+-------------------+
|Buyer Entities|ALIAS |0.5680936022739617|0.34085617490878395|
+--------------+------+------------------+-------------------+

import spark.implicits._

val documenter = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentencer = new SentenceDetector()
  .setInputCols(Array("document")) 
  .setOutputCol("sentences") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("document")) 
  .setOutputCol("tokens") 

val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical","en","clinical/models") 
  .setInputCols(Array("document","tokens")) 
  .setOutputCol("embeddings") 

val ner_tagger = MedicalNerModel.pretrained("ner_jsl_slim","en","clinical/models") 
  .setInputCols(Array("sentences","tokens","embeddings")) 
  .setOutputCol("ner_tags") 
 
val ner_converter = new NerConverter()
  .setInputCols("sentences","tokens","ner_tags") 
  .setOutputCol("ner_chunks") 

val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()
  .setTopN(1) 
  .setDocumentLevelProcessing(false) 
  .setDivergence(0.4) 
  .setInputCols(Array("sentences","ner_chunks")) 
  .setOutputCol("ner_chunk_key_phrases") 

val pipeline = new Pipeline().setStages(Array( 
  documenter, 
  sentencer, 
  tokenizer, 
  embeddings, 
  ner_tagger, 
  ner_converter, 
  key_phrase_extractor)) 

val text ="""Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."""
val data = Seq(text).toDF("text")

val results = pipeline.fit(data).transform(data)

+--------+-------------------------+------------------+-----------------+
|result  |entity                   |DocumentSimilarity|MMRScore         |
+--------+-------------------------+------------------+-----------------+
|Diabetes|Disease_Syndrome_Disorder|0.66827321499841  |0.400963944931921|
+--------+-------------------------+------------------+-----------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
  .setInputCols(Array("document") ) 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence")) 
  .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embeddings") 

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
  .setInputCols(Array("sentence","token","embeddings")) 
  .setOutputCol("ner") 

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 

val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained() 
  .setTopN(1) 
  .setDocumentLevelProcessing(false) 
  .setDivergence(0.4) 
  .setInputCols(Array("sentence","ner_chunk")) 
  .setOutputCol("ner_chunk_key_phrases")

val nlpPipeline = new Pipeline().setStages(Array( 
  documentAssembler, 
  sentenceDetector, 
  tokenizer, 
  embeddings, 
  ner_model, 
  ner_converter, 
  key_phrase_extractor) ) 

val text ="""In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)."""
val data = Seq(text).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+------------------------------+-------+------------------+-------------------+
|result                        |entity |DocumentSimilarity|MMRScore           |
+------------------------------+-------+------------------+-------------------+
|Policy Compliance             |PRODUCT|0.6446724461374882|0.38680348305268175|
|Spell Security Private Limited|ORG    |0.6282153013401193|0.3769291957818915 |
+------------------------------+-------+------------------+-------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
  .setInputCols(Array("document") ) 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence")) 
  .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embeddings") 

val ner_model = LegalNerModel.pretrained("legner_orgs_prods_alias","en","legal/models")
  .setInputCols(Array("sentence","token","embeddings")) 
  .setOutputCol("ner") 

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 

val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained() 
  .setTopN(1) 
  .setDocumentLevelProcessing(false) 
  .setDivergence(0.4) 
  .setInputCols(Array("sentence","ner_chunk")) 
  .setOutputCol("ner_chunk_key_phrases")

val nlpPipeline = new Pipeline().setStages(Array( 
  documentAssembler, 
  sentenceDetector, 
  tokenizer, 
  embeddings, 
  ner_model, 
  ner_converter, 
  key_phrase_extractor) ) 

val text ="""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""
val data = Seq(text).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+--------------+------+------------------+-------------------+
|result        |entity|DocumentSimilarity|MMRScore           |
+--------------+------+------------------+-------------------+
|Buyer Entities|ALIAS |0.5680936022739617|0.34085617490878395|
+--------------+------+------------------+-------------------+

ChunkMapper

Model

We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.

This is the AnnotatorModel of the ChunkMapper, which can be used to access pretrained models with the .pretrained() or .load() methods. To train a new model, check the documentation of the ChunkMapperApproach annotator.

The annotator also allows using fuzzy matching, which can take into consideration parts of the tokens that can map even when word order is different, char ngrams that can map even when there are typos, and fuzzy distance metrics (Jaccard, Levenshtein, etc.).

Parameters:

  • setRels (List[str]): Relations that we are going to use to map the chunk

  • setLowerCase (Boolean): Set if we want to map the chunks in lower case or not (Default: True)

  • setAllowMultiTokenChunk (Boolean): Whether to skip relations with multitokens (Default: True)

  • setMultivaluesRelations (Boolean): Whether to decide to return all values in a relation together or separately (Default: False)
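
As a rough mental model of how setRels and setLowerCase interact, the sketch below performs a plain dictionary lookup over the JSON schema used in the ChunkMapperApproach example (mappings, key, relations, values). The helper name map_chunk and its arguments are hypothetical, and treating setLowerCase as case-insensitive matching is an assumption; the real annotator works on Spark annotations and supports more options.

```python
def map_chunk(dictionary, chunk, rels, lower_case=True):
    """Return {relation: values} for the first dictionary entry matching `chunk`.

    Hypothetical helper: `rels` plays the role of setRels, and `lower_case`
    is assumed to mean that keys and chunks are lowercased before comparison.
    """
    normalize = str.lower if lower_case else (lambda s: s)
    for entry in dictionary["mappings"]:
        if normalize(entry["key"]) == normalize(chunk):
            # Keep only the relations that were requested.
            return {
                rel["key"]: rel["values"]
                for rel in entry["relations"]
                if rel["key"] in rels
            }
    return {}


sample = {
    "mappings": [
        {
            "key": "metformin",
            "relations": [
                {"key": "action", "values": ["hypoglycemic", "Drugs Used In Diabetes"]},
                {"key": "treatment", "values": ["diabetes", "t2dm"]},
            ],
        }
    ]
}

print(map_chunk(sample, "Metformin", ["action"]))
print(map_chunk(sample, "Metformin", ["action"], lower_case=False))
```

The second call finds nothing because, under the assumption above, the capitalized chunk no longer matches the lowercase dictionary key.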

Example usage and more details can be found in the Spark NLP Workshop repository on GitHub, for example in the Healthcare Chunk Mapping notebook.

Input Annotator Types: CHUNK

Output Annotator Type: LABEL_DEPENDENCY

Python API: ChunkMapperModel Scala API: ChunkMapperModel Notebook: ChunkMapperModelNotebook

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentences"])\
  .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
  .pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentences", "tokens"])\
  .setOutputCol("embeddings")

ner_tagger = medical.NerModel()\
  .pretrained("ner_posology", "en", "clinical/models")\
  .setInputCols("sentences", "tokens", "embeddings")\
  .setOutputCol("ner_tags")

ner_converter = medical.NerConverterInternal()\
  .setInputCols(["sentences", "tokens", "ner_tags"])\
  .setOutputCol("ner_chunks")\
  .setWhiteList(["DRUG"])

chunkToDoc = nlp.Chunk2Doc()\
  .setInputCols("ner_chunks")\
  .setOutputCol("ner_chunks_doc")

sbert_embedder = nlp.BertSentenceEmbeddings\
  .pretrained("sbiobert_base_cased_mli", "en","clinical/models")\
  .setInputCols(["ner_chunks_doc"])\
  .setOutputCol("sbert_embeddings")\
  .setCaseSensitive(False)

rxnorm_resolver = medical.SentenceEntityResolverModel\
  .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\
  .setInputCols(["sbert_embeddings"])\
  .setOutputCol("rxnorm_code")\
  .setDistanceFunction("EUCLIDEAN")

resolver2chunk = medical.Resolution2Chunk()\
  .setInputCols(["rxnorm_code"]) \
  .setOutputCol("rxnorm_chunk")

chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper", "en", "clinical/models")\
  .setInputCols(["rxnorm_chunk"])\
  .setOutputCol("rxnorm_drug_brandname_mapper")\
  .setRels(["rxnorm_brandname"])

pipeline = nlp.Pipeline(
    stages = [
        documenter,
        sentencer,
        tokenizer,
        words_embedder,
        ner_tagger,
        ner_converter,
        chunkToDoc,
        sbert_embedder,
        rxnorm_resolver,
        resolver2chunk,
        chunkerMapper
        ])


data = spark.createDataFrame([["The doctor prescribed Sinequan 150 MG for depression and Zonalon 50 mg for managing skin itching"]]).toDF("text")

result= pipeline.fit(data).transform(data)

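from pyspark.sql import functions as F

# Zip the three result arrays so each row shows chunk, RxNorm code and mapped brand name
result.select(F.explode(F.arrays_zip(result.ner_chunks.result,
                                     result.rxnorm_code.result,
                                     result.rxnorm_drug_brandname_mapper.result)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("ner_chunks"),
                          F.expr("cols['1']").alias("rxnorm_code"),
                          F.expr("cols['2']").alias("rxnorm_drug_brandname_mapper")).show(15, truncate=100)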

+----------+-----------+----------------------------+
|ner_chunks|rxnorm_code|rxnorm_drug_brandname_mapper|
+----------+-----------+----------------------------+
|  Sinequan|     224915|         Sinequan (Sinequan)|
|   Zonalon|       9801|           Zonalon (Zonalon)|
+----------+-----------+----------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

tokenizer = nlp.Tokenizer()\
  .setInputCols("document")\
  .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["document", "token"]) \
  .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\
  .setInputCols(["document", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_chunk")

CM = finance.ChunkMapperModel.pretrained('finmapper_nasdaq_ticker_stock_screener', 'en', 'finance/models')\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  tokenizer, 
  embeddings,
  ner_model, 
  ner_converter, 
  CM])
                                 
text = ["""There are some serious purchases and sales of AMZN stock today."""]

data = spark.createDataFrame([text]).toDF("text")

result = pipeline.fit(data).transform(data)

+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result|result                                                                                                                                                             |
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[AMZN]|[AMZN, Amazon.com Inc. Common Stock, $98.12, 2.85, 2.991%, 9.98556270184E11, United States, 1997, 85412563, Consumer Discretionary, Catalog/Specialty Distribution]|
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

tokenizer = nlp.Tokenizer()\
  .setInputCols("document")\
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \
  .setInputCols(['document', 'token']) \
  .setOutputCol('embeddings')

ner_model = nlp.NerDLModel.pretrained("onto_100", "en") \
  .setInputCols(["document", "token", "embeddings"]) \
  .setOutputCol("ner")
 
ner_converter = nlp.NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_chunk")\
  .setWhiteList(["CARDINAL"])

CM = legal.ChunkMapperModel().pretrained("legmapper_edgar_irs", "en", "legal/models")\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  tokenizer, 
  embeddings,
  ner_model, 
  ner_converter, 
  CM])

text = ["""873474341 is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services"""]

data = spark.createDataFrame([text]).toDF("text")

result= pipeline.fit(data).transform(data)

+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result     |result                                                                                                                                                               |
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[873474341]|[Masterworks 096, LLC, RETAIL-RETAIL STORES, NEC [5990], 5990, 873474341, 1231, NY, DE, 225 LIBERTY STREET, NEW YORK, NY, 10281, 2035185172, , , 2022-01-10, 1894064]|
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documenter = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentencer = new SentenceDetector()
 .setInputCols("document")
 .setOutputCol("sentences") 

val tokenizer = new Tokenizer()
 .setInputCols("sentences")
 .setOutputCol("tokens") 

val words_embedder = WordEmbeddingsModel
 .pretrained("embeddings_clinical","en","clinical/models") 
 .setInputCols(Array("sentences","tokens")) 
 .setOutputCol("embeddings") 

val ner_tagger = MedicalNerModel
 .pretrained("ner_posology","en","clinical/models") 
 .setInputCols(Array("sentences","tokens","embeddings"))
 .setOutputCol("ner_tags") 

val ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentences","tokens","ner_tags")) 
 .setOutputCol("ner_chunks") 
 .setWhiteList("DRUG") 

val chunkToDoc = new Chunk2Doc()
 .setInputCols("ner_chunks") 
 .setOutputCol("ner_chunks_doc") 

val sbert_embedder = BertSentenceEmbeddings
 .pretrained("sbiobert_base_cased_mli","en","clinical/models") 
 .setInputCols("ner_chunks_doc")
 .setOutputCol("sbert_embeddings") 
 .setCaseSensitive(false) 

val rxnorm_resolver = SentenceEntityResolverModel
 .pretrained("sbiobertresolve_rxnorm_augmented","en","clinical/models") 
 .setInputCols("sbert_embeddings")
 .setOutputCol("rxnorm_code") 
 .setDistanceFunction("EUCLIDEAN") 

val resolver2chunk = new Resolution2Chunk()
 .setInputCols("rxnorm_code")
 .setOutputCol("rxnorm_chunk") 

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper","en","clinical/models")
 .setInputCols("rxnorm_chunk")
 .setOutputCol("rxnorm_drug_brandname_mapper") 
 .setRels(Array("rxnorm_brandname")) 

val pipeline = new Pipeline().setStages(Array(
 documenter, 
 sentencer, 
 tokenizer, 
 words_embedder, 
 ner_tagger, 
 ner_converter, 
 chunkToDoc, 
 sbert_embedder, 
 rxnorm_resolver, 
 resolver2chunk,
  chunkerMapper )) 

val text ="""The doctor prescribed Sinequan 150 MG for depression and Zonalon 50 mg for managing skin itching"""
val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+----------+-----------+----------------------------+
|ner_chunks|rxnorm_code|rxnorm_drug_brandname_mapper|
+----------+-----------+----------------------------+
|  Sinequan|     224915|         Sinequan (Sinequan)|
|   Zonalon|       9801|           Zonalon (Zonalon)|
+----------+-----------+----------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val tokenizer = new Tokenizer()
 .setInputCols("document") 
 .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("document","token")) 
 .setOutputCol("embeddings") 

val ner_model = FinanceNerModel.pretrained("finner_ticker","en","finance/models")
 .setInputCols(Array("document","token","embeddings")) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("document","token","ner")) 
 .setOutputCol("ner_chunk") 

val CM = ChunkMapperModel.pretrained("finmapper_nasdaq_ticker_stock_screener","en","finance/models")
 .setInputCols("ner_chunk")
 .setOutputCol("mappings") 

val pipeline = new Pipeline().setStages(Array( 
  document_assembler, 
  tokenizer, 
  embeddings, 
  ner_model, 
  ner_converter, 
  CM) ) 
 
val text ="""There are some serious purchases and sales of AMZN stock today."""
val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result|result                                                                                                                                                             |
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[AMZN]|[AMZN, Amazon.com Inc. Common Stock, $98.12, 2.85, 2.991%, 9.98556270184E11, United States, 1997, 85412563, Consumer Discretionary, Catalog/Specialty Distribution]|
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val tokenizer = new Tokenizer()
 .setInputCols("document") 
 .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("glove_100d")
 .setInputCols(Array("document","token")) 
 .setOutputCol("embeddings") 

val ner_model = NerDLModel.pretrained("onto_100","en")
 .setInputCols(Array("document","token","embeddings")) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("document","token","ner")) 
 .setOutputCol("ner_chunk") 
 .setWhiteList(Array("CARDINAL")) 

val CM = ChunkMapperModel.pretrained("legmapper_edgar_irs","en","legal/models")
 .setInputCols("ner_chunk")
 .setOutputCol("mappings") 

val pipeline = new Pipeline().setStages(Array( 
  document_assembler, 
  tokenizer, 
  embeddings, 
  ner_model, 
  ner_converter, 
  CM) ) 

val text ="""873474341 is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services"""
val data = Seq(text).toDF("text")

val result= pipeline.fit(data).transform(data)

+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result     |result                                                                                                                                                               |
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[873474341]|[Masterworks 096, LLC, RETAIL-RETAIL STORES, NEC [5990], 5990, 873474341, 1231, NY, DE, 225 LIBERTY STREET, NEW YORK, NY, 10281, 2035185172, , , 2022-01-10, 1894064]|
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Approach

We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.

This is the AnnotatorApproach of the ChunkMapper, which can be used to train ChunkMapper models by giving a custom mapping dictionary. To use pretrained models, check the documentation of the ChunkMapperModel annotator.

The annotator also allows using fuzzy matching, which can take into consideration parts of the tokens that can map even when word order is different, char ngrams that can map even when there are typos, and fuzzy distance metrics (Jaccard, Levenshtein, etc.).
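
To make the two distance families named above concrete, here is a small self-contained sketch of token-level Jaccard distance (insensitive to word order) and character-level Levenshtein distance (tolerant of typos). The helper names are hypothetical and the code only illustrates the metrics themselves; how the annotator combines or thresholds them internally is not shown here.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over lowercase whitespace tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Word order does not matter for the token-based metric.
print(jaccard_distance("metformin hydrochloride", "hydrochloride metformin"))
# A typo costs a few character edits.
print(levenshtein("metformin", "metfromin"))
```

A mapper could, for instance, accept a dictionary key when one of these distances falls below a chosen threshold; that acceptance rule is an assumption for illustration.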

Example usage and more details can be found in the Spark NLP Workshop repository on GitHub, for example in the Healthcare Chunk Mapping notebook.

Input Annotator Types: CHUNK

Output Annotator Type: LABEL_DEPENDENCY

Python API: ChunkMapperApproach Scala API: ChunkMapperApproach Notebook: ChunkMapperApproachModelNotebook

from johnsnowlabs import nlp, medical

# First, create a dictionary in JSON format following this schema:
import json
data_set= {
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }
      ]
    }
  ]
}

with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)


# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

#NER model to detect drug in the text
clinical_ner =  medical.NerModel.pretrained("ner_posology_small","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner")\
    .setLabelCasing("upper")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DRUG"])

chunkerMapper =  medical.ChunkMapperApproach()\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setDictionary("/content/sample_drug.json")\
    .setRels(["action"]) #or treatment

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    chunkerMapper])

text = ["The patient was given 1 unit of metformin daily."]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

model.stages[-1].write().save("models/drug_mapper")
from johnsnowlabs import nlp, finance

# First, create a dictionary in JSON format following this schema:
import json

data_set= {
  "mappings": [
    {
      "key": "Rayton Solar Inc.",
      "relations": [
        {
          "key": "name",
          "values" : ['Rayton Solar Inc.']
        },
        {
          "key": "sic",
          "values" : ['SEMICONDUCTORS & RELATED DEVICES [3674]']
        }]
    }]
}

with open('sample_finance.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")

finance_ner = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
  .setInputCols(["sentence","token","ner"])\
  .setOutputCol("ner_chunk")\
  .setWhiteList(["ORG"]) # Return only ORG entities

chunkerMapper =  finance.ChunkMapperApproach()\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")\
  .setDictionary("/content/sample_finance.json")\
  .setRels(["name", "sic"])  # the relation keys defined in the dictionary above

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  sentence_detector,
  tokenizer,
  word_embeddings,
  finance_ner,
  ner_converter,
  chunkerMapper])

text = ["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. "]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

model.stages[-1].write().save("models/finance_mapper")
from johnsnowlabs import nlp, legal

# First, create a dictionary in JSON format following this schema:
import json

data_set= {
  "mappings": [
    {
      "key": "Rayton Solar Inc.",
      "relations": [
        {
          "key": "name",
          "values" : ['Rayton Solar Inc.']
        },
        {
          "key": "sic",
          "values" : ['SEMICONDUCTORS & RELATED DEVICES [3674]']
        }]
    }]
}

with open('sample_legal.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
  .setInputCols(["sentence","token","ner"])\
  .setOutputCol("ner_chunk")\
  .setWhiteList(["ORG"]) # Return only ORG entities

chunkerMapper =  legal.ChunkMapperApproach()\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")\
  .setDictionary("/content/sample_legal.json")\
  .setRels(["name", "sic"])  # the relation keys defined in the dictionary above

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  sentence_detector,
  tokenizer,
  word_embeddings,
  legal_ner,
  ner_converter,
  chunkerMapper])

text = ["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. "]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

model.stages[-1].write().save("models/legal_mapper")

import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
 .setInputCols("document")
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings")

// NER model to detect drugs in the text
val clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 
 .setLabelCasing("upper") 

val ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 
 .setWhiteList(Array("DRUG")) 

val chunkerMapper = new ChunkMapperApproach()
 .setInputCols("ner_chunk") 
 .setOutputCol("mappings") 
 .setDictionary("/content/sample_drug.json") 
 .setRels(Array("action") ) //or treatment 

val pipeline = new Pipeline()
 .setStages(Array(
  document_assembler, 
  sentence_detector, 
  tokenizer, 
  word_embeddings, 
  clinical_ner, 
  ner_converter, 
  chunkerMapper) ) 

val text = "The patient was given 1 unit of metformin daily."

val test_data = Seq(text).toDF("text")
val model = pipeline.fit(test_data)

val res = model.transform(test_data)

// Save the trained mapper, which is the last stage of the fitted pipeline
model.stages.last.asInstanceOf[org.apache.spark.ml.util.MLWritable].write.save("models/drug_mapper")
import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
 .setInputCols("document")
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("sentence","token") ) 
 .setOutputCol("embeddings") 

val finance_ner = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
 .setInputCols(Array("sentence","token","embeddings") ) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("sentence","token","ner") ) 
 .setOutputCol("ner_chunk") 
 .setWhiteList(Array("ORG") ) // Return only ORG entities 

val chunkerMapper = new ChunkMapperApproach()
 .setInputCols("ner_chunk")
 .setOutputCol("mappings") 
 .setDictionary("/content/sample_finance.json")
 .setRels(Array("name", "sic")) // the relation keys defined in the dictionary

val pipeline = new Pipeline()
 .setStages(Array( 
  document_assembler, 
  sentence_detector, 
  tokenizer, 
  word_embeddings, 
  finance_ner, 
  ner_converter, 
  chunkerMapper) ) 

val text = "AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. "

val test_data = Seq(text).toDF("text")

val model = pipeline.fit(test_data)
val res = model.transform(test_data)

// Save the trained mapper, which is the last stage of the fitted pipeline
model.stages.last.asInstanceOf[org.apache.spark.ml.util.MLWritable].write.save("models/finance_mapper")
import spark.implicits._
 
val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
 .setInputCols("document")
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 

val legal_ner = LegalNerModel.pretrained("legner_org_per_role_date","en","legal/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 
 .setWhiteList("ORG") // Return only ORG entities 

val chunkerMapper = new ChunkMapperApproach()
 .setInputCols("ner_chunk")
 .setOutputCol("mappings") 
 .setDictionary("/content/sample_legal.json")
 .setRels(Array("name", "sic")) // the relation keys defined in the dictionary

val pipeline = new Pipeline()
 .setStages(Array( 
  document_assembler, 
  sentence_detector, 
  tokenizer, 
  word_embeddings, 
  legal_ner, 
  ner_converter, 
  chunkerMapper) ) 

val text = "AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. "
val test_data = Seq(text).toDF("text")
val model = pipeline.fit(test_data)

val res = model.transform(test_data)

// Save the trained mapper, which is the last stage of the fitted pipeline
model.stages.last.asInstanceOf[org.apache.spark.ml.util.MLWritable].write.save("models/legal_mapper")

ChunkMapperFilterer

Model

ChunkMapperFilterer is an annotator to be used after ChunkMapper. It filters chunks based on the result of the mapping, that is, whether the mapping succeeded or failed.

Parameters:

  • ReturnCriteria (String): Has two possible values: “success” or “fail”. If “fail” (default), returns the chunks that are not in the label dependencies; if “success”, returns the labels that were successfully mapped by the ChunkMapperModel annotator.
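
The effect of the two criteria can be sketched in plain Python. The function below is a hypothetical stand-in, not the annotator's implementation; it assumes that a failed mapping is reported as the string `NONE`, as in the example output on this page.

```python
# Hypothetical stand-in for illustration only; not a Spark NLP API.
def filter_chunks(chunks, mapper_results, return_criteria="fail", not_found="NONE"):
    """Keep chunks whose mapping failed ("fail") or succeeded ("success")."""
    if return_criteria not in ("success", "fail"):
        raise ValueError("return_criteria must be 'success' or 'fail'")
    keep_success = return_criteria == "success"
    return [
        chunk
        for chunk, mapped in zip(chunks, mapper_results)
        if (mapped != not_found) == keep_success
    ]


chunks = ["Avandia 4 mg", "Tegretol", "zitiga"]
rxnorm = ["261242", "203029", "NONE"]

print(filter_chunks(chunks, rxnorm, "fail"))
print(filter_chunks(chunks, rxnorm, "success"))
```

Filtering on "fail" is useful for routing unmapped chunks to a fallback step, such as a fuzzier mapper or manual review.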

Example usage and more details can be found in the Spark NLP Workshop repository on GitHub, for example in the Healthcare Chunk Mapping notebook.

Input Annotator Types: CHUNK, LABEL_DEPENDENCY

Output Annotator Type: CHUNK

Python API: ChunkMapperFilterer Scala API: ChunkMapperFilterer Notebook: ChunkMapperFiltererNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
      .setInputCols("sentence", "token", "ner")\
      .setOutputCol("chunk")

chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
      .setInputCols(["chunk"])\
      .setOutputCol("RxNorm_Mapper")\
      .setRels(["rxnorm_code"])

chunk_mapper_filterer = medical.ChunkMapperFilterer() \
      .setInputCols(["chunk", "RxNorm_Mapper"]) \
      .setOutputCol("chunks_fail") \
      .setReturnCriteria("fail")

mapper_pipeline = nlp.Pipeline(
      stages = [
          document_assembler,
          sentence_detector,
          tokenizer,
          word_embeddings,
          ner_model,
          ner_converter,
          chunkerMapper,
          chunk_mapper_filterer
      ])

samples = [["The patient was given Adapin 10 MG, coumadn 5 mg"],
           ["The patient was given Avandia 4 mg, Tegretol, zitiga"] ]
           
data = spark.createDataFrame(samples).toDF("text")

result = mapper_pipeline.fit(data).transform(data)

result.selectExpr("chunk.result as chunk", 
                  "RxNorm_Mapper.result as RxNorm_Mapper", 
                  "chunks_fail.result as chunks_fail").show(truncate = False)

+--------------------------------+----------------------+--------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |
+--------------------------------+----------------------+--------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |
+--------------------------------+----------------------+--------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
 
val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols("sentence", "token")
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")
    .setInputCols("sentence", "token", "embeddings")
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols("sentence", "token", "ner")
    .setOutputCol("chunk")

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")
    .setInputCols("chunk")
    .setOutputCol("RxNorm_Mapper")
    .setRels(Array("rxnorm_code"))

val chunk_mapper_filterer = new ChunkMapperFilterer()
    .setInputCols("chunk", "RxNorm_Mapper")
    .setOutputCol("chunks_fail")
    .setReturnCriteria("fail")

val mapper_pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    chunkerMapper,
    chunk_mapper_filterer
    ))


val data = Seq("The patient was given Adapin 10 MG, coumadn 5 mg",
"The patient was given Avandia 4 mg, Tegretol, zitiga").toDF("text")

val result = mapper_pipeline.fit(data).transform(data)

+--------------------------------+----------------------+--------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |
+--------------------------------+----------------------+--------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |
+--------------------------------+----------------------+--------------+

ChunkMerge

Approach

Merges two chunk columns coming from two annotators (NER, ContextualParser, or any other annotator producing chunks). The merge is made by selecting one chunk from one of the columns according to certain criteria, and the decision on which chunk to select is based on the chunk indices in the source document: chunks with longer lengths and the highest information are kept from each source. Labels can be changed with setReplaceDictResource.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.
  • mergeOverlapping: (Boolean) Sets whether to merge overlapping matched chunks. Default True.
  • falsePositivesResource: Sets file with false positive pairs
  • replaceDictResource: Sets replace dictionary pairs for NER labels
  • blackList: (String List) If defined, list of entities to ignore. The rest will be processed.
  • whiteList: (String List) If defined, list of entities to accept.
  • selectionStrategy: (String) Sets how annotations are selected: sequentially, based on annotation order (Sequential), or with another available strategy. Currently only Sequential and DiverseLonger are available. Default Sequential.
  • orderingFeatures: (String List) The ordering features to use for overlapping entities. Possible values are ChunkBegin, ChunkLength, ChunkPrecedence, ChunkConfidence.
  • defaultConfidence: (Float) Used when the ChunkConfidence ordering feature is included and a given annotation does not have a confidence score. The value of this param is then used as the confidence score for that annotation.
  • chunkPrecedence: (String List) Sets the precedence order to apply when a chunk is labeled by two models.
  • chunkPrecedenceValuePrioritization: (String List) Used when the ChunkPrecedence ordering feature is included. This param contains an Array of comma-separated values representing the desired order of prioritization for the values in the metadata fields included from chunkPrecedence.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
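
The idea of keeping the longer chunk when chunks from different sources overlap can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's algorithm: chunks are `(begin, end, text, label)` tuples with inclusive end indices, the function name `merge_chunks` is hypothetical, and ties between equally long chunks are resolved in favour of the column listed first.

```python
# Hypothetical, simplified illustration; not the ChunkMergeApproach implementation.
def merge_chunks(*chunk_lists):
    """Merge chunk lists, keeping the longest chunk among overlapping ones."""
    candidates = [chunk for chunks in chunk_lists for chunk in chunks]
    # Longest first; the sort is stable, so equal-length chunks keep input order
    candidates.sort(key=lambda chunk: -(chunk[1] - chunk[0]))
    kept = []
    for begin, end, text, label in candidates:
        overlaps = any(
            begin <= kept_end and kept_begin <= end
            for kept_begin, kept_end, _, _ in kept
        )
        if not overlaps:
            kept.append((begin, end, text, label))
    # Return the surviving chunks in document order
    return sorted(kept)


posology_chunks = [(47, 57, "amoxicillin", "DRUG")]
clinical_chunks = [
    (47, 57, "amoxicillin", "TREATMENT"),
    (63, 91, "a respiratory tract infection", "PROBLEM"),
]

for chunk in merge_chunks(posology_chunks, clinical_chunks):
    print(chunk)
```

In the annotator itself, the ordering is configurable through parameters such as orderingFeatures and chunkPrecedence, so the fixed "longest wins, first column breaks ties" rule here is only one possible configuration.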

Input Annotator Types: CHUNK, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkMergeApproach Scala API: ChunkMergeApproach
Show Example

from johnsnowlabs import nlp, medical

# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# 1- ner_clinical model
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("clinical_ner")

clinical_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "clinical_ner"]) \
    .setOutputCol("clinical_ner_chunk")

# 2- posology ner model
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("posology_ner")

posology_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "posology_ner"]) \
    .setOutputCol("posology_ner_chunk")

# 3- generate a text matcher annotator that extracts female related entities
entities = ['she', 'her', 'girl', 'woman', 'women', 'womanish', 'womanlike', 'womanly', 'madam', 'madame', 'senora', 'lady', 'miss', 'girlfriend', 'wife', 'bride', 'misses', 'mrs.', 'female']
with open ('female_entities.txt', 'w') as f:
    for i in entities:
        f.write(i+'\n')

# Find female entities using TextMatcher
female_entity_extractor = nlp.TextMatcher() \
    .setInputCols(["sentence",'token'])\
    .setOutputCol("female_entities")\
    .setEntities("female_entities.txt")\
    .setCaseSensitive(False)\
    .setEntityValue('female_entity')

# Chunk Merge annotator is used to merge columns
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("posology_ner_chunk", 'clinical_ner_chunk', "female_entities")\
    .setOutputCol('merged_ner_chunk')

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    clinical_ner_converter,
    posology_ner,
    posology_ner_converter,
    female_entity_extractor,
    chunk_merger])

sample_text = """The lady was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."""


data = spark.createDataFrame([[sample_text]]).toDF("text")
model = nlpPipeline.fit(data).transform(data)

# Show results
model.selectExpr("explode(merged_ner_chunk) as a") \
  .selectExpr("a.begin","a.end","a.result as chunk","a.metadata.entity as entity") \
  .show(10, False)
+-----+---+-----------------------------+-------------+
|begin|end|chunk                        |entity       |
+-----+---+-----------------------------+-------------+
|4    |7  |lady                         |female_entity|
|47   |57 |amoxicillin                  |DRUG         |
|63   |91 |a respiratory tract infection|PROBLEM      |
|95   |97 |She                          |female_entity|
|106  |114|metformin                    |DRUG         |
|118  |126|glipizide                    |TREATMENT    |
|134  |146|dapagliflozin                |TREATMENT    |
|152  |155|T2DM                         |PROBLEM      |
|161  |172|atorvastatin                 |DRUG         |
|178  |188|gemfibrozil                  |TREATMENT    |
+-----+---+-----------------------------+-------------+
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F  # needed to explode the merged chunks when showing results

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert_embeddings")

fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter =  finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"

ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "bert_embeddings"]) \
    .setOutputCol("ner_finner") 
    #.setLabelCasing("upper")

ner_converter_finner = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") \
    .setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER

chunk_merge =  finance.ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge])

data = spark.createDataFrame([["Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"]]).toDF("text")

# Show results
result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result, 
                                     result.deid_merged_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|founder              |ROLE     |
|CEO                  |ROLE     |
|Amazon               |PARTY    |
+---------------------+---------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F  # needed to explode the merged chunks when showing results

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"})

ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_signers") 
    #.setLabelCasing("upper")

ner_converter_signers = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_signers"]) \
    .setOutputCol("ner_signer_chunk")

chunk_merge = legal.ChunkMergeApproach()\
    .setInputCols("ner_signer_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge])


data = spark.createDataFrame([["""ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.

2THEMART.COM, INC.:  I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99    Date: 6/11/99 """]]).toDF("text")

# Show results
result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result, 
                                     result.deid_merged_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+-----------------------+--------------+
|chunk                  |ner_label     |
+-----------------------+--------------+
|ENTIRE AGREEMENT       |DOC           |
|INC                    |PARTY         |
|J. Magliarditi         |SIGNING_PERSON|
|Bajaj                  |SIGNING_PERSON|
|Dominic J. Magliarditi |SIGNING_PERSON|
|Sanjay Bajaj           |SIGNING_PERSON|
|President              |SIGNING_TITLE |
|VP Business Development|SIGNING_TITLE |
+-----------------------+--------------+

import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP 
val documentAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 
 
// Sentence Detector annotator,processes various sentences per line 
val sentenceDetector = new SentenceDetector()
 .setInputCols("document")
 .setOutputCol("sentence") 
 
// Tokenizer splits words in a relevant format for NLP 
val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 
 
// Clinical word embeddings trained on PubMED dataset 
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 
 
// 1- ner_clinical model 
val clinical_ner = MedicalNerModel.pretrained("ner_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("clinical_ner") 

val clinical_ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","clinical_ner")) 
 .setOutputCol("clinical_ner_chunk") 
 
// 2- posology ner model 
val posology_ner = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("posology_ner") 

val posology_ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","posology_ner")) 
 .setOutputCol("posology_ner_chunk") 
 
// 3- generate a text matcher annotator that extracts female related entities 
val entities = Array("she","her","girl","woman","women","womanish","womanlike","womanly","madam","madame","senora","lady","miss","girlfriend","wife","bride","misses","mrs.","female")

// Write one entity per line to a text file for TextMatcher
val writer = new java.io.PrintWriter("female_entities.txt")
entities.foreach(entity => writer.println(entity))
writer.close()
 
// Find female entities using TextMatcher 
val female_entity_extractor = new TextMatcher()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("female_entities") 
 .setEntities("female_entities.txt", ReadAs.TEXT) // ReadAs comes from com.johnsnowlabs.nlp.util.io
 .setCaseSensitive(false) 
 .setEntityValue("female_entity") 
 
// Chunk Merge annotator is used to merge columns 
val chunk_merger = new ChunkMergeApproach()
 .setInputCols(Array("posology_ner_chunk","clinical_ner_chunk","female_entities"))
 .setOutputCol("merged_ner_chunk") 

val nlpPipeline = new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    clinical_ner, 
    clinical_ner_converter, 
    posology_ner, 
    posology_ner_converter, 
    female_entity_extractor, 
    chunk_merger)) 

val text ="""The lady was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."""
val data = Seq(text).toDF("text")

val model = nlpPipeline.fit(data).transform(data)

+-----+---+-----------------------------+-------------+
|begin|end|chunk                        |entity       |
+-----+---+-----------------------------+-------------+
|4    |7  |lady                         |female_entity|
|47   |57 |amoxicillin                  |DRUG         |
|63   |91 |a respiratory tract infection|PROBLEM      |
|95   |97 |She                          |female_entity|
|106  |114|metformin                    |DRUG         |
|118  |126|glipizide                    |TREATMENT    |
|134  |146|dapagliflozin                |TREATMENT    |
|152  |155|T2DM                         |PROBLEM      |
|161  |172|atorvastatin                 |DRUG         |
|178  |188|gemfibrozil                  |TREATMENT    |
+-----+---+-----------------------------+-------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val bert_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("bert_embeddings")

val fin_ner = FinanceNerModel.pretrained('finner_deid', "en", "finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

val ner_converter =  new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"

val ner_finner = FinanceNerModel.pretrained("finner_org_per_role_date", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "bert_embeddings"))
    .setOutputCol("ner_finner")
    // .setLabelCasing("upper")

val ner_converter_finner = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner_finner"))
    .setOutputCol("ner_finner_chunk")
    .setWhiteList("ROLE") // Keep only the "ROLE" entity from this NER

val chunk_merge =  new ChunkMergeApproach()
    .setInputCols(Array("ner_finner_chunk", "ner_chunk"))
    .setOutputCol("deid_merged_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge))

val data = Seq(("Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon")).toDF("text")

// Show results
val result = nlpPipeline.fit(data).transform(data)

+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|founder              |ROLE     |
|CEO                  |ROLE     |
|Amazon               |PARTY    |
+---------------------+---------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCol("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCol("sentence")
    .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val legal_ner = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner") 
    // .setLabelCasing("upper")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels(Map("ALIAS" -> "PARTY"))

val ner_signers = LegalNerModel.pretrained("legner_signers", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner_signers") 
    // .setLabelCasing("upper")

val ner_converter_signers = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner_signers"))
    .setOutputCol("ner_signer_chunk")

val chunk_merge = new ChunkMergeApproach()
    .setInputCols(Array("ner_signer_chunk", "ner_chunk"))
    .setOutputCol("deid_merged_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge))

val data = Seq("""ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.
2THEMART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99 Date: 6/11/99 """).toDF("text")

// Show results
val result = nlpPipeline.fit(data).transform(data)

+-----------------------+--------------+
|chunk                  |ner_label     |
+-----------------------+--------------+
|ENTIRE AGREEMENT       |DOC           |
|INC                    |PARTY         |
|J. Magliarditi         |SIGNING_PERSON|
|Bajaj                  |SIGNING_PERSON|
|Dominic J. Magliarditi |SIGNING_PERSON|
|Sanjay Bajaj           |SIGNING_PERSON|
|President              |SIGNING_TITLE |
|VP Business Development|SIGNING_TITLE |
+-----------------------+--------------+

ChunkSentenceSplitter

Model

The ChunkSentenceSplitter annotator splits documents into paragraphs, using the chunks provided in a CHUNK column as separators. It is useful when you need to apply different models or analyses to different sections of a document (for example, different headers, clauses, or items). The separator chunks can be the output of, for example, RegexMatcher or NerModel.

Parameters:

  • GroupBySentences: (boolean) Whether to split the paragraphs by grouping the chunks by sentence.

  • InsertChunk: (boolean) Whether to insert the chunk in the paragraph or not.

  • DefaultEntity: (str) The key in the metadata dictionary to filter on (by default ‘entity’).
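
To make the splitting behaviour concrete, here is a minimal pure-Python sketch of the idea. It is not the annotator's implementation: the function `split_by_chunks` and its tuple layout are hypothetical names chosen for illustration, grouping by sentences is not modelled, and the "introduction"/"UNK" values for text preceding the first separator simply mirror the example output shown on this page.

```python
from typing import Dict, List, Tuple


def split_by_chunks(
    text: str,
    chunks: List[Tuple[int, int, str]],  # (begin, inclusive end, entity label)
    insert_chunk: bool = True,
) -> List[Dict[str, str]]:
    """Split `text` into paragraphs, each starting at a separator chunk."""
    paragraphs = []
    ordered = sorted(chunks)

    # Text before the first separator becomes an "introduction" paragraph.
    first_begin = ordered[0][0] if ordered else len(text)
    if text[:first_begin].strip():
        paragraphs.append(
            {"entity": "introduction", "splitter_chunk": "UNK", "result": text[:first_begin]}
        )

    for i, (begin, end, entity) in enumerate(ordered):
        # A paragraph runs until the next separator (or the end of the text).
        next_begin = ordered[i + 1][0] if i + 1 < len(ordered) else len(text)
        # Keep or drop the separator itself, in the spirit of InsertChunk.
        start = begin if insert_chunk else end + 1
        paragraphs.append(
            {"entity": entity, "splitter_chunk": text[begin:end + 1], "result": text[start:next_begin]}
        )
    return paragraphs


text = "Sample Name: Biopsy\nANESTHESIA: General.\nSPECIMEN: Pleural biopsies."
headers = ["ANESTHESIA:", "SPECIMEN:"]
# Character offsets of each header, as an NER or regex step might report them.
chunks = [(text.index(h), text.index(h) + len(h) - 1, "Header") for h in headers]

for paragraph in split_by_chunks(text, chunks):
    print(paragraph["entity"], "|", paragraph["splitter_chunk"], "|", paragraph["result"].strip())
```

In the real pipeline the separator offsets come from the CHUNK input column rather than from `text.index`, and the resulting paragraphs are emitted as DOCUMENT annotations.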

For detailed usage of this annotator, visit this notebook from our Spark NLP Workshop.

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: DOCUMENT

Python API: ChunkSentenceSplitter Scala API: ChunkSentenceSplitter Notebook: ChunkSentenceSplitterNotebook

from johnsnowlabs import nlp, medical

# Defining the pipeline
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl_slim", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Header"])

#applying ChunkSentenceSplitter
chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)

pipeline_model = nlp.Pipeline(
  stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    chunkSentenceSplitter
  ])


sentences = [["""Sample Name: Mesothelioma - Pleural Biopsy
Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)
PREOPERATIVE DIAGNOSIS:  Right pleural effusion and suspected malignant mesothelioma.
POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.
ANESTHESIA: General double-lumen endotracheal.
DESCRIPTION OF FINDINGS:  Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
SPECIMEN:  Pleural biopsies for pathology and microbiology.
INDICATIONS:  Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed.
Dr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case."""]]

df = spark.createDataFrame(sentences).toDF("text")
paragraphs = pipeline_model.fit(df).transform(df)

paragraphs.selectExpr("explode(paragraphs) as result")\
          .selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").show(truncate=80)

+--------------------------------------------------------------------------------+------------+------------------------+
|                                                                          result|      entity|          splitter_chunk|
+--------------------------------------------------------------------------------+------------+------------------------+
|                                     Sample Name: Mesothelioma - Pleural Biopsy |introduction|                     UNK|
|Description: Right pleural effusion and suspected malignant mesothelioma. (Me...|      Header|            Description:|
|PREOPERATIVE DIAGNOSIS:  Right pleural effusion and suspected malignant mesot...|      Header| PREOPERATIVE DIAGNOSIS:|
|POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothel...|      Header|POSTOPERATIVE DIAGNOSIS:|
|                                 ANESTHESIA: General double-lumen endotracheal. |      Header|             ANESTHESIA:|
|DESCRIPTION OF FINDINGS:  Right pleural effusion, firm nodules, diffuse scatt...|      Header|DESCRIPTION OF FINDINGS:|
|                    SPECIMEN:  Pleural biopsies for pathology and microbiology. |      Header|               SPECIMEN:|
|INDICATIONS:  Briefly, this is a 66-year-old gentleman who has been transferr...|      Header|            INDICATIONS:|
+--------------------------------------------------------------------------------+------------+------------------------+
from johnsnowlabs import nlp, finance, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_headers", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

chunkSentenceSplitter = legal.ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)
    
nlp_pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    chunkSentenceSplitter])


text = """2. DEFINITION. 

For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1 and 2 Appointment as Reseller.

2.1 Appointment. 

The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6

2.2 Customer Agreements."""

sdf = spark.createDataFrame([[text]]).toDF("text")
paragraphs = nlp_pipeline.fit(sdf).transform(sdf)

paragraphs.selectExpr("explode(paragraphs) as result")\
          .selectExpr("result.result","result.metadata.entity").show(truncate=50)

+--------------------------------------------------+---------+
|                                            result|   entity|
+--------------------------------------------------+---------+
|                                               2. |   HEADER|
|DEFINITION.   For purposes of this Agreement, t...|SUBHEADER|
|                               2.1 Appointment.   |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
|                        6  2.2 Customer Agreements|   HEADER|
+--------------------------------------------------+---------+
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_headers", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

chunkSentenceSplitter = legal.ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)
    
nlp_pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    chunkSentenceSplitter])


text = """AGREEMENT

NOW, THEREFORE, for good and valuable consideration, and in consideration of the mutual covenants and conditions herein contained, the Parties agree as follows:

2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller.

2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6

2.2 Customer Agreements.

a) Subscriptions. Allscripts and its Affiliates may sell Subscriptions for terms no less than one year and no greater than four (4) years on a subscription basis to Persons who subsequently execute a Customer Agreement, provided that Allscripts may enter into Customer Agreements with terms longer than four (4) years with large organizations, provided that Phreesia consents in each instance in writing in advance, which consent will not be unreasonably withheld."""

sdf = spark.createDataFrame([[text]]).toDF("text")
paragraphs = nlp_pipeline.fit(sdf).transform(sdf)

paragraphs.selectExpr("explode(paragraphs) as result")\
          .selectExpr("result.result","result.metadata.entity").show(truncate=50)

+--------------------------------------------------+---------+
|                                            result|   entity|
+--------------------------------------------------+---------+
|AGREEMENT NOW, THEREFORE, for good and valuabl... |SUBHEADER|
|                        Appointment as Reseller.  |SUBHEADER|
|                                 2.1 Appointment. |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
|                     6 2.2 Customer Agreements.   |   HEADER|
|a) Subscriptions. Allscripts and its Affiliates...|SUBHEADER|
+--------------------------------------------------+---------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList("Header")

// Apply ChunkSentenceSplitter
val chunkSentenceSplitter = new ChunkSentenceSplitter()
    .setInputCols(Array("document","ner_chunk"))
    .setOutputCol("paragraphs")
    .setGroupBySentences(false)

val pipeline_model = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    chunkSentenceSplitter
))


val sentences = ("""Sample Name: Mesothelioma - Pleural Biopsy
Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)
PREOPERATIVE DIAGNOSIS:  Right pleural effusion and suspected malignant mesothelioma.
POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.
ANESTHESIA: General double-lumen endotracheal.
DESCRIPTION OF FINDINGS:  Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
SPECIMEN:  Pleural biopsies for pathology and microbiology.
INDICATIONS:  Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed.
Dr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case.""")

val data = Seq(sentences).toDF("text")
val paragraphs = pipeline_model.fit(data).transform(data)


+--------------------------------------------------------------------------------+------------+------------------------+
|                                                                          result|      entity|          splitter_chunk|
+--------------------------------------------------------------------------------+------------+------------------------+
|                                     Sample Name: Mesothelioma - Pleural Biopsy |introduction|                     UNK|
|Description: Right pleural effusion and suspected malignant mesothelioma. (Me...|      Header|            Description:|
|PREOPERATIVE DIAGNOSIS:  Right pleural effusion and suspected malignant mesot...|      Header| PREOPERATIVE DIAGNOSIS:|
|POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothel...|      Header|POSTOPERATIVE DIAGNOSIS:|
|                                 ANESTHESIA: General double-lumen endotracheal. |      Header|             ANESTHESIA:|
|DESCRIPTION OF FINDINGS:  Right pleural effusion, firm nodules, diffuse scatt...|      Header|DESCRIPTION OF FINDINGS:|
|                    SPECIMEN:  Pleural biopsies for pathology and microbiology. |      Header|               SPECIMEN:|
|INDICATIONS:  Briefly, this is a 66-year-old gentleman who has been transferr...|      Header|            INDICATIONS:|
+--------------------------------------------------------------------------------+------------+------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
        
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = FinanceNerModel.pretrained("finner_headers", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val chunkSentenceSplitter = new ChunkSentenceSplitter()
    .setInputCols(Array("document","ner_chunk"))
    .setOutputCol("paragraphs")
    .setGroupBySentences(false)

val nlp_pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    chunkSentenceSplitter))


val text = """2. DEFINITION. 

For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1 and 2 Appointment as Reseller.

2.1 Appointment. 

The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6

2.2 Customer Agreements."""

val data = Seq(text).toDF("text")
val paragraphs = nlp_pipeline.fit(data).transform(data)


+--------------------------------------------------+---------+
|                                            result|   entity|
+--------------------------------------------------+---------+
|                                               2. |   HEADER|
|DEFINITION.   For purposes of this Agreement, t...|SUBHEADER|
|                               2.1 Appointment.   |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
|                        6  2.2 Customer Agreements|   HEADER|
+--------------------------------------------------+---------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
        
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = LegalNerModel.pretrained("legner_headers", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val chunkSentenceSplitter = new ChunkSentenceSplitter()
    .setInputCols(Array("document","ner_chunk"))
    .setOutputCol("paragraphs")
    .setGroupBySentences(false)
    
val nlp_pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    chunkSentenceSplitter))


val text = """AGREEMENT

NOW, THEREFORE, for good and valuable consideration, and in consideration of the mutual covenants and conditions herein contained, the Parties agree as follows:

2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller.

2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6

2.2 Customer Agreements.

a) Subscriptions. Allscripts and its Affiliates may sell Subscriptions for terms no less than one year and no greater than four (4) years on a subscription basis to Persons who subsequently execute a Customer Agreement, provided that Allscripts may enter into Customer Agreements with terms longer than four (4) years with large organizations, provided that Phreesia consents in each instance in writing in advance, which consent will not be unreasonably withheld."""

val data = Seq(text).toDF("text")
val paragraphs = nlp_pipeline.fit(data).transform(data)

+--------------------------------------------------+---------+
|                                            result|   entity|
+--------------------------------------------------+---------+
|AGREEMENT NOW, THEREFORE, for good and valuabl... |SUBHEADER|
|                        Appointment as Reseller.  |SUBHEADER|
|                                 2.1 Appointment. |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
|                     6 2.2 Customer Agreements.   |   HEADER|
|a) Subscriptions. Allscripts and its Affiliates...|SUBHEADER|
+--------------------------------------------------+---------+

ContextualParser

Approach

Creates a model that extracts entities from a document based on user-defined rules. Rule matching is based on a RegexMatcher defined in a JSON file, whose path is set with setJsonPath(). The JSON file defines the regex to match, along with the information to output in the metadata field. Additionally, a dictionary can be provided with setDictionary to map extracted entities to a unified representation. The first column of the dictionary file should hold the representation, and the following columns the possible matches.
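
The dictionary layout described above (representation first, possible matches after it) can be sketched in plain Python. This only illustrates the file format and the lookup it implies; the sample rows and the helper `load_dictionary` are hypothetical, and the annotator's own parsing options are not modelled here.

```python
import csv
import io

# Hypothetical dictionary content: the unified representation comes first,
# followed by the surface forms that should map to it (tab-separated).
DICTIONARY_TSV = "female\twoman\tlady\tgirl\nmale\tman\tgentleman\tboy\n"


def load_dictionary(tsv_text: str, case_sensitive: bool = False) -> dict:
    """Build a {possible match -> representation} lookup from TSV text."""
    lookup = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        representation, variants = row[0], row[1:]
        for variant in [representation, *variants]:
            key = variant if case_sensitive else variant.lower()
            lookup[key] = representation
    return lookup


lookup = load_dictionary(DICTIONARY_TSV)
print(lookup["lady"], lookup["gentleman"])
```

A file with this layout would then be passed to the annotator through setDictionary.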

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.
  • jsonPath: Path to the JSON file containing the regex patterns and rules used to match the entities.
  • dictionary: Path to the dictionary file in TSV or CSV format.
  • caseSensitive: Whether matching is case sensitive.
  • prefixAndSuffixMatch: Whether to match both prefix and suffix to annotate the match.
  • optionalContextRules: When set to true, it will output regex match regardless of context matches.
  • shortestContextMatch: When set to true, it will stop searching for matches once prefix/suffix data is found in the text.
  • completeContextMatch: Whether to do an exact match of prefix and suffix.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
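
As a small illustration of the jsonPath parameter, the rule file can be generated from Python before the pipeline is built. The rule below is the `regex_token.json` example used on this page; the temporary directory is a hypothetical location, and the `re.compile` call is only a rough sanity check, since the annotator evaluates the pattern on the JVM and Python's regex dialect may differ.

```python
import json
import re
import tempfile
from pathlib import Path

# Rule taken from the regex_token.json example on this page.
rule = {
    "entity": "Stage",
    "ruleScope": "sentence",
    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
    "matchScope": "token",
}

# Hypothetical location for the rule file; use whatever path your cluster can read.
json_path = Path(tempfile.mkdtemp()) / "regex_token.json"
json_path.write_text(json.dumps(rule, indent=2), encoding="utf-8")

# Read the file back and check that the pattern at least compiles in Python.
loaded = json.loads(json_path.read_text(encoding="utf-8"))
re.compile(loaded["regex"])  # raises re.error if the pattern is malformed

print(json_path)
```

The resulting path, converted with `str(json_path)`, is what would be handed to `.setJsonPath(...)`.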

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Python API: ContextualParserApproach Scala API: ContextualParserApproach Notebook: ContextualParserApproachNotebook

from johnsnowlabs import nlp, medical

# An example JSON file `regex_token.json` can look like this:
#
# {
#    "entity": "Stage",
#    "ruleScope": "sentence",
#    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
#    "matchScope": "token"
#  }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

contextualParser = medical.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setJsonPath("/path/to/regex_token.json") \
    .setCaseSensitive(True) \
    .setContextMatch(False)

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ])

# Example input text (the JSON rule file must exist at the path given above)
data = spark.createDataFrame([["A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... "]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)
+-------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                      |
+-------------------------------------------------------------------------------------------------------------------------+
|{chunk, 32, 39, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}   |
|{chunk, 49, 50, T5, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}         |
|{chunk, 148, 156, cT4bcN2M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 1}, []}|
|{chunk, 189, 194, T?N3M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 2}, []}   |
|{chunk, 316, 323, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 3}, []} |
+-------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

# An example JSON file `regex_token.json` can look like this:
#
# {
#    "entity": "Stage",
#    "ruleScope": "sentence",
#    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
#    "matchScope": "token"
#  }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Define the parser (json file needs to be provided)

contextualParser = finance.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setJsonPath("/path/to/regex_token.json") \
    .setCaseSensitive(True) \
    .setContextMatch(False)

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ])

# Example input text
data = spark.createDataFrame([["Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City ."]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)

+---------------------------------------------------------------+
|result                                                         |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+
from johnsnowlabs import nlp, legal

# An example JSON file `regex_token.json` can look like this:
#
# {
#    "entity": "Stage",
#    "ruleScope": "sentence",
#    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
#    "matchScope": "token"
#  }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

contextualParser = legal.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setJsonPath("/path/to/regex_token.json") \
    .setCaseSensitive(True) \
    .setContextMatch(False)

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ])

# Example input text (the JSON rule file must exist at the path given above)
data = spark.createDataFrame([["Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City ."]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)

+---------------------------------------------------------------+
|result                                                         |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+

import spark.implicits._

// An example JSON file `regex_token.json` can look like this:
//
// {
//    "entity": "Stage",
//    "ruleScope": "sentence",
//    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
//    "matchScope": "token"
//  }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val contextualParser = new ContextualParserApproach()
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("entity")
    .setJsonPath("/path/to/regex_token.json")
    .setCaseSensitive(true)
    .setContextMatch(false)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ))

// Define the parser (json file needs to be provided)
val data = Seq("A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... ").toDF("text")

val result = pipeline.fit(data).transform(data)

// Show Results
//
// result.selectExpr("explode(entity)").show(5, truncate=false)
// +-------------------------------------------------------------------------------------------------------------------------+
// |col                                                                                                                      |
// +-------------------------------------------------------------------------------------------------------------------------+
// |{chunk, 32, 39, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}   |
// |{chunk, 49, 50, T5, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}         |
// |{chunk, 148, 156, cT4bcN2M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 1}, []}|
// |{chunk, 189, 194, T?N3M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 2}, []}   |
// |{chunk, 316, 323, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 3}, []} |
// +-------------------------------------------------------------------------------------------------------------------------+
//
import spark.implicits._

// An example JSON file `regex_token.json` can look like this:
//
// {
//    "entity": "Stage",
//    "ruleScope": "sentence",
//    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
//    "matchScope": "token"
//  }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val contextualParser = new ContextualParserApproach()
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("entity")
    .setJsonPath("/path/to/regex_token.json")
    .setCaseSensitive(true)
    .setContextMatch(false)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ))

// Define the parser (json file needs to be provided)
val data = Seq("Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City .").toDF("text")

val result = pipeline.fit(data).transform(data)

// Show Results

+---------------------------------------------------------------+
|result                                                         |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+
import spark.implicits._

// An example JSON file `regex_token.json` can look like this:
//
// {
//    "entity": "Stage",
//    "ruleScope": "sentence",
//    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
//    "matchScope": "token"
//  }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val contextualParser = new ContextualParserApproach()
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("entity")
    .setJsonPath("/path/to/regex_token.json")
    .setCaseSensitive(true)
    .setContextMatch(false)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ))

// Define the parser (json file needs to be provided)
val data = Seq("Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City .").toDF("text")

val result = pipeline.fit(data).transform(data)

// Show Results
+---------------------------------------------------------------+
|result                                                         |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+

DateNormalizer

Model

This annotator transforms date mentions into a common standard format: YYYY/MM/DD. It is useful when working with data from different sources, sometimes from different countries, that use different formats to represent dates.

For relative dates (next year, past month, etc.), you can define an anchor date from which the normalized date is computed by setting the parameters anchorDateYear, anchorDateMonth, and anchorDateDay.

The resulting date chunk contains metadata indicating whether the normalization was successful or not (True / False).
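
To make the role of the anchor date concrete, here is a minimal pure-Python sketch of how a few relative expressions could be resolved against an anchor. It is not the library's implementation: the function `resolve_relative` and its tiny vocabulary are hypothetical, and only the standard `datetime` module is used. The anchor 2000/12/06 is an assumption chosen so that the printed values line up with the example output shown later in this section.

```python
from datetime import date, timedelta

def resolve_relative(expression: str, anchor: date) -> date:
    """Resolve a few relative date expressions against an anchor date (illustrative only)."""
    text = expression.strip().lower()
    if text == "today":
        return anchor
    if text == "next week":
        return anchor + timedelta(days=7)
    if text == "next monday":
        # Days until the following Monday (weekday 0); a Monday anchor maps to +7.
        days_ahead = (0 - anchor.weekday()) % 7 or 7
        return anchor + timedelta(days=days_ahead)
    raise ValueError(f"unsupported expression: {expression!r}")

# Assumed anchor: year 2000 as in the examples, with a hypothetical month and day.
anchor = date(2000, 12, 6)
for expr in ["today", "next monday", "next week"]:
    print(expr, "->", resolve_relative(expr, anchor).strftime("%Y/%m/%d"))
```

Changing only the anchor changes every relative result, which is why the annotator exposes the year, month, and day of the anchor as separate parameters.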

Parameters:

  • anchorDateYear: (Int) Sets the anchor year for relative dates such as "the day after tomorrow". If not set, the current year is used.

  • anchorDateMonth: (Int) Sets the anchor month for relative dates such as "the day after tomorrow". If not set, the current month is used.

  • anchorDateDay: (Int) Sets the anchor day of the month for relative dates such as "the day after tomorrow". If not set, the current day is used.

  • outputDateformat: (string) Selects the output format. If not set, dates are formatted as YYYY/MM/DD. Options are:
    • eu: Format the dates as DD/MM/YYYY
    • us: Format the dates as MM/DD/YYYY

  • defaultReplacementDay: (Int) Day value to use when the original date entity has no day information. Defaults to 15.

  • defaultReplacementMonth: (Int) Month value to use when the original date entity has no month information. Defaults to 06.

  • defaultReplacementYear: (Int) Year value to use when the original date entity has no year information. Defaults to 2020.
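
The interplay between the default replacement values and the output format can be sketched in plain Python. This is an illustration under stated assumptions, not the annotator's code: `format_date` is a hypothetical helper, the defaults mirror the documented ones (15, 06, 2020), and the `"eu"` / `"us"` options follow the layouts listed above.

```python
from typing import Optional

def format_date(year: Optional[int], month: Optional[int], day: Optional[int],
                output_format: Optional[str] = None,
                default_day: int = 15, default_month: int = 6,
                default_year: int = 2020) -> str:
    """Fill missing date parts with defaults, then render in the requested layout."""
    y = year if year is not None else default_year
    m = month if month is not None else default_month
    d = day if day is not None else default_day
    if output_format == "eu":
        return f"{d:02d}/{m:02d}/{y:04d}"
    if output_format == "us":
        return f"{m:02d}/{d:02d}/{y:04d}"
    # No format set: fall back to YYYY/MM/DD.
    return f"{y:04d}/{m:02d}/{d:02d}"

# "11/2018" carries no day, so the default day (15) is used.
print(format_date(2018, 11, None))
print(format_date(2018, 11, None, "eu"))
print(format_date(2018, 11, None, "us"))
```

With the default layout, the first call yields 2018/11/15, the same value the examples below show for the input 11/2018.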

Input Annotator Types: CHUNK

Output Annotator Type: CHUNK

Python API: DateNormalizer Scala API: DateNormalizer Notebook: DateNormalizerNotebook
Show Example

from johnsnowlabs import nlp, medical
from pyspark.sql.types import StringType

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

dates = [
    "08/02/2018",
    "11/2018",
    "11/01/2018",
    "12Mar2021",
    "Jan 30, 2018",
    "13.04.1999",
    "3April 2020",
    "next monday",
    "today",
    "next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")

result = pipeline.fit(df).transform(df)
result.selectExpr(
    "date.result as normalized_date",
    "original_date",
    "date.metadata[0].normalized as metadata",
).show()

+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+
from johnsnowlabs import nlp, finance
from pyspark.sql.types import StringType

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = finance.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

dates = [
    "08/02/2018",
    "11/2018",
    "11/01/2018",
    "12Mar2021",
    "Jan 30, 2018",
    "13.04.1999",
    "3April 2020",
    "next monday",
    "today",
    "next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")

result = pipeline.fit(df).transform(df)
result.selectExpr(
    "date.result as normalized_date",
    "original_date",
    "date.metadata[0].normalized as metadata",
).show()

+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+
from johnsnowlabs import nlp, legal
from pyspark.sql.types import StringType

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = legal.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

dates = [
    "08/02/2018",
    "11/2018",
    "11/01/2018",
    "12Mar2021",
    "Jan 30, 2018",
    "13.04.1999",
    "3April 2020",
    "next monday",
    "today",
    "next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")

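result = pipeline.fit(df).transform(df)
result.selectExpr(
    "date.result as normalized_date",
    "original_date",
    "date.metadata[0].normalized as metadata",
).show()
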
+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("original_date")
    .setOutputCol("document")

val doc2chunk = new Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("date_chunk")

val date_normalizer = new DateNormalizer()
    .setInputCols("date_chunk")
    .setOutputCol("date")
    .setAnchorDateYear(2000)

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    doc2chunk, 
    date_normalizer
))

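val df = Seq(
    "08/02/2018", "11/2018", "11/01/2018", "12Mar2021", "Jan 30, 2018",
    "13.04.1999", "3April 2020", "next monday", "today", "next week"
).toDF("original_date")

val result = pipeline.fit(df).transform(df)
result.selectExpr(
    "date.result as normalized_date",
    "original_date",
    "date.metadata[0].normalized as metadata"
).show()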

+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("original_date")
    .setOutputCol("document")

val doc2chunk = new Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("date_chunk")

val date_normalizer = new DateNormalizer()
    .setInputCols("date_chunk")
    .setOutputCol("date")
    .setAnchorDateYear(2000)

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    doc2chunk, 
    date_normalizer
))
 
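val df = Seq(
    "08/02/2018", "11/2018", "11/01/2018", "12Mar2021", "Jan 30, 2018",
    "13.04.1999", "3April 2020", "next monday", "today", "next week"
).toDF("original_date")

val result = pipeline.fit(df).transform(df)
result.selectExpr(
    "date.result as normalized_date",
    "original_date",
    "date.metadata[0].normalized as metadata"
).show()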


+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("original_date")
    .setOutputCol("document")

val doc2chunk = new Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("date_chunk")

val date_normalizer = new DateNormalizer()
    .setInputCols("date_chunk")
    .setOutputCol("date")
    .setAnchorDateYear(2000)

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    doc2chunk, 
    date_normalizer
))
 
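val df = Seq(
    "08/02/2018", "11/2018", "11/01/2018", "12Mar2021", "Jan 30, 2018",
    "13.04.1999", "3April 2020", "next monday", "today", "next week"
).toDF("original_date")

val result = pipeline.fit(df).transform(df)
result.selectExpr(
    "date.result as normalized_date",
    "original_date",
    "date.metadata[0].normalized as metadata"
).show()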


+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+

DeIdentification

Deidentification is a critical technology that facilitates the use of structured or unstructured clinical text while protecting patient privacy and confidentiality. John Snow Labs has invested significant effort in developing methods and corpora for deidentifying clinical text, PDF, image, and DICOM data containing Protected Health Information (PHI), which relates to:

  • an individual’s past, present, or future physical or mental health or condition.
  • the provision of health care to the individual.
  • past, present, or future payment for the health care.

Protected health information includes many common identifiers (e.g., name, address, birth date, Social Security Number) when they can be associated with the health information.

Spark NLP for Healthcare offers several techniques and strategies for deidentification. The principal ones are:

Mask:

  • entity_labels: Mask with the entity type of that chunk. (default)
  • same_length_chars: Mask the deidentified entities with asterisks ( * ) of the same length, with brackets ( [ , ] ) on both ends.
  • fixed_length_chars: Mask the deidentified entities with a fixed number of asterisks ( * ). The length is set using the setFixedMaskLength() method.
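
As a rough illustration of how the three masking policies differ, the following pure-Python sketch applies each one to a single entity. It is not the library's implementation: `mask_entity` is a hypothetical helper, the `<LABEL>` rendering mirrors the masked outputs shown in the examples of this section, the short-entity rule follows the maskingPolicy parameter description, and the default fixed length of 4 is an assumption.

```python
def mask_entity(text: str, label: str, policy: str = "entity_labels",
                fixed_length: int = 4) -> str:
    """Mask one entity according to the chosen policy (illustrative only)."""
    if policy == "entity_labels":
        # Replace the entity with its label, e.g. <NAME>.
        return f"<{label}>"
    if policy == "same_length_chars":
        # Entities shorter than 3 characters get asterisks without brackets.
        if len(text) < 3:
            return "*" * len(text)
        # Brackets count toward the total length, so the mask is as long as the entity.
        return "[" + "*" * (len(text) - 2) + "]"
    if policy == "fixed_length_chars":
        return "*" * fixed_length
    raise ValueError(f"unknown masking policy: {policy!r}")

for policy in ["entity_labels", "same_length_chars", "fixed_length_chars"]:
    print(policy, "->", mask_entity("Smith", "NAME", policy))
```

Only same_length_chars preserves the length of the original text, which matters when downstream tools rely on character offsets.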

Obfuscation: replaces sensitive entities with random values of the same type.

Faker: lets the user draw on a set of fake entities held in memory by spark-nlp-internal.

There is also an advanced option that allows deidentifying with multiple modes at the same time (multi-mode deidentification). The annotator deidentifies input annotations of types DOCUMENT, TOKEN, and CHUNK by either masking or obfuscating the given CHUNKS.

Parameters:

  • ageRanges: (IntArrayParam) List of integers specifying the limits of the age groups to preserve during obfuscation.

  • blackList: (StringArrayParam) List of entities that will be ignored in the regex file.

  • consistentObfuscation: (BooleanParam) Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.

  • dateFormats: (StringArrayParam) Formats of the dates to displace.

  • dateTag: (Param[String]) Tag representing dates in the NER output (default: DATE).

  • dateToYear: (BooleanParam) true if dates must be converted to years, false otherwise.

  • days: (IntParam) Number of days by which dates are displaced during obfuscation.

  • fixedMaskLength: (IntParam) Length of the masking sequence used when the ‘fixed_length_chars’ masking policy is selected.

  • ignoreRegex: (BooleanParam) Whether to ignore the regex file loaded in the model.

  • isRandomDateDisplacement: (BooleanParam) Whether to displace date entities by a random number of days. If true, the random displacement is based on DeIdentificationParams.seed; if false, DeIdentificationParams.days is used. The default value is false.

  • language: (Param[String]) The language used to select the regex file and some faker entities: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), or ‘ro’ (Romanian).

  • mappingsColumn: (Param[String]) The mapping column that will return the annotation chunks with the fake entities.

  • maskingPolicy: (Param[String]) Selects the masking policy:
    • same_length_chars: Replace the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, with the total length of the masking sequence equal to the length of the original sequence. Example: Smith -> [***]. If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets will be returned.
    • entity_labels: Replace the values with the corresponding entity labels.
    • fixed_length_chars: Replace the obfuscated entity with a masking sequence composed of a fixed number of asterisks.

  • minYear: (IntParam) Minimum year to use when converting dates to years.

  • mode: (Param[String]) Mode for the anonymizer: ‘mask’ or ‘obfuscate’.

  • obfuscateDate: (BooleanParam) When mode == "obfuscate", whether to obfuscate dates or not.

  • obfuscateRefFile: (Param[String]) File with the terms to be used for obfuscation.

  • obfuscateRefSource: (Param[String]) The source used to obfuscate the entities. This does not apply to date entities.

  • outputAsDocument: (BooleanParam) Whether to return all sentences joined into a single document.

  • refFileFormat: (Param[String]) Format of the reference file for obfuscation. The default value is “csv”.

  • refSep: (Param[String]) Separator character for the CSV reference file for obfuscation. The default value is “#”.

  • regexOverride: (BooleanParam) If true, regex entities take priority; if false, NER entities take priority.

  • regexPatternsDictionary: (ExternalResourceParam) Dictionary with regular expression patterns that match protected entities. If the dictionary is not set, the default regex file is used.

  • region: (Param[String]) Usa or eu

  • returnEntityMappings: (BooleanParam) Whether to return the mapping column.

  • sameEntityThreshold: (DoubleParam) Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9). This does not apply to date entities.

  • sameLengthFormattedEntities: (StringArrayParam) List of formatted entities to generate the same length outputs as original ones during obfuscation.

  • seed: (IntParam) Seed used to select the entities in obfuscate mode. With a fixed seed, an execution can be repeated several times with the same output.

  • selectiveObfuscationModesPath: (Param[String]) Path to the JSON dictionary that contains the selective obfuscation modes.

  • unnormalizedDateMode: (Param[String]) The mode to use if the date is not formatted.

  • zipCodeTag: (Param[String]) Tag representing zip codes in the obfuscate reference file (default: ZIP).

  • metadataMaskingPolicy: (Param[String]) Options: ‘entity_labels’, ‘same_length_chars’, ‘fixed_length_chars’. If set, metadata includes the masked form of the document.
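
The effect of the days and dateFormats parameters can be sketched with the standard library alone. This is a hedged illustration, not the annotator's implementation: `displace_date` is a hypothetical helper, the formats are written in Python `strptime` notation rather than the Java-style patterns passed to setDateFormats, and leaving unparseable text unchanged is an assumption.

```python
from datetime import datetime, timedelta
from typing import List

def displace_date(text: str, days: int, formats: List[str]) -> str:
    """Shift a date string by a fixed number of days, keeping its original format."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # This format does not fit; try the next one.
        return (parsed + timedelta(days=days)).strftime(fmt)
    # Assumption: text that matches no format is returned unchanged.
    return text

# Python equivalents of the "MM/dd/yy" and "yyyy-MM-dd" patterns.
formats = ["%m/%d/%y", "%Y-%m-%d"]
for original in ["01/13/93", "2079-11-09"]:
    print(original, "->", displace_date(original, 5, formats))
```

With a displacement of 5 days, 01/13/93 becomes 01/18/93 and 2079-11-09 becomes 2079-11-14, the same shift visible in the Healthcare example output below. A fixed displacement keeps the intervals between dates in a document intact.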

To create a configured DeIdentificationModel, please see the example of DeIdentification.

Input Annotator Types: DOCUMENT, TOKEN, CHUNK

Output Annotator Type: DOCUMENT

Python API: DeIdentificationModel Scala API: DeIdentificationModel Notebook: DeIdentificationModelNotebook
Show Example

from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_sensitive_entities = medical.NerModel \
    .pretrained("ner_deid_enriched", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

nerConverter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

deIdentification = medical.DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models") \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy","yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("both")

data = spark.createDataFrame([
    ["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
    ]).toDF("text")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    clinical_sensitive_entities,
    nerConverter,
    deIdentification
])

result = pipeline.fit(data).transform(data)
result.select(F.expr("sentence.result as Input") ,F.expr("dei.result as deidentified")).show(truncate=100)
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|                                                                                Input|                                                                            deidentified|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|[# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09.]|[# 1610960 Date : 01/18/93 PCP : Vida Rigger , 27 years-old , Record date : 2079-11-14.]|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, medical, finance, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = legal.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert_embeddings")

fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    #.setLabelCasing("upper")

ner_converter =  medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "COMPANY"}) # Replace "ORG" entity as "COMPANY"

ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "bert_embeddings"]) \
    .setOutputCol("ner_finner")
    #.setLabelCasing("upper")

ner_converter_finner = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") 
    # .setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER

chunk_merge =  medical.ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

deidentification =  finance.DeIdentification() \
    .setInputCols(["sentence", "token", "deid_merged_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setIgnoreRegex(True)

# Pipeline
nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge,
      deidentification])

data = spark.createDataFrame([
    ["Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon"]
]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
result.select("sentence.result", "deidentified.result").show(truncate = False)

+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|result                                                                                         |result                                                                     |
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|[Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon]|[<PERSON>, <DATE>, is an American entrepreneur, <ROLE> and <ROLE> of <ORG>]|
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = legal.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    #.setLabelCasing("upper")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"})

ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_signers")
    #.setLabelCasing("upper")

ner_converter_signers = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_signers"]) \
    .setOutputCol("ner_signer_chunk")

chunk_merge = medical.ChunkMergeApproach()\
    .setInputCols("ner_signer_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

deidentification = legal.DeIdentification() \
    .setInputCols(["sentence", "token", "deid_merged_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setIgnoreRegex(True)

# Pipeline
nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge,
      deidentification])

data = spark.createDataFrame([["ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby,\
 supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter. THE MART.COM, INC.:                         I-ESCROW, INC.: By:Dominic J. Magliarditi               \
  By:Sanjay Bajaj Name: Dominic J. Magliarditi                Name: Sanjay Bajaj Title: President                            Title: VP Business Development Date: 6/21/2023 "]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
result.select("sentence.result", "deidentified.result").toPandas()

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence                                                                                                                                                                                                                                |deidentified                                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ENTIRE AGREEMENT.                                                                                                                                                                                                                       |<DOC>.                                                                                                                                                                                                                                  |
|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|
|THE MART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi                 By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/2023                           |<PARTY>.: <PARTY>.: By:Dominic <SIGNING_PERSON>                 By:Sanjay <SIGNING_PERSON> Name: <SIGNING_PERSON> Name: <SIGNING_PERSON> Title: <SIGNING_TITLE> Title: <SIGNING_TITLE> Date: 6/21/2023                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setUseAbbreviations(true)

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val clinicalSensitiveEntities = MedicalNerModel.pretrained("ner_deid_enriched", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val deIdentification = DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models")
  .setInputCols(Array("ner_chunk", "token", "sentence"))
  .setOutputCol("dei")
  .setMode("obfuscate")
  .setDateFormats(Array("MM/dd/yy", "yyyy-MM-dd"))
  .setObfuscateDate(true)
  .setDateTag("DATE")
  .setDays(5)
  .setObfuscateRefSource("both")

val data = Seq(
  "# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."
).toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  clinicalSensitiveEntities,
  nerConverter,
  deIdentification
))

val result = pipeline.fit(data).transform(data)

+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|                                                                                Input|                                                                            deidentified|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|[# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09.]|[# 1610960 Date : 01/18/93 PCP : Vida Rigger , 27 years-old , Record date : 2079-11-14.]|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val bertEmbeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("bert_embeddings")

val finNer = FinanceNerModel.pretrained("finner_deid", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setReplaceLabels(Map("ORG" -> "COMPANY"))

val nerFinner = FinanceNerModel.pretrained("finner_org_per_role_date", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "bert_embeddings"))
  .setOutputCol("ner_finner")

val nerConverterFinner = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_finner"))
  .setOutputCol("ner_finner_chunk")

val chunkMerge = new ChunkMergeApproach()
  .setInputCols(Array("ner_finner_chunk", "ner_chunk"))
  .setOutputCol("deid_merged_chunk")

val deidentification = new DeIdentification()
  .setInputCols(Array("sentence", "token", "deid_merged_chunk"))
  .setOutputCol("deidentified")
  .setMode("mask")
  .setIgnoreRegex(true)

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  bertEmbeddings,
  finNer,
  nerConverter,
  nerFinner,
  nerConverterFinner,
  chunkMerge,
  deidentification
))

val data = Seq(
  "Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon"
).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|result                                                                                         |result                                                                     |
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|[Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon]|[<PERSON>, <DATE>, is an American entrepreneur, <ROLE> and <ROLE> of <ORG>]|
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val legalNer = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
  .setLabelCasing("upper")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setReplaceLabels(Map("ALIAS" -> "PARTY"))

val nerSigners = LegalNerModel.pretrained("legner_signers", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_signers")
  .setLabelCasing("upper")

val nerConverterSigners = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_signers"))
  .setOutputCol("ner_signer_chunk")

val chunkMerge = new ChunkMergeApproach()
  .setInputCols(Array("ner_signer_chunk", "ner_chunk"))
  .setOutputCol("deid_merged_chunk")

val deidentification = new DeIdentification()
  .setInputCols(Array("sentence", "token", "deid_merged_chunk"))
  .setOutputCol("deidentified")
  .setMode("mask")
  .setIgnoreRegex(true)

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  legalNer,
  nerConverter,
  nerSigners,
  nerConverterSigners,
  chunkMerge,
  deidentification
))

val data = Seq(
  "ENTIRE AGREEMENT. This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter. THE MART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/2023"
).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence                                                                                                                                                                                                                                |deidentified                                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ENTIRE AGREEMENT.                                                                                                                                                                                                                       |<DOC>.                                                                                                                                                                                                                                  |
|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|
|THE MART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi                 By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/2023                           |<PARTY>.: <PARTY>.: By:Dominic <SIGNING_PERSON>                 By:Sanjay <SIGNING_PERSON> Name: <SIGNING_PERSON> Name: <SIGNING_PERSON> Title: <SIGNING_TITLE> Title: <SIGNING_TITLE> Date: 6/21/2023                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Contains all the methods for training a DeIdentificationModel. This annotator can obfuscate or mask entities that contain personal information. Entities can be defined with a file of regex patterns via setRegexPatternsDictionary, where each line maps an entity to a regex.

DATE \d{4}
AID \d{6,7}
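
As a rough illustration of this file format only (not the library's internal implementation), the sketch below parses the two pattern lines above and labels whitespace-separated tokens that a pattern matches in full. The helper names `parse_regex_patterns` and `label_tokens` are hypothetical, and splitting each line on the first space is an assumption based on the sample lines.

```python
import re

def parse_regex_patterns(lines):
    """Parse 'ENTITY regex' lines into a list of (entity, compiled regex) pairs."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the entity name and the regex are separated by the first space.
        entity, regex = line.split(" ", 1)
        patterns.append((entity, re.compile(regex)))
    return patterns

def label_tokens(patterns, tokens):
    """Return (token, entity) for every token fully matched by some pattern."""
    labeled = []
    for token in tokens:
        for entity, regex in patterns:
            if regex.fullmatch(token):
                labeled.append((token, entity))
                break  # first matching pattern wins in this sketch
    return labeled

patterns = parse_regex_patterns([r"DATE \d{4}", r"AID \d{6,7}"])
labeled = label_tokens(patterns, "Admitted in 2093 , AID 7194334".split())
print(labeled)  # [('2093', 'DATE'), ('7194334', 'AID')]
```

Matching whole tokens keeps the four-digit DATE pattern from firing inside the longer AID number; the actual annotator may resolve such overlaps differently.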

Additionally, obfuscation strings can be defined with setObfuscateRefFile, where each line maps a string to an entity. The format and separator can be specified with setRefFileFormat and setRefSep.

Dr. Gregory House#DOCTOR
01010101#MEDICALRECORD
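
To make the reference-file layout concrete, here is a minimal sketch that groups the replacement strings by entity, assuming `#` as the separator as in the sample lines above. The function `parse_obfuscation_lines` is a hypothetical helper written for this illustration and is not part of the library.

```python
from collections import defaultdict

def parse_obfuscation_lines(lines, sep="#"):
    """Group replacement strings by entity from 'string<sep>ENTITY' lines."""
    by_entity = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last separator so the replacement string may contain it.
        value, entity = line.rsplit(sep, 1)
        by_entity[entity].append(value)
    return dict(by_entity)

ref = parse_obfuscation_lines([
    "Dr. Gregory House#DOCTOR",
    "01010101#MEDICALRECORD",
])
print(ref)  # {'DOCTOR': ['Dr. Gregory House'], 'MEDICALRECORD': ['01010101']}
```

A lookup of this shape is one plausible way to pick a replacement for a detected entity when the obfuscation source is a file.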

Ideally, this annotator works in conjunction with demographic named entity recognizers, which can be trained using TextMatchers, RegexMatchers, DateMatchers, NerCRFs or NerDLs.

Input Annotator Types: DOCUMENT, TOKEN, CHUNK

Output Annotator Type: DOCUMENT

Python API: DeIdentification Scala API: DeIdentification

from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on the n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

#deid model with "entity_labels"
deid_entity_labels= medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_entity_label")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("entity_labels")

obs_lines = """Marvin MARSHALL#PATIENT
Hubert GROGAN#PATIENT
ALTHEA COLBURN#PATIENT
Kalil AMIN#PATIENT
Inci FOUNTAIN#PATIENT
Ekaterina Rosa#DOCTOR
Rudiger Chao#DOCTOR
COLLETTE KOHLER#NAME
Mufi HIGGS#NAME"""

with open('obfuscation.txt', 'w') as f:
    f.write(obs_lines)

# setObfuscateRefSource accepts "both", "file" or "faker"
obfuscation = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscation.txt')\
    .setObfuscateRefSource("both")\
    .setGenderAwareness(True)\
    .setLanguage("en")\
    .setUnnormalizedDateMode("obfuscate")  # other options: "mask" or "skip"

deidPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      deid_entity_labels,
      obfuscation
      ])


empty_data = spark.createDataFrame([[""]]).toDF("text")


model = deidPipeline.fit(empty_data)

#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.deid_entity_label.result,
                                     result.deidentified.result,
                                     )).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"),
              F.expr("cols['1']").alias("deid_entity_label"),
              F.expr("cols['2']").alias("deidentified"),
              ).toPandas()

+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
|                                                               sentence|                                      deid_entity_label|                                                     deidentified|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
|                          Record date : 2093-01-13 , David Hale , M.D .|                  Record date : <DATE> , <NAME> , M.D .|                  Record date : 2093-01-25 , Daryl Dieter , M.D .|
|              , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .|            , Name : <NAME> , MR # <ID> Date : <DATE> .|         , Name : Langston Papas , MR # 4784828 Date : 01/25/93 .|
|             PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .|PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|PCP : Roseann Lederer , 23 years-old , Record date : 2079-11-21 .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .|            <LOCATION> , <LOCATION> , Phone <CONTACT> .|    31 North St Joseph Ave , 400 Tickle St , Phone (59) 106-048 .|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Ner entities
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

nerConverter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

# Deidentification
# setRegexPatternsDictionary: file with custom regex patterns for custom entities
# setObfuscateRefFile: file with custom obfuscator names for the entities
deIdentification = finance.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
    .setRefFileFormat("csv") \
    .setRefSep("#") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("file")

# Pipeline
pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    nerConverter,
    deIdentification
])
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Ner entities
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

nerConverter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

# Deidentification
# setRegexPatternsDictionary: file with custom regex patterns for custom entities
# setObfuscateRefFile: file with custom obfuscator names for the entities
deIdentification = legal.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
    .setRefFileFormat("csv") \
    .setRefSep("#") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("file")

# Pipeline
pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    nerConverter,
    deIdentification
])

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

// Clinical word embeddings trained on PubMED dataset
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

// NER model trained on the n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

//deid model with "entity_labels"
val deid_entity_labels= new DeIdentification()
    .setInputCols(Array("ner_chunk", "token", "sentence"))
    .setOutputCol("deid_entity_label")
    .setMode("mask")
    .setReturnEntityMappings(true)
    .setMaskingPolicy("entity_labels")
    
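// Obfuscation reference entries, one "string#ENTITY" mapping per line
val obs_lines = """Marvin MARSHALL#PATIENT
Hubert GROGAN#PATIENT
ALTHEA COLBURN#PATIENT
Kalil AMIN#PATIENT
Inci FOUNTAIN#PATIENT
Ekaterina Rosa#DOCTOR
Rudiger Chao#DOCTOR
COLLETTE KOHLER#NAME
Mufi HIGGS#NAME"""

// Write the entries to the file read by setObfuscateRefFile below
new java.io.PrintWriter("obfuscation.txt") { write(obs_lines); close() }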

val obfuscation =  new DeIdentification()
    .setInputCols(Array("ner_chunk", "token", "sentence"))
    .setOutputCol("deidentified")
    .setMode("obfuscate")
    .setObfuscateDate(true)
    .setObfuscateRefFile("obfuscation.txt")
    .setObfuscateRefSource("both")       //file or faker  
    .setGenderAwareness(true)
    .setLanguage("en")
    .setUnnormalizedDateMode("obfuscate") //mask or skip


val deidPipeline = new Pipeline().setStages(Array(
                                                  documentAssembler,
                                                  sentenceDetector,
                                                  tokenizer,
                                                  word_embeddings,
                                                  clinical_ner,
                                                  ner_converter,
                                                  deid_entity_labels,
                                                  obfuscation
                                                ))

//sample data

val text =
          """
          Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old ,
          Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
          """

val data = Seq(text).toDF("text")

val result = deidPipeline.fit(data).transform(data)

+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
|                                                               sentence|                                      deid_entity_label|                                                     deidentified|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
|                          Record date : 2093-01-13 , David Hale , M.D .|                  Record date : <DATE> , <NAME> , M.D .|                  Record date : 2093-01-25 , Daryl Dieter , M.D .|
|              , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .|            , Name : <NAME> , MR # <ID> Date : <DATE> .|         , Name : Langston Papas , MR # 4784828 Date : 01/25/93 .|
|             PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .|PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|PCP : Roseann Lederer , 23 years-old , Record date : 2079-11-21 .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .|            <LOCATION> , <LOCATION> , Phone <CONTACT> .|    31 North St Joseph Ave , 400 Tickle St , Phone (59) 106-048 .|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
     .setInputCol("text")
     .setOutputCol("document")

 val sentenceDetector = new SentenceDetector()
     .setInputCols("document")
     .setOutputCol("sentence")
     .setUseAbbreviations(true)

 val tokenizer = new Tokenizer()
     .setInputCols("sentence")
     .setOutputCol("token")

 val embeddings = WordEmbeddingsModel
     .pretrained("embeddings_clinical", "en", "clinical/models")
     .setInputCols(Array("sentence", "token"))
     .setOutputCol("embeddings")

// Ner entities
val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
     .setInputCols(Array("sentence", "token", "embeddings"))
     .setOutputCol("ner")

 val nerConverter = new NerConverter()
     .setInputCols(Array("sentence", "token", "ner"))
     .setOutputCol("ner_chunk")

// Deidentification
val deIdentification = new DeIdentification()
     .setInputCols(Array("ner_chunk", "token", "sentence"))
     .setOutputCol("dei")
     // file with custom regex patterns for custom entities
     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
     // file with custom obfuscator names for the entities
     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
     .setRefFileFormat("csv")
     .setRefSep("#")
     .setMode("obfuscate")
     .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
     .setObfuscateDate(true)
     .setDateTag("DATE")
     .setDays(5)
     .setObfuscateRefSource("file")

// Pipeline
val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  nerConverter,
  deIdentification
))
import spark.implicits._

val documentAssembler = new DocumentAssembler()
     .setInputCol("text")
     .setOutputCol("document")

 val sentenceDetector = new SentenceDetector()
     .setInputCols("document")
     .setOutputCol("sentence")
     .setUseAbbreviations(true)

 val tokenizer = new Tokenizer()
     .setInputCols("sentence")
     .setOutputCol("token")

 val embeddings = WordEmbeddingsModel
     .pretrained("embeddings_clinical", "en", "clinical/models")
     .setInputCols(Array("sentence", "token"))
     .setOutputCol("embeddings")

// Ner entities
val ner_model = LegalNerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

 val nerConverter = new NerConverter()
     .setInputCols(Array("sentence", "token", "ner"))
     .setOutputCol("ner_chunk")

// Deidentification
val deIdentification = new DeIdentification()
     .setInputCols(Array("ner_chunk", "token", "sentence"))
     .setOutputCol("dei")
     // file with custom regex patterns for custom entities
     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
     // file with custom obfuscator names for the entities
     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
     .setRefFileFormat("csv")
     .setRefSep("#")
     .setMode("obfuscate")
     .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
     .setObfuscateDate(true)
     .setDateTag("DATE")
     .setDays(5)
     .setObfuscateRefSource("file")

// Pipeline

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  nerConverter,
  deIdentification
))

DistilBertForSequenceClassification

Model

DistilBertForSequenceClassification can load DistilBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks.

Parameters:

  • batchSize: Size of every batch (default: 8).

  • coalesceSentences: Instead of one class per sentence (if inputCols is sentence), output one class per document by averaging the probabilities over all sentences (default: False).

  • maxSentenceLength: Max sentence length to process (default: 128).

  • caseSensitive: Whether to ignore case in tokens for embeddings matching (default: True).
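
The averaging that coalesceSentences describes can be pictured with a small plain-Python sketch. It is only an illustration of averaging per-class probabilities over sentences under assumed inputs; the function `coalesce_sentences` and the probability values are hypothetical and do not reflect the annotator's internals.

```python
def coalesce_sentences(sentence_probs):
    """Average per-class probabilities over sentences; return (label, averages)."""
    labels = sentence_probs[0].keys()
    n = len(sentence_probs)
    averaged = {
        label: sum(probs[label] for probs in sentence_probs) / n
        for label in labels
    }
    # The document-level class is the one with the highest averaged probability.
    best = max(averaged, key=averaged.get)
    return best, averaged

# Made-up per-sentence probabilities for a three-sentence document.
probs = [
    {"True": 0.9, "False": 0.1},
    {"True": 0.4, "False": 0.6},
    {"True": 0.8, "False": 0.2},
]
label, averaged = coalesce_sentences(probs)
print(label)  # True
```

Note that the second sentence alone would be classified as False, while the document-level result is True because the averaged probability for True is higher.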

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: DistilBertForSequenceClassification Scala API: DistilBertForSequenceClassification

from johnsnowlabs import nlp, medical
 
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("classes")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame([["I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."],
                              ["Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("text", "classes.result").show(truncate=100)

| text                                                                                           | result |
|------------------------------------------------------------------------------------------------|-------|
| I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numb... | [True] |
| Religare Capital Ranbaxy has been accepting approval for Diovan since 2012 | [False] |

import spark.implicits._

val document_assembler = new DocumentAssembler() 
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val sequenceClassifier = MedicalDistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("classes")

val pipeline =  new Pipeline().setStages(Array(
    document_assembler, 
    tokenizer, 
    sequenceClassifier))

val data = Seq(
    "I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication.",
    "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"
).toDF("text")

val result = pipeline.fit(data).transform(data)

// Show up to 20 rows, truncating each cell to 100 characters
result.select("text", "classes.result").show(20, 100)

| text                                                                                           | result |
|------------------------------------------------------------------------------------------------|-------|
| I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numb... | [True] |
| Religare Capital Ranbaxy has been accepting approval for Diovan since 2012 | [False] |

Doc2ChunkInternal

Model

Converts DOCUMENT and TOKEN annotations into CHUNK annotations using the contents of chunkCol. The chunk text must be contained within the input DOCUMENT. The chunk column may be either StringType or ArrayType[StringType] (enabled with setIsArray). Useful for annotators that require a CHUNK type input.
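The containment requirement can be pictured with a small plain-Python sketch. It is not the annotator's implementation: locate_chunks is a hypothetical helper, and both the inclusive end offset and the choice to skip chunks that are missing from the document are assumptions made for illustration.

```python
def locate_chunks(document, chunks):
    """Return (chunk, begin, end) for each chunk string found in the document.

    Offsets are character positions with an inclusive end (an assumption of
    this sketch). Chunks that do not occur in the document are skipped.
    """
    located = []
    for chunk in chunks:
        begin = document.find(chunk)
        if begin == -1:
            continue  # chunk text is not contained in the document
        located.append((chunk, begin, begin + len(chunk) - 1))
    return located

text = "Spark NLP is an open-source text processing library for advanced natural language processing."
targets = ["Spark NLP", "text processing library", "natural language processing", "not in text"]

found = locate_chunks(text, targets)
for item in found:
    print(item)
```

Only the three strings that actually occur in the text produce a result; the last target is dropped because it cannot be aligned with the document.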

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the output column that is generated (CHUNK type for this annotator). Only one column can be specified.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().

For more extended examples on document pre-processing see the Spark NLP Workshop.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Python API: Doc2ChunkInternal Scala API: Doc2ChunkInternal Notebook: Doc2ChunkInternalNotebook

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

chunkAssembler = medical.Doc2ChunkInternal()\
    .setInputCols("document", "token")\
    .setChunkCol("target")\
    .setOutputCol("chunk")\
    .setIsArray(True)

pipeline = nlp.Pipeline().setStages([
                                    documentAssembler,
                                    tokenizer, 
                                    chunkAssembler])

data = spark.createDataFrame(
    [
        [
            "Spark NLP is an open-source text processing library for advanced natural language processing.",
            ["Spark NLP", "text processing library", "natural language processing"],
        ]
    ]
).toDF("text", "target")


result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+

from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

chunkAssembler = finance.Doc2ChunkInternal()\
    .setInputCols("document", "token")\
    .setChunkCol("target")\
    .setOutputCol("chunk")\
    .setIsArray(True)

pipeline = nlp.Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler])

data = spark.createDataFrame(
    [
        [
            "Spark NLP is an open-source text processing library for advanced natural language processing.",
            ["Spark NLP", "text processing library", "natural language processing"],
        ]
    ]
).toDF("text", "target")


result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+

from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

chunkAssembler = legal.Doc2ChunkInternal()\
    .setInputCols("document", "token")\
    .setChunkCol("target")\
    .setOutputCol("chunk")\
    .setIsArray(True)

pipeline = nlp.Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler])

data = spark.createDataFrame(
    [
        [
            "Spark NLP is an open-source text processing library for advanced natural language processing.",
            ["Spark NLP", "text processing library", "natural language processing"],
        ]
    ]
).toDF("text", "target")


result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val chunkAssembler = new Doc2ChunkInternal()
    .setInputCols(Array("document", "token"))
    .setChunkCol("target")
    .setOutputCol("chunk")
    .setIsArray(true)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    tokenizer, 
    chunkAssembler
))

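// "target" must be an array column because setIsArray(true) is used
val data = Seq(
    ("Spark NLP is an open-source text processing library for advanced natural language processing.",
     Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")

val result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(false)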

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+

DocMapper

DocMapper uses the text representation of document annotations to map clinical codes to other codes or relevant information.

Parameters:

  • setRels (List[str]): Relations that we are going to use to map the document

  • setLowerCase (Boolean): Set if we want to map the documents in lower case or not (Default: True)

  • setAllowMultiTokenChunk (Boolean): Whether to skip relations with multitokens (Default: True)

  • setMultivaluesRelations (Boolean): Whether to return all values in a relation together or separately (Default: False)

  • setDoExceptionHandling: If set to True, the annotator tries to process the data as usual, and if exception-causing data (e.g. a corrupted record or document) is passed to the annotator, a warning containing the exception message is emitted.
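The exception-handling option follows a common pattern that can be sketched in plain Python. This is only an illustration of the idea, not the annotator's code: process_all, to_upper, and the None placeholder stored for a failed record are all assumptions.

```python
import warnings

def process_all(records, annotate, do_exception_handling=True):
    """Apply annotate to every record, optionally turning failures into warnings."""
    results = []
    for record in records:
        try:
            results.append(annotate(record))
        except Exception as exc:
            if not do_exception_handling:
                raise  # propagate the failure when handling is disabled
            # Emit a warning that carries the exception message and keep going.
            warnings.warn(f"skipped record {record!r}: {exc}")
            results.append(None)  # placeholder value, an assumption of this sketch
    return results

def to_upper(text):
    return text.upper()  # raises AttributeError for non-string input

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    output = process_all(["Dermovate", None, "Aspagin"], to_upper)

print(output)
print([str(w.message) for w in caught])
```

With handling enabled, the corrupted record produces a single warning and the remaining records are still processed.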

Input Annotator Types: DOCUMENT

Output Annotator Type: LABEL_DEPENDENCY

Python API: DocMapperModel Scala API: DocMapperModel Notebook: DocMapperModelNotebook

from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F

# DocMapper pipeline
document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

#drug_action_treatment_mapper 
docMapper = medical.DocMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"])

mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])

test_data = spark.createDataFrame([["Dermovate"], ["Aspagin"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

# Show results
res.select(F.explode(F.arrays_zip(res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk|mapping_result        |relation |all_mappings                                                                                                                                                                                                           |
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Dermovate|anti-inflammatory     |action   |corticosteroids::: dermatological preparations:::very strong                                                                                                                                                           |
|Dermovate|lupus                 |treatment|discoid lupus erythematosus:::empeines:::psoriasis:::eczema                                                                                                                                                            |
|Aspagin  |analgesic             |action   |anti-inflammatory:::antipyretic                                                                                                                                                                                        |
|Aspagin  |ankylosing spondylitis|treatment|arthralgia:::pain:::bursitis:::headache:::migraine:::myositis:::neuralgia:::osteoarthritis:::gout:::rheumatoid arthritis:::spondylitis:::spondyloarthritis:::tendinitis:::tenosynovitis:::crush injury:::golfer's elbow|
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

// DocMapper pipeline
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// drug_action_treatment_mapper 
val docMapper = DocMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("mappings")
    .setRels(Array("action", "treatment"))

val mapperPipeline = new Pipeline().setStages(Array(
    document_assembler,
    docMapper))


val test_data = Seq("Dermovate", "Aspagin").toDF("text")

val res = mapperPipeline.fit(test_data).transform(test_data)

// Show results

+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk|mapping_result        |relation |all_mappings                                                                                                                                                                                                           |
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Dermovate|anti-inflammatory     |action   |corticosteroids::: dermatological preparations:::very strong                                                                                                                                                           |
|Dermovate|lupus                 |treatment|discoid lupus erythematosus:::empeines:::psoriasis:::eczema                                                                                                                                                            |
|Aspagin  |analgesic             |action   |anti-inflammatory:::antipyretic                                                                                                                                                                                        |
|Aspagin  |ankylosing spondylitis|treatment|arthralgia:::pain:::bursitis:::headache:::migraine:::myositis:::neuralgia:::osteoarthritis:::gout:::rheumatoid arthritis:::spondylitis:::spondyloarthritis:::tendinitis:::tenosynovitis:::crush injury:::golfer's elbow|
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

DocMapperApproach builds a DocMapper from a JSON dictionary of mappings. It can be used to map short strings directly from the DocumentAssembler output, without any intermediate annotator to convert the strings to the CHUNK type that ChunkMapperModel expects.

Parameters:

  • setDictionary (Str): Path to the JSON dictionary file that contains the mappings

  • setRels (List[str]): Relations that we are going to use to map the document

  • setLowerCase (Boolean): Set if we want to map the documents in lower case or not (Default: True)

  • setAllowMultiTokenChunk (Boolean): Whether to skip relations with multitokens (Default: True)

  • setMultivaluesRelations (Boolean): Whether to return all values in a relation together or separately (Default: False)

  • setDoExceptionHandling: If set to True, the annotator tries to process the data as usual, and if exception-causing data (e.g. a corrupted record or document) is passed to the annotator, a warning containing the exception message is emitted.
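To make the dictionary format and the role of the relations more concrete, here is a plain-Python sketch of a lookup over the same JSON structure used in the example of this section. It is an approximation and not the annotator's code: map_document is a hypothetical helper, the row layout (first value as the mapping result, the remaining values joined with ":::") mirrors the result table of the example, and lowercasing both the key and the input is an assumption about what the lower-case option does.

```python
import json

# Same structure as the sample_drug.json dictionary used in the example.
SAMPLE = json.loads("""
{
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {"key": "action", "values": ["hypoglycemic", "Drugs Used In Diabetes"]},
        {"key": "treatment", "values": ["diabetes", "t2dm"]}
      ]
    }
  ]
}
""")

def map_document(text, dictionary, rels, lower_case=True):
    """Look up a document string and return one row per requested relation."""
    lookup = text.lower() if lower_case else text
    rows = []
    for entry in dictionary["mappings"]:
        key = entry["key"].lower() if lower_case else entry["key"]
        if key != lookup:
            continue
        for relation in entry["relations"]:
            if relation["key"] in rels and relation["values"]:
                first, *rest = relation["values"]
                rows.append({
                    "document": text,
                    "mapping_result": first,
                    "relation": relation["key"],
                    "all_mappings": ":::".join(rest),
                })
    return rows

rows = map_document("Metformin", SAMPLE, rels=["action"])
print(rows)
```

Requesting both "action" and "treatment" in rels would return two rows for the same document, one per relation.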

Input Annotator Types: DOCUMENT

Output Annotator Type: LABEL_DEPENDENCY

Python API: DocMapperApproach Scala API: DocMapperApproach Notebook: DocMapperApproachNotebook

from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F

data_set= {
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }
      ]
    }
  ]
}

import json
with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

chunkerMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("./sample_drug.json")\
      .setRels(["action"])

pipeline = nlp.Pipeline().setStages([document_assembler,
                                     chunkerMapper])

test_data = spark.createDataFrame([["metformin"]]).toDF("text")

res = pipeline.fit(test_data).transform(test_data)


## Results
res.select(F.explode(F.arrays_zip(res.mappings.result,
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("document"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+---------+--------------+--------+----------------------+
|document |mapping_result|relation|all_mappings          |
+---------+--------------+--------+----------------------+
|metformin|hypoglycemic  |action  |Drugs Used In Diabetes|
+---------+--------------+--------+----------------------+

import spark.implicits._

/* sample_drug.json file
{
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }
      ]
    }
  ]
}
*/

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document") 

val chunkerMapper = new DocMapperApproach()
  .setInputCols("document")
  .setOutputCol("mappings")
  .setDictionary("./sample_drug.json")
  .setRels(Array("action"))

val pipeline = new Pipeline().setStages(Array(document_assembler, chunkerMapper))

val test_data = Seq("metformin").toDF("text") 

val res = pipeline.fit(test_data).transform(test_data)


// Results 

+---------+--------------+--------+----------------------+
|document |mapping_result|relation|all_mappings          |
+---------+--------------+--------+----------------------+
|metformin|hypoglycemic  |action  |Drugs Used In Diabetes|
+---------+--------------+--------+----------------------+

DocumentFiltererByClassifier

Model

DocumentFiltererByClassifier filters documents based on the results produced by classifier annotators. It uses a white list and a black list: the white list contains the classifier results that are allowed to pass through the filter, and the black list contains the results that are blocked. Filtering is case sensitive by default; setting caseSensitive to False makes it case-insensitive. This makes the annotator useful for sorting and managing documents by classifier outcome.

Parameters:

  • whiteList: (list) If defined, list of entities to process. The rest will be ignored.

  • caseSensitive: (bool) Determines whether the definitions of the white listed entities are case sensitive.
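The white list / black list logic can be sketched in plain Python. This is an illustration under assumptions, not the annotator's implementation: filter_documents is a hypothetical helper operating on (text, label) pairs, and letting the black list take precedence over the white list is a choice made for this sketch.

```python
def filter_documents(docs, white_list=None, black_list=None, case_sensitive=True):
    """Keep (text, label) pairs allowed by the white list and not in the black list."""
    def norm(label):
        return label if case_sensitive else label.lower()

    allowed = {norm(x) for x in white_list} if white_list else None
    blocked = {norm(x) for x in black_list} if black_list else set()

    kept = []
    for text, label in docs:
        key = norm(label)
        if key in blocked:
            continue  # black-listed results never pass
        if allowed is not None and key not in allowed:
            continue  # a white list is defined and this label is not on it
        kept.append((text, label))
    return kept

docs = [
    ("PROCEDURES: Aortic valve replacement ...", "Procedures"),
    ("DIAGNOSIS: Aortic valve stenosis ...", "Diagnostic and Laboratory Data"),
    ("Medical Specialty: ...", "History"),
]

kept = filter_documents(docs, white_list=["diagnostic and laboratory data"], case_sensitive=False)
print(kept)
```

With case_sensitive left at True, the lowercase white-list entry above would match nothing, which is why the example disables case sensitivity.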

Input Annotator Types: DOCUMENT, CATEGORY

Output Annotator Type: DOCUMENT

Notebook: DocumentFiltererByClassifierNotebook

example = """Medical Specialty:
Cardiovascular / Pulmonary

Sample Name: Aortic Valve Replacement

Description: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
(Medical Transcription Sample Report)

DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with congestive heart failure. The patient has diabetes and is morbidly obese.

PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.

ANESTHESIA: General endotracheal

INCISION: Median sternotomy

INDICATIONS: The patient presented with severe congestive heart failure associated with the patient's severe diabetes. The patient was found to have moderately stenotic aortic valve. In addition, The patient had significant coronary artery disease consisting of a chronically occluded right coronary artery but a very important large obtuse marginal artery coming off as the main circumflex system. The patient also has a left anterior descending artery which has moderate disease and this supplies quite a bit of collateral to the patient's right system. It was decided to perform a valve replacement as well as coronary artery bypass grafting procedure.

FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve leaflet is calcified and a severe restrictive leaflet motion. It is a tricuspid type of valve. The coronary artery consists of a large left anterior descending artery which is associated with 60% stenosis but a large obtuse marginal artery which has a tight proximal stenosis.

The radial artery was used for the left anterior descending artery. Flow was excellent. Looking at the targets in the posterior descending artery territory, there did not appear to be any large branches. On the angiogram these vessels appeared to be quite small. Because this is a chronically occluded vessel and the patient has limited conduit due to the patient's massive obesity, attempt to bypass to this area was not undertaken. The patient was brought to the operating room

PROCEDURE: The patient was brought to the operating room and placed in supine position. A median sternotomy incision was carried out and conduits were taken from the left arm as well as the right thigh. The patient weighs nearly three hundred pounds. There was concern as to taking down the left internal mammary artery. Because the radial artery appeared to be a good conduit The patient would have arterial graft to the left anterior descending artery territory. The patient was cannulated after the aorta and atrium were exposed and full heparinization.

The patient went on cardiopulmonary bypass and the aortic cross-clamp was applied Cardioplegia was delivered through the coronary sinuses in a retrograde manner. The patient was cooled to 32 degrees. Iced slush was applied to the heart. The aortic valve was then exposed through the aortic root by transverse incision. The valve leaflets were removed and the #23 St. Jude mechanical valve was secured into position by circumferential pledgeted sutures. At this point, aortotomy was closed.

The first obtuse marginal artery was a very large target and the vein graft to this target indeed produced an excellent amount of flow. Proximal anastomosis was then carried out to the foot of the aorta. The left anterior descending artery does not have severe disease but is also a very good target and the radial artery was anastomosed to this target in an end-to-side manner. The two proximal anastomoses were then carried out to the root of the aorta.

The patient came off cardiopulmonary bypass after aortic cross-clamp was released. The patient was adequately warmed. Protamine was given without adverse effect. Sternal closure was then done using wires. The subcutaneous layers were closed using Vicryl suture. The skin was approximated using staples.
"""

df = spark.createDataFrame([[example]]).toDF("text")

from johnsnowlabs import nlp, medical 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

document_splitter = medical.InternalDocumentSplitter() \
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("recursive")\
    .setChunkSize(100)\
    .setChunkOverlap(3)\
    .setExplodeSplits(True)\
    .setPatternsAreRegex(False)\
    .setSplitPatterns(["\n\n", "\n"])\
    .setKeepSeparators(False)\
    .setTrimWhitespace(True)
    #.setEnableSentenceIncrement(False)

sequenceClassifier = medical.BertForSequenceClassification\
    .pretrained('bert_sequence_classifier_clinical_sections', 'en', 'clinical/models')\
    .setInputCols(["splits", "token"])\
    .setOutputCol("prediction")\
    .setCaseSensitive(False)

document_filterer = medical.DocumentFiltererByClassifier()\
    .setInputCols(["splits", "prediction"])\
    .setOutputCol("filteredDocuments")\
    .setWhiteList(["Diagnostic and Laboratory Data"])\
    .setCaseSensitive(False)


pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    document_splitter,
    sequenceClassifier,
    #document_filterer
])

result = pipeline.fit(df).transform(df)

# before filterer result

result.selectExpr("splits.result[0] as splits",
                  "prediction.result[0] as classes"
                  ).show(truncate=80)

+--------------------------------------------------------------------------------+------------------------------+
|                                                                          splits|                       classes|
+--------------------------------------------------------------------------------+------------------------------+
|Medical Specialty:\nCardiovascular / Pulmonary\n\nSample Name: Aortic Valve R...|                       History|
|Description: Aortic valve replacement using a mechanical valve and two-vessel...|Complications and Risk Factors|
|                                           (Medical Transcription Sample Report)|Complications and Risk Factors|
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel ...|                    Procedures|
|                 ANESTHESIA: General endotracheal\n\nINCISION: Median sternotomy|                    Procedures|
|INDICATIONS: The patient presented with severe congestive heart failure assoc...|     Consultation and Referral|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|PROCEDURE: The patient was brought to the operating room and placed in supine...|                    Procedures|
|The patient went on cardiopulmonary bypass and the aortic cross-clamp was app...|                    Procedures|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
|The patient came off cardiopulmonary bypass after aortic cross-clamp was rele...|                    Procedures|
+--------------------------------------------------------------------------------+------------------------------+


# after filterer result

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    document_splitter,
    sequenceClassifier,
    document_filterer
])

result = pipeline.fit(df).transform(df)
from pyspark.sql.functions import col
result.selectExpr("filteredDocuments.result[0] as splits",
                  "filteredDocuments.metadata[0].class_label as classes")\
                  .filter(col("classes").isNotNull()).show(truncate=80)

+--------------------------------------------------------------------------------+------------------------------+
|                                                                          splits|                       classes|
+--------------------------------------------------------------------------------+------------------------------+
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
+--------------------------------------------------------------------------------+------------------------------+

import spark.implicits._
 
val example = """Medical Specialty:
Cardiovascular / Pulmonary
Sample Name: Aortic Valve Replacement
Description: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
(Medical Transcription Sample Report)
DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with congestive heart failure. The patient has diabetes and is morbidly obese.
PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
ANESTHESIA: General endotracheal
INCISION: Median sternotomy
INDICATIONS: The patient presented with severe congestive heart failure associated with the patient's severe diabetes. The patient was found to have moderately stenotic aortic valve. In addition, The patient had significant coronary artery disease consisting of a chronically occluded right coronary artery but a very important large obtuse marginal artery coming off as the main circumflex system. The patient also has a left anterior descending artery which has moderate disease and this supplies quite a bit of collateral to the patient's right system. It was decided to perform a valve replacement as well as coronary artery bypass grafting procedure.
FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve leaflet is calcified and a severe restrictive leaflet motion. It is a tricuspid type of valve. The coronary artery consists of a large left anterior descending artery which is associated with 60% stenosis but a large obtuse marginal artery which has a tight proximal stenosis.
The radial artery was used for the left anterior descending artery. Flow was excellent. Looking at the targets in the posterior descending artery territory, there did not appear to be any large branches. On the angiogram these vessels appeared to be quite small. Because this is a chronically occluded vessel and the patient has limited conduit due to the patient's massive obesity, attempt to bypass to this area was not undertaken. The patient was brought to the operating room
PROCEDURE: The patient was brought to the operating room and placed in supine position. A median sternotomy incision was carried out and conduits were taken from the left arm as well as the right thigh. The patient weighs nearly three hundred pounds. There was concern as to taking down the left internal mammary artery. Because the radial artery appeared to be a good conduit The patient would have arterial graft to the left anterior descending artery territory. The patient was cannulated after the aorta and atrium were exposed and full heparinization.
The patient went on cardiopulmonary bypass and the aortic cross-clamp was applied Cardioplegia was delivered through the coronary sinuses in a retrograde manner. The patient was cooled to 32 degrees. Iced slush was applied to the heart. The aortic valve was then exposed through the aortic root by transverse incision. The valve leaflets were removed and the #23 St. Jude mechanical valve was secured into position by circumferential pledgeted sutures. At this point, aortotomy was closed.
The first obtuse marginal artery was a very large target and the vein graft to this target indeed produced an excellent amount of flow. Proximal anastomosis was then carried out to the foot of the aorta. The left anterior descending artery does not have severe disease but is also a very good target and the radial artery was anastomosed to this target in an end-to-side manner. The two proximal anastomoses were then carried out to the root of the aorta.
The patient came off cardiopulmonary bypass after aortic cross-clamp was released. The patient was adequately warmed. Protamine was given without adverse effect. Sternal closure was then done using wires. The subcutaneous layers were closed using Vicryl suture. The skin was approximated using staples.
"""

val df = Seq(example).toDF("text") 

val document_assembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("document")) 
  .setOutputCol("token")

val document_splitter = new InternalDocumentSplitter()
  .setInputCols("document") 
  .setOutputCol("splits") 
  .setSplitMode("recursive") 
  .setChunkSize(100) 
  .setChunkOverlap(3) 
  .setExplodeSplits(true) 
  .setPatternsAreRegex(false) 
  .setSplitPatterns(Array(" "," ")) 
  .setKeepSeparators(false) 
  .setTrimWhitespace(true) 
  //.setEnableSentenceIncrement(false) 

val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_clinical_sections","en","clinical/models") 
  .setInputCols(Array("splits","token")) 
  .setOutputCol("prediction") 
  .setCaseSensitive(false) 

val document_filterer = new DocumentFiltererByClassifier()
  .setInputCols(Array("splits","prediction")) 
  .setOutputCol("filteredDocuments") 
  .setWhiteList(Array("Diagnostic and Laboratory Data")) 
  .setCaseSensitive(false) 

// The filterer stage is left out here to show the classifier output before filtering
val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  tokenizer,
  document_splitter,
  sequenceClassifier
  // document_filterer
))

val result = pipeline.fit(df).transform(df) 

// before filterer result 

+--------------------------------------------------------------------------------+------------------------------+
|                                                                          splits|                       classes|
+--------------------------------------------------------------------------------+------------------------------+
|Medical Specialty:\nCardiovascular / Pulmonary\n\nSample Name: Aortic Valve R...|                       History|
|Description: Aortic valve replacement using a mechanical valve and two-vessel...|Complications and Risk Factors|
|                                           (Medical Transcription Sample Report)|Complications and Risk Factors|
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel ...|                    Procedures|
|                 ANESTHESIA: General endotracheal\n\nINCISION: Median sternotomy|                    Procedures|
|INDICATIONS: The patient presented with severe congestive heart failure assoc...|     Consultation and Referral|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|PROCEDURE: The patient was brought to the operating room and placed in supine...|                    Procedures|
|The patient went on cardiopulmonary bypass and the aortic cross-clamp was app...|                    Procedures|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
|The patient came off cardiopulmonary bypass after aortic cross-clamp was rele...|                    Procedures|
+--------------------------------------------------------------------------------+------------------------------+


// after filterer result

val pipeline = new Pipeline().setStages(Array( 
                                              document_assembler, 
                                              tokenizer, 
                                              document_splitter, 
                                              sequenceClassifier, 
                                              document_filterer )) 

val result = pipeline.fit(df).transform(df)

+--------------------------------------------------------------------------------+------------------------------+
|                                                                          splits|                       classes|
+--------------------------------------------------------------------------------+------------------------------+
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
+--------------------------------------------------------------------------------+------------------------------+
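
Conceptually, the step from the "before" table to the "after" table keeps only the splits whose predicted label is on the white list. The plain-Python sketch below illustrates that idea outside Spark. It is not the annotator's implementation: the function name `filter_by_label` and the `(split, label)` tuple layout are hypothetical, chosen only for this illustration.

```python
from typing import Iterable, List, Tuple


def filter_by_label(
    rows: Iterable[Tuple[str, str]],
    white_list: Iterable[str],
    case_sensitive: bool = False,
) -> List[Tuple[str, str]]:
    """Keep (split, label) pairs whose label is on the white list."""
    # Lower-case both sides when the comparison is case-insensitive.
    norm = (lambda s: s) if case_sensitive else str.lower
    allowed = {norm(label) for label in white_list}
    return [(text, label) for text, label in rows if norm(label) in allowed]


# A few (split, predicted label) pairs abbreviated from the table above.
rows = [
    ("DIAGNOSIS: Aortic valve stenosis ...", "Diagnostic and Laboratory Data"),
    ("PROCEDURES: Aortic valve replacement ...", "Procedures"),
    ("FINDINGS: The left ventricle ...", "Diagnostic and Laboratory Data"),
]

kept = filter_by_label(rows, ["diagnostic and laboratory data"])
for text, label in kept:
    print(label, "|", text)
```

With the case-insensitive default, the lower-cased white-list entry still matches the two "Diagnostic and Laboratory Data" rows; with `case_sensitive=True` it would match none.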

DocumentHashCoder

Model

This annotator shifts dates in a column of DOCUMENT type according to the hash code of another column. It hashes the specified column and creates a new document column containing the day-shift information. The DeIdentification annotator then deidentifies the document using the shifted date information.

If the specified column contains strings that can be parsed as integers, those numbers are used as the date shift instead.
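
The idea of a deterministic, per-patient date shift can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's algorithm: the helper names `day_shift_for` and `shift_date`, the use of SHA-256, and the default `range_days` and `seed` values are all hypothetical.

```python
import hashlib
from datetime import datetime, timedelta


def day_shift_for(key: str, range_days: int = 60, seed: int = 0) -> int:
    """Return a repeatable day shift for a key (hypothetical helper)."""
    # A key that parses as an integer is used directly as the shift.
    try:
        return int(key)
    except ValueError:
        pass
    # Otherwise hash the key and map it into [-range_days, range_days].
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 * range_days + 1) - range_days


def shift_date(text_date: str, days: int, fmt: str = "%m/%d/%Y") -> str:
    """Shift a formatted date string by a number of days."""
    shifted = datetime.strptime(text_date, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


# The same patient ID always yields the same shift, so all of a patient's
# dates move together and the intervals between them are preserved.
shift_a001 = day_shift_for("A001")
print(shift_date("10/02/2022", shift_a001))

# An integer-like value is taken as the shift itself.
print(shift_date("10/02/2022", day_shift_for("10")))  # 10/12/2022
```

Because the shift depends only on the key, two documents for the same patient are moved by the same number of days.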

Parameters:

  • setPatientIdColumn (String): Sets the name of the column containing the patient ID.

  • setDateShiftColumn (String): Sets column to be used for hash or predefined shift.

  • setNewDateShift (String): Sets column that has a reference of where chunk begins.

  • setRangeDays (int): Sets the range of dates to be sampled from.

  • setSeed (int): Sets the seed for random number generator.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: DocumentHashCoder Scala API: DocumentHashCoder Notebook: DocumentHashCoderNotebook

from johnsnowlabs import nlp, medical
import pandas as pd

data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 
                    'A002', 'A002'],
     'text' : ['Chris Brown was discharged on 10/02/2022', 
               'Mark White was discharged on 02/28/2020', 
               'John was discharged on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ],
     'dateshift' : ['10', '10', 
                    '30', '30']
    }
)

my_input_df = spark.createDataFrame(data)

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setNewDateShift("shift_days")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

de_identification = medical.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document2"]) \
    .setOutputCol("deid_text") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")
pipeline_model = pipeline.fit(empty_data)

output = pipeline_model.transform(my_input_df)
output.select('patientID','text', 'deid_text.result').show(truncate = False)

+---------+----------------------------------------+---------------------------------------------+
|patientID|text                                    |result                                       |
+---------+----------------------------------------+---------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Aldona Bar was discharged on 05/18/2022]    |
|A001     |Mark White was discharged on 02/28/2020 |[Leta Speller was discharged on 10/14/2019]  |
|A002     |John was discharged on 03/15/2022       |[Lonia Blood was discharged on 01/19/2022]   |
|A002     |John Moore was discharged on 12/31/2022 |[Murriel Hopper was discharged on 11/06/2022]|
+---------+----------------------------------------+---------------------------------------------+
from johnsnowlabs import nlp, finance
import pandas as pd

data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 
                    'A002', 'A002'],
     'text' : ['Chris Brown was discharged on 10/02/2022', 
               'Mark White was discharged on 02/28/2020', 
               'John was discharged on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ],
     'dateshift' : ['10', '10', 
                    '30', '30']
    }
)

my_input_df = spark.createDataFrame(data)

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = finance.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setNewDateShift("shift_days")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = finance.NerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

de_identification = finance.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document2"]) \
    .setOutputCol("deid_text") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")
pipeline_model = pipeline.fit(empty_data)

output = pipeline_model.transform(my_input_df)
output.select('patientID','text', 'deid_text.result').show(truncate = False)

+---------+----------------------------------------+----------------------------------------------+
|patientID|text                                    |result                                        |
+---------+----------------------------------------+----------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001     |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019]      |
|A002     |John was discharged on 03/15/2022       |[Lane Hacker was discharged on 02/17/2022]    |
|A002     |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
from johnsnowlabs import nlp, legal
import pandas as pd

data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 
                    'A002', 'A002'],
     'text' : ['Chris Brown was discharged on 10/02/2022', 
               'Mark White was discharged on 02/28/2020', 
               'John was discharged on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ],
     'dateshift' : ['10', '10', 
                    '30', '30']
    }
)

my_input_df = spark.createDataFrame(data)

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = legal.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setNewDateShift("shift_days")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = legal.NerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = legal.NerConverterInternal()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

de_identification = legal.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document2"]) \
    .setOutputCol("deid_text") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")
pipeline_model = pipeline.fit(empty_data)

output = pipeline_model.transform(my_input_df)
output.select('patientID','text', 'deid_text.result').show(truncate = False)

+---------+----------------------------------------+----------------------------------------------+
|patientID|text                                    |result                                        |
+---------+----------------------------------------+----------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001     |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019]      |
|A002     |John was discharged on 03/15/2022       |[Lane Hacker was discharged on 02/17/2022]    |
|A002     |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+

import spark.implicits._
  
val data = Seq(
  ("A001", "Chris Brown was discharged on 10/02/2022"),
  ("A001", "Mark White was discharged on 02/28/2020"),
  ("A002", "John was discharged on 03/15/2022"),
  ("A002", "John Moore was discharged on 12/31/2022")
)

val columns = Seq("patientID", "text")
val myInputDF = spark.createDataFrame(data).toDF(columns: _*)

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val documentHasher = new DocumentHashCoder()
    .setInputCols("document") 
    .setOutputCol("document2") 
    .setPatientIdColumn("patientID") 
    .setNewDateShift("shift_days") 

val tokenizer = new Tokenizer()
    .setInputCols("document2") 
    .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("document2","token")) 
    .setOutputCol("word_embeddings") 

val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented","en","clinical/models")
    .setInputCols(Array("document2","token","word_embeddings")) 
    .setOutputCol("ner") 

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("document2","token","ner")) 
    .setOutputCol("ner_chunk") 

val de_identification = new DeIdentification()
    .setInputCols(Array("ner_chunk","token","document2")) 
    .setOutputCol("deid_text") 
    .setMode("obfuscate") 
    .setObfuscateDate(true) 
    .setDateTag("DATE") 
    .setLanguage("en") 
    .setObfuscateRefSource("faker") 
    .setUseShifDays(true) 
    .setRegion("us") 

// Stage names must match the vals declared above
val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      documentHasher,
      tokenizer,
      embeddings,
      clinical_ner,
      ner_converter,
      de_identification
))

val emptyData = Seq(("", "")).toDF("text", "patientID")

val pipelineModel = pipeline.fit(emptyData)
val result = pipelineModel.transform(myInputDF)
result.select("patientID", "text", "deid_text.result").show(false)

+---------+----------------------------------------+----------------------------------------------+
|patientID|text                                    |result                                        |
+---------+----------------------------------------+----------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001     |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019]      |
|A002     |John was discharged on 03/15/2022       |[Lane Hacker was discharged on 02/17/2022]    |
|A002     |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
import spark.implicits._
  
val data = Seq(
  ("A001", "Chris Brown was discharged on 10/02/2022"),
  ("A001", "Mark White was discharged on 02/28/2020"),
  ("A002", "John was discharged on 03/15/2022"),
  ("A002", "John Moore was discharged on 12/31/2022")
)

val columns = Seq("patientID", "text")
val myInputDF = spark.createDataFrame(data).toDF(columns: _*)

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val documentHasher = new DocumentHashCoder()
    .setInputCols("document") 
    .setOutputCol("document2") 
    .setPatientIdColumn("patientID") 
    .setNewDateShift("shift_days") 

val tokenizer = new Tokenizer()
    .setInputCols("document2")
    .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("document2","token")) 
    .setOutputCol("word_embeddings") 

val clinical_ner = FinanceNerModel.pretrained("ner_deid_subentity_augmented","en","clinical/models")
    .setInputCols(Array("document2","token","word_embeddings")) 
    .setOutputCol("ner") 

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("document2","token","ner")) 
    .setOutputCol("ner_chunk") 

val de_identification = new DeIdentification()
    .setInputCols(Array("ner_chunk","token","document2")) 
    .setOutputCol("deid_text") 
    .setMode("obfuscate") 
    .setObfuscateDate(true) 
    .setDateTag("DATE") 
    .setLanguage("en") 
    .setObfuscateRefSource("faker") 
    .setUseShifDays(true) 
    .setRegion("us") 

// Stage names must match the vals declared above
val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      documentHasher,
      tokenizer,
      embeddings,
      clinical_ner,
      ner_converter,
      de_identification
))

val emptyData = Seq(("", "")).toDF("text", "patientID")
val pipelineModel = pipeline.fit(emptyData)
val result = pipelineModel.transform(myInputDF)
result.select("patientID", "text", "deid_text.result").show(false)

+---------+----------------------------------------+----------------------------------------------+
|patientID|text                                    |result                                        |
+---------+----------------------------------------+----------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001     |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019]      |
|A002     |John was discharged on 03/15/2022       |[Lane Hacker was discharged on 02/17/2022]    |
|A002     |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
import spark.implicits._
  
val data = Seq(
  ("A001", "Chris Brown was discharged on 10/02/2022"),
  ("A001", "Mark White was discharged on 02/28/2020"),
  ("A002", "John was discharged on 03/15/2022"),
  ("A002", "John Moore was discharged on 12/31/2022")
)

val columns = Seq("patientID", "text")
val myInputDF = spark.createDataFrame(data).toDF(columns: _*)

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val documentHasher = new DocumentHashCoder()
    .setInputCols("document") 
    .setOutputCol("document2") 
    .setPatientIdColumn("patientID") 
    .setNewDateShift("shift_days") 

val tokenizer = new Tokenizer()
    .setInputCols("document2")
    .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("document2","token")) 
    .setOutputCol("word_embeddings") 

val clinical_ner = LegalNerModel.pretrained("ner_deid_subentity_augmented","en","clinical/models")
    .setInputCols(Array("document2","token","word_embeddings")) 
    .setOutputCol("ner") 

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("document2","token","ner")) 
    .setOutputCol("ner_chunk") 

val de_identification = new DeIdentification()
    .setInputCols(Array("ner_chunk","token","document2")) 
    .setOutputCol("deid_text") 
    .setMode("obfuscate") 
    .setObfuscateDate(true) 
    .setDateTag("DATE") 
    .setLanguage("en") 
    .setObfuscateRefSource("faker") 
    .setUseShifDays(true) 
    .setRegion("us") 

// Stage names must match the vals declared above
val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      documentHasher,
      tokenizer,
      embeddings,
      clinical_ner,
      ner_converter,
      de_identification
))

val emptyData = Seq(("", "")).toDF("text", "patientID")

val pipelineModel = pipeline.fit(emptyData)
val result = pipelineModel.transform(myInputDF)
result.select("patientID", "text", "deid_text.result").show(false)

+---------+----------------------------------------+----------------------------------------------+
|patientID|text                                    |result                                        |
+---------+----------------------------------------+----------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001     |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019]      |
|A002     |John was discharged on 03/15/2022       |[Lane Hacker was discharged on 02/17/2022]    |
|A002     |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+

DocumentLogRegClassifier

Model

Classifies documents with a logistic regression algorithm. Currently there are no pretrained models available. Please see DocumentLogRegClassifierApproach to train your own model.

Parameters:

  • setMergeChunks: Sets whether to merge all chunks in a document or not (Default: false).

  • setLabels: Sets array to output the label in the original form.

  • setVectorizationModel: Sets a path to the vectorization model if it has already been trained.

  • setClassificationModel: Sets a path to the classification model if it has already been trained.

Please check out the Models Hub for available models in the future.

Input Annotator Types: TOKEN

Output Annotator Type: CATEGORY

Python API: DocumentLogRegClassifierModel Scala API: DocumentLogRegClassifierModel

Approach

Trains a model to classify documents with a logistic regression algorithm. The training data requires one column for the text and one for its label. The result is a trained DocumentLogRegClassifierModel.

Parameters:

  • maxIter: Maximum number of iterations.

  • tol: Convergence tolerance after each iteration.

  • setLabels: Sets array to output the label in the original form.

  • setVectorizationModel: Sets a path to the vectorization model if it has already been trained.

  • setClassificationModel: Sets a path to the classification model if it has already been trained.
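
To make the roles of maxIter and tol concrete, here is a minimal plain-Python sketch of binary logistic regression over bag-of-words token counts. It is an illustration under stated assumptions, not the annotator's implementation: the function names `train_logreg` and `predict`, the learning rate `lr`, the plain gradient-descent optimiser, and the toy data are all hypothetical.

```python
import math


def train_logreg(docs, labels, max_iter=200, tol=1e-6, lr=0.5):
    """Fit binary logistic regression on tokenised documents (toy sketch)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    weights = [0.0] * len(vocab)
    bias = 0.0
    prev_loss = float("inf")
    n = len(docs)
    for _ in range(max_iter):  # stop after at most max_iter passes
        grad_w = [0.0] * len(vocab)
        grad_b = 0.0
        loss = 0.0
        for doc, y in zip(docs, labels):
            z = bias + sum(weights[index[t]] for t in doc)
            p = 1.0 / (1.0 + math.exp(-z))
            # Accumulate the cross-entropy loss and its gradient.
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            for t in doc:
                grad_w[index[t]] += p - y
            grad_b += p - y
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        # Stop early once the loss changes by less than tol.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return index, weights, bias


def predict(model, doc):
    """Return 1 or 0 for a tokenised document; unseen tokens are ignored."""
    index, weights, bias = model
    z = bias + sum(weights[index[t]] for t in doc if t in index)
    return 1 if z >= 0 else 0


# Toy training data: token lists (as a stemmer might produce) with 0/1 labels.
docs = [
    ["chest", "pain", "cardiac"],
    ["cardiac", "arrest"],
    ["loan", "interest", "rate"],
    ["bank", "loan"],
]
labels = [1, 1, 0, 0]

model = train_logreg(docs, labels)
print(predict(model, ["cardiac", "pain"]))
print(predict(model, ["loan", "bank"]))
```

A larger `max_iter` or a smaller `tol` lets training run longer before stopping, at the cost of more passes over the data.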

Input Annotator Types: TOKEN

Output Annotator Type: CATEGORY

Python API: DocumentLogRegClassifierApproach Scala API: DocumentLogRegClassifierApproach

from johnsnowlabs import nlp, medical
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = nlp.Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

stopwords_cleaner = nlp.StopWordsCleaner() \
  .setInputCols(["normalized"]) \
  .setOutputCol("cleanTokens") \
  .setCaseSensitive(False)

stemmer = nlp.Stemmer() \
  .setInputCols(["cleanTokens"]) \
  .setOutputCol("stem")

# Define the document classifier and fit training data to it
logreg = medical.DocumentLogRegClassifierApproach() \
  .setInputCols(["stem"]) \
  .setLabelCol("category") \
  .setOutputCol("prediction")

pipeline = nlp.Pipeline(stages=[
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
])

model = pipeline.fit(trainingData)
from johnsnowlabs import nlp, finance
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = nlp.Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

stopwords_cleaner = nlp.StopWordsCleaner() \
  .setInputCols(["normalized"]) \
  .setOutputCol("cleanTokens") \
  .setCaseSensitive(False)

stemmer = nlp.Stemmer() \
  .setInputCols(["cleanTokens"]) \
  .setOutputCol("stem")

# Define the document classifier and fit training data to it
logreg = finance.DocumentLogRegClassifierApproach() \
  .setInputCols(["stem"]) \
  .setLabelCol("category") \
  .setOutputCol("prediction")

pipeline = nlp.Pipeline(stages=[
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
])

model = pipeline.fit(trainingData)
from johnsnowlabs import nlp, legal
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = nlp.Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

stopwords_cleaner = nlp.StopWordsCleaner() \
  .setInputCols(["normalized"]) \
  .setOutputCol("cleanTokens") \
  .setCaseSensitive(False)

stemmer = nlp.Stemmer() \
  .setInputCols(["cleanTokens"]) \
  .setOutputCol("stem")

# Define the document classifier and fit training data to it
logreg = legal.DocumentLogRegClassifierApproach() \
  .setInputCols(["stem"]) \
  .setLabelCol("category") \
  .setOutputCol("prediction")

pipeline = nlp.Pipeline(stages=[
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
])

model = pipeline.fit(trainingData)

import spark.implicits._ 
// Define pipeline stages to prepare the data
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwords_cleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val stemmer = new Stemmer()
  .setInputCols("cleanTokens")
  .setOutputCol("stem")

// Define the document classifier and fit training data to it
val logreg = new DocumentLogRegClassifierApproach()
  .setInputCols("stem")
  .setLabelCol("category")
  .setOutputCol("prediction")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
))

val model = pipeline.fit(trainingData)

DocumentMLClassifier

DocumentMLClassifier classifies documents with a Logistic Regression algorithm.

Input Annotator Types: TOKEN

Output Annotator Type: CATEGORY

Python API: DocumentMLClassifierModel Scala API: DocumentMLClassifierModel Notebook: DocumentMLClassifierModelNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_ml = medical.DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")\
    .setInputCols("token")\
    .setOutputCol("prediction")

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    classifier_ml])

data = spark.createDataFrame([["""I feel great after taking tylenol."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text")

result = clf_Pipeline.fit(data).transform(data)


# Show results
result.select('text','prediction.result').show(truncate=False)

+----------------------------------------------------------------------------------------+-------+
|text                                                                                    |result |
+----------------------------------------------------------------------------------------+-------+
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[False]|
|I feel great after taking tylenol.                                                      |[False]|
+----------------------------------------------------------------------------------------+-------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")
    .setInputCols("token")
    .setOutputCol("prediction")

val clf_Pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    tokenizer, 
    classifier_ml))

val data = Seq(
  "I feel great after taking tylenol.",
  "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDF("text")
  
val result = clf_Pipeline.fit(data).transform(data)

// Show results
result.select("text", "prediction.result").show(false)

+----------------------------------------------------------------------------------------+-------+
|text                                                                                    |result |
+----------------------------------------------------------------------------------------+-------+
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[False]|
|I feel great after taking tylenol.                                                      |[False]|
+----------------------------------------------------------------------------------------+-------+

Trains a model to classify documents with a Logistic Regression algorithm. The training data requires one column for the text and one for its label. The result is a trained DocumentMLClassifierModel.

Parameters:

  • labelCol: (str) Sets the column that holds the label to predict.
  • maxIter: (int) Sets the maximum number of iterations.
  • tol: (float) Sets the convergence tolerance after each iteration.
  • fitIntercept: (boolean) Sets whether to fit an intercept term. Default is true.
  • vectorizationModelPath: (str) Sets a path to the vectorization model if it has already been trained.
  • classificationModelPath: (str) Sets a path to the classification model if it has already been trained.
  • classificationModelClass: (str) Sets the classification model class from SparkML to use. Possible values are: logreg, svm.
  • minTokenNgram: (int) Sets the minimum number of tokens for n-grams.
  • maxTokenNgram: (int) Sets the maximum number of tokens for n-grams.
  • mergeChunks: (boolean) Whether to merge all chunks in a document or not. Default is false.
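To make the meaning of minTokenNgram and maxTokenNgram concrete, here is a minimal sketch in plain Python of what a token n-gram range produces. The helper `token_ngrams` is hypothetical and is not part of the library; how the annotator builds its features internally is not described here, so this only illustrates the n-gram range itself.

```python
from typing import List


def token_ngrams(tokens: List[str], min_n: int, max_n: int) -> List[str]:
    """Return every contiguous n-gram with min_n <= n <= max_n, joined by spaces."""
    if min_n < 1 or max_n < min_n:
        raise ValueError("expected 1 <= min_n <= max_n")
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


example_tokens = ["I", "feel", "great"]
print(token_ngrams(example_tokens, 1, 2))
# ['I', 'feel', 'great', 'I feel', 'feel great']
```

With a range of 1 to 2, unigrams and bigrams are both emitted; widening the range adds longer phrases at the cost of a larger feature space.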

Input Annotator Types: TOKEN

Output Annotator Type: CATEGORY

Python API: DocumentMLClassifierApproach Scala API: DocumentMLClassifierApproach Notebook: DocumentMLClassifierApproachNotebook
Show Example

from johnsnowlabs import nlp, medical 

document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setFitIntercept(True)

pipeline = nlp.Pipeline(stages=[
    document, 
    token, 
    classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()

import spark.implicits._

val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val token = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val classifier_logreg = new DocumentMLClassifierApproach()
    .setInputCols("token")
    .setLabelCol("category")
    .setOutputCol("prediction")
    .setClassificationModelClass("logreg")
    .setFitIntercept(true) 

val pipeline = new Pipeline().setStages(Array(
    document,
    token,
    classifier_logreg)) 

val result_logreg = pipeline.fit(train_data).transform(test_data).cache()

DrugNormalizer

Model

Annotator which normalizes raw text from clinical documents, e.g. scraped web pages or XML documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can remove unwanted characters according to a specific policy. Can apply lowercase normalization.

Parameters:

  • lowercase: (boolean) whether to convert strings to lowercase. Default is False.

  • policy: (str) rule to remove patterns from text. Valid policy values are:

    • “all”,
    • “abbreviations”,
    • “dosages”

See Spark NLP Workshop for more examples of usage.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: DrugNormalizer Scala API: DrugNormalizer Notebook: DrugNormalizerNotebook
Show Example

from johnsnowlabs import nlp, medical

# Sample data
data_to_normalize = spark.createDataFrame([
            ("A", "Sodium Chloride/Potassium Chloride 13bag", "Sodium Chloride / Potassium Chloride 13 bag"),
            ("B", "interferon alfa-2b 10 million unit ( 1 ml ) injec", "interferon alfa - 2b 10000000 unt ( 1 ml ) injection"),
            ("C", "aspirin 10 meq/ 5 ml oral sol", "aspirin 2 meq/ml oral solution")
        ]).toDF("cuid", "text", "target_normalized_text")

# Annotator that transforms a text column from dataframe into normalized text (with all policy)

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

drug_normalizer = medical.DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized") \
    .setPolicy("all")

drug_normalizer_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    drug_normalizer
    ])

ds = drug_normalizer_pipeline.fit(data_to_normalize).transform(data_to_normalize)

ds = ds.selectExpr("document", "target_normalized_text", "explode(document_normalized.result) as all_normalized_text")
ds.show(truncate = False)

+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|document                                                                                   |target_normalized_text                              |all_normalized_text                                 |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|[{document, 0, 39, Sodium Chloride/Potassium Chloride 13bag, {sentence -> 0}, []}]         |Sodium Chloride / Potassium Chloride 13 bag         |Sodium Chloride / Potassium Chloride 13 bag         |
|[{document, 0, 48, interferon alfa-2b 10 million unit ( 1 ml ) injec, {sentence -> 0}, []}]|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|[{document, 0, 28, aspirin 10 meq/ 5 ml oral sol, {sentence -> 0}, []}]                    |aspirin 2 meq/ml oral solution                      |aspirin 2 meq/ml oral solution                      |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+

import spark.implicits._

// Sample data 
val data_to_normalize = Seq(
  ("A", "Sodium Chloride/Potassium Chloride 13bag", "Sodium Chloride / Potassium Chloride 13 bag"),
  ("B", "interferon alfa-2b 10 million unit ( 1 ml ) injec", "interferon alfa - 2b 10000000 unt ( 1 ml ) injection"),
  ("C", "aspirin 10 meq/ 5 ml oral sol", "aspirin 2 meq/ml oral solution")
).toDF("cuid", "text", "target_normalized_text")

// Annotator that transforms a text column from dataframe into normalized text (with all policy) 

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val drug_normalizer = new DrugNormalizer()
 .setInputCols("document") 
 .setOutputCol("document_normalized") 
 .setPolicy("all") 

val drug_normalizer_pipeline = new Pipeline().setStages(Array(
  document_assembler, 
  drug_normalizer)) 

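val ds = drug_normalizer_pipeline.fit(data_to_normalize).transform(data_to_normalize)

ds.selectExpr("document", "target_normalized_text", "explode(document_normalized.result) as all_normalized_text")
  .show(false)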

+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|document                                                                                   |target_normalized_text                              |all_normalized_text                                 |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|[{document, 0, 39, Sodium Chloride/Potassium Chloride 13bag, {sentence -> 0}, []}]         |Sodium Chloride / Potassium Chloride 13 bag         |Sodium Chloride / Potassium Chloride 13 bag         |
|[{document, 0, 48, interferon alfa-2b 10 million unit ( 1 ml ) injec, {sentence -> 0}, []}]|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|[{document, 0, 28, aspirin 10 meq/ 5 ml oral sol, {sentence -> 0}, []}]                    |aspirin 2 meq/ml oral solution                      |aspirin 2 meq/ml oral solution                      |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+

EntityChunkEmbeddings

Model

Weighted average embeddings of multiple named entities chunk annotations.

Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector representation of related entity chunks. The input to the model consists of chunks of recognized named entities. One or more entities are selected as target entities, and for each of them a list of related entities is specified (if empty, all other entities are assumed to be related).

The model looks for chunks of the target entities and then tries to pair each target entity (e.g. DRUG) with other related entities (e.g. DOSAGE, STRENGTH, FORM, etc). The criterion for pairing a target entity with another related entity is that they appear in the same sentence and the maximal syntactic distance is below a predefined threshold.

The relationship between target and related entities is one-to-many, meaning that if there are multiple instances of the same target entity (e.g. DRUG) within a sentence, the model will map a related entity (e.g. DOSAGE) to at most one of the instances of the target entity. For example, in the sentence “The patient was given 125 mg of paracetamol and metformin”, the model will pair “125 mg” with “paracetamol”, but not with “metformin”.
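The pairing rule described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the `Chunk` class and the `pair_related` helper are hypothetical, the gap in token positions stands in for the real syntactic distance taken from the dependency parse, ties are broken by choosing the nearest target, and entity labels are assumed to be already in the same case.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Chunk:
    text: str
    entity: str
    sentence: int  # sentence index
    start: int     # index of the first token of the chunk
    end: int       # index of the last token of the chunk


def token_gap(a: Chunk, b: Chunk) -> int:
    """Assumed stand-in for syntactic distance: gap in token positions."""
    if a.end < b.start:
        return b.start - a.end
    if b.end < a.start:
        return a.start - b.end
    return 0


def pair_related(
    chunks: List[Chunk],
    target_entities: Dict[str, List[str]],
    max_distance: int,
) -> List[Tuple[str, List[str]]]:
    targets = [c for c in chunks if c.entity in target_entities]
    related = [c for c in chunks if c.entity not in target_entities]
    pairs: List[Tuple[str, List[str]]] = [(t.text, []) for t in targets]
    for chunk in related:
        best: Optional[Tuple[int, int]] = None  # (distance, target index)
        for i, target in enumerate(targets):
            allowed = target_entities[target.entity]
            # An empty list means every other entity counts as related
            if allowed and chunk.entity not in allowed:
                continue
            if chunk.sentence != target.sentence:
                continue
            distance = token_gap(chunk, target)
            if distance <= max_distance and (best is None or distance < best[0]):
                best = (distance, i)
        # Each related chunk is attached to at most one target instance
        if best is not None:
            pairs[best[1]][1].append(chunk.text)
    return pairs


# "The patient was given 125 mg of paracetamol and metformin"
chunks = [
    Chunk("125 mg", "STRENGTH", 0, 4, 5),
    Chunk("paracetamol", "DRUG", 0, 7, 7),
    Chunk("metformin", "DRUG", 0, 9, 9),
]
pairs = pair_related(chunks, {"DRUG": ["STRENGTH"]}, max_distance=2)
print(pairs)  # [('paracetamol', ['125 mg']), ('metformin', [])]
```

The sketch does not cover entities that act as both target and related entity; it only shows why “125 mg” ends up attached to a single DRUG chunk.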

The output of the model is the average embedding of the chunks of each target entity and its related entities. It is possible to specify a particular weight for each entity type.
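As a rough illustration of a weighted average over chunk embeddings, the following plain-Python sketch combines per-chunk vectors using per-entity weights. The function name and the normalization by the sum of weights are assumptions made for the example; the exact formula used by the annotator is not given here.

```python
from typing import List, Sequence, Tuple


def weighted_average(weighted_vectors: List[Tuple[float, Sequence[float]]]) -> List[float]:
    """Average vectors, each scaled by its weight and normalized by the total weight."""
    total_weight = sum(weight for weight, _ in weighted_vectors)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    dim = len(weighted_vectors[0][1])
    result = [0.0] * dim
    for weight, vector in weighted_vectors:
        for i in range(dim):
            result[i] += weight * vector[i]
    return [value / total_weight for value in result]


# Toy 2-dimensional embeddings: a DRUG chunk with weight 3 and a STRENGTH chunk with weight 1
drug_vector = [1.0, 0.0]
strength_vector = [0.0, 1.0]
combined = weighted_average([(3.0, drug_vector), (1.0, strength_vector)])
print(combined)  # [0.75, 0.25]
```

A higher weight pulls the combined vector toward that entity's embedding, which is the practical effect of setting entity weights.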

An entity can be defined both as a target entity and as a related entity for some other target entity. For example, we may want to compute the embeddings of SYMPTOMs and their related entities, as well as the embeddings of DRUGs and their related entities, one of which is also SYMPTOM. In such cases, it is possible to use the TARGET_ENTITY:RELATED_ENTITY notation to specify the weight of a related entity (e.g. “DRUG:SYMPTOM” to set the weight of SYMPTOM when it appears as a related entity to target entity DRUG). The relative weights of entities for particular entity chunk embeddings are available in the annotations metadata.

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model.

Parameters:

  • targetEntities: (dict) The target entities mapped to lists of their related entities. A target entity with an empty list of related entities means all other entities are assumed to be related to it. Entity names are case insensitive. At least one target entity must be set.

  • entityWeights: (dict) The relative weights of the entities. If not set, all entities have equal weights. If the dictionary is non-empty and some entity is not in it, then its weight is set to 0. The notation TARGET_ENTITY:RELATED_ENTITY can be used to specify the weight of an entity which is related to a specific target entity (e.g. “DRUG:SYMPTOM” -> 0.3f). Entity names are case insensitive.

  • maxSyntacticDistance: (int) Maximal syntactic distance between a target entity and its related entities. Default value is 2.
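The weight rules listed for entityWeights can be read as a small lookup procedure. The sketch below is a plain-Python reading of those rules, not the library's code: `resolve_weight` is a hypothetical helper, the equal weight is represented as 1.0, and giving the TARGET_ENTITY:RELATED_ENTITY key precedence over the plain entity key is an assumption.

```python
from typing import Dict


def resolve_weight(entity_weights: Dict[str, float], target: str, entity: str) -> float:
    """Look up the weight of `entity` when it contributes to the embedding of `target`."""
    if not entity_weights:
        return 1.0  # not set: all entities weigh the same
    # Entity names are case insensitive
    normalized = {key.upper(): value for key, value in entity_weights.items()}
    specific_key = f"{target}:{entity}".upper()
    if specific_key in normalized:
        return normalized[specific_key]
    # Entities missing from a non-empty mapping get weight 0
    return normalized.get(entity.upper(), 0.0)


weights = {"DRUG": 0.8, "symptom": 0.5, "DRUG:SYMPTOM": 0.3}
print(resolve_weight(weights, "DRUG", "SYMPTOM"))     # 0.3
print(resolve_weight(weights, "SYMPTOM", "Symptom"))  # 0.5
print(resolve_weight(weights, "DRUG", "FORM"))        # 0.0
print(resolve_weight({}, "DRUG", "FORM"))             # 1.0
```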

The default model is "sbiobert_base_cased_mli" from clinical/models. Other available models can be found at Models Hub.

Input Annotator Types: DEPENDENCY, CHUNK

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: EntityChunkEmbeddingsModel Scala API: EntityChunkEmbeddingsModel Notebook: EntityChunkEmbeddingsModelNotebook
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

posology_ner_model = medical.NerModel().pretrained("ner_posology_large", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")

pos_tager = nlp.PerceptronModel().pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("pos_tag")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tag", "token"])\
    .setOutputCol("dependencies")

entity_chunk_embeddings = medical.EntityChunkEmbeddings().pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("drug_chunk_embeddings")

entity_chunk_embeddings.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})

rxnorm_re = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models")\
    .setInputCols(["drug_chunk_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

rxnorm_pipeline_re = nlp.Pipeline(
    stages=[
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        entity_chunk_embeddings,
        rxnorm_re,
    ]
)

rxnorm_model = rxnorm_pipeline_re.fit(spark.createDataFrame([[""]]).toDF("text"))

data_df = spark.createDataFrame(
    [
        [
            "The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen."
        ],
        [
            "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet"
        ],
    ]
).toDF("text")

results = rxnorm_model.transform(data_df)
results.select("drug_chunk_embeddings.result", "drug_chunk_embeddings.embeddings").show(truncate=200)

+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                              result|                                                                                                                                                                                              embeddings|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|               [metformin 500 mg tablet, 2.5 mg coumadin, ibuprofen]|[[0.13060866, 0.26946265, -0.50702775, 0.7724293, 0.7356907, 0.0962475, -0.5546377, 0.0534295, -0.55345106, 0.48484787, -0.35735086, 0.49109104, 0.84404886, 0.30384326, -0.9923568, -0.24454081, 0.3...|
|[metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet]|[[-0.177948, 0.25489503, -0.5724586, 0.8031439, 0.9211674, 0.3558219, -0.37258363, -0.194855, -0.7407244, 0.48175216, 0.040639203, 0.6822441, 0.5768623, -0.19830275, -1.1513872, -0.32279214, 0.6181...|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documenter = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
    .setInputCols("document") 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols("sentence") 
    .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") 
    .setInputCols(Array("sentence","token")) 
    .setOutputCol("embeddings") 

val posology_ner_model = MedicalNerModel.pretrained("ner_posology_large","en","clinical/models") 
    .setInputCols(Array("sentence","token","embeddings")) 
    .setOutputCol("ner") 

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner")) 
    .setOutputCol("ner_chunk") 

val pos_tager = PerceptronModel.pretrained("pos_clinical","en","clinical/models") 
    .setInputCols(Array("sentence","token")) 
    .setOutputCol("pos_tag") 

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu","en") 
    .setInputCols(Array("sentence","pos_tag","token")) 
    .setOutputCol("dependencies") 

val entity_chunk_embeddings = EntityChunkEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") 
    .setInputCols(Array("ner_chunk","dependencies")) 
    .setOutputCol("drug_chunk_embeddings") 

entity_chunk_embeddings.setTargetEntities(Map("DRUG" -> List("STRENGTH", "ROUTE", "FORM")))

val rxnorm_re = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_re","en","clinical/models")
    .setInputCols("drug_chunk_embeddings")
    .setOutputCol("rxnorm_code") 
    .setDistanceFunction("EUCLIDEAN") 

val rxnorm_pipeline_re = new Pipeline().setStages(Array( 
    documenter, 
    sentence_detector, 
    tokenizer, 
    embeddings, 
    posology_ner_model, 
    ner_converter, 
    pos_tager, 
    dependency_parser, 
    entity_chunk_embeddings,
    rxnorm_re)) 

val data_df = Seq(
  "The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen.",
  "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet"
).toDF("text")

val results = rxnorm_pipeline_re.fit(data_df).transform(data_df)


+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                              result|                                                                                                                                                                                              embeddings|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|               [metformin 500 mg tablet, 2.5 mg coumadin, ibuprofen]|[[0.13060866, 0.26946265, -0.50702775, 0.7724293, 0.7356907, 0.0962475, -0.5546377, 0.0534295, -0.55345106, 0.48484787, -0.35735086, 0.49109104, 0.84404886, 0.30384326, -0.9923568, -0.24454081, 0.3...|
|[metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet]|[[-0.177948, 0.25489503, -0.5724586, 0.8031439, 0.9211674, 0.3558219, -0.37258363, -0.194855, -0.7407244, 0.48175216, 0.040639203, 0.6822441, 0.5768623, -0.19830275, -1.1513872, -0.32279214, 0.6181...|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

EntityRulerInternal

This annotator matches exact strings or regex patterns provided in a file against a Document and assigns them a named entity. The definitions can contain any number of named entities.

Parameters:

  • setPatternsResource (str): Sets the resource, in JSON or CSV format, that maps entities to patterns. Arguments:

    • path (str): Path to the resource.
    • read_as (str, optional): How to interpret the resource. Default is ReadAs.TEXT.
    • options (dict, optional): Options for parsing the resource. Default is {“format”: “JSON”}.

  • setSentenceMatch (Boolean): Whether to find match at sentence level. True: sentence level. False: token level.

  • setAlphabetResource (str): Alphabet Resource (a simple plain text with all language characters)

  • setUseStorage (Boolean): Sets whether to use RocksDB storage to serialize patterns.

See Spark NLP Workshop for more examples of usage.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Show Example

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

entityRuler = medical.EntityRulerInternalApproach()\
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("entities.json")\
    .setCaseSensitive(False)

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entityRuler
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

# Result
+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val entityRuler = new EntityRulerInternalApproach()
  .setInputCols(Array("document", "token"))
  .setOutputCol("entities")
  .setPatternsResource("entities.json")
  .setCaseSensitive(false)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  entityRuler
))

val data = Seq(
  ("""John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsillitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.""")
).toDF("text")

val model = pipeline.fit(data)
val result = model.transform(data)

// Result
+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+

EntityRulerInternal will handle the chunks output based on the patterns defined, as shown in the example below. We can define an id field to identify entities.
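The examples pass "entities.json" to setPatternsResource without showing the file. A minimal sketch of how such a file could be produced follows, assuming the JSON layout used by the open-source Spark NLP EntityRuler (a list of objects with a label, a list of patterns, and an optional id). The ids and the pattern lists are hypothetical, chosen only to match the labels in the result table.

```python
import json
from pathlib import Path

# Hypothetical pattern definitions; "id" is optional and identifies the entity
entity_patterns = [
    {
        "id": "drug-names",
        "label": "Drug",
        "patterns": ["aspirin", "paracetamol", "ibuprofen", "lansoprazol"],
    },
    {
        "id": "disease-names",
        "label": "Disease",
        "patterns": ["heart condition", "tonsilitis", "GORD"],
    },
    {
        "id": "symptom-names",
        "label": "Symptom",
        "patterns": ["fever", "headache"],
    },
]

patterns_path = Path("entities.json")
patterns_path.write_text(json.dumps(entity_patterns, indent=2), encoding="utf-8")

# Read the file back to confirm it is valid JSON
loaded_patterns = json.loads(patterns_path.read_text(encoding="utf-8"))
print([entry["label"] for entry in loaded_patterns])  # ['Drug', 'Disease', 'Symptom']
```

The resulting path is what the pipeline examples hand to setPatternsResource.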

Parameters:

  • setPatternsResource (str): Sets the resource, in JSON or CSV format, that maps entities to patterns. Arguments:

    • path (str): Path to the resource.
    • read_as (str, optional): How to interpret the resource. Default is ReadAs.TEXT.
    • options (dict, optional): Options for parsing the resource. Default is {“format”: “JSON”}.

  • setSentenceMatch (Boolean): Whether to find match at sentence level. True: sentence level. False: token level.

  • setAlphabetResource (str): Alphabet Resource (a simple plain text with all language characters)

  • setUseStorage (Boolean): Sets whether to use RocksDB storage to serialize patterns.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

FeaturesAssembler

Model

The FeaturesAssembler is used to collect features from different columns. It can collect features from single value columns (anything which can be cast to a float; if the cast fails, the value is set to 0), array columns, or Spark NLP annotations (if the annotation is an embedding, it takes the embedding; otherwise it tries to cast the result field). The output of the transformer is a FEATURE_VECTOR annotation (the numeric vector is in the embeddings field).

Parameters:

  • inputCols: The names of the columns containing the input features. It accepts either a single column name (string) or an array of column names.

  • outputCol: The name of the output column that holds the generated FEATURE_VECTOR annotation. Only one column can be specified.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
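The collection rule for plain values (cast to float, fall back to 0 when the cast fails, flatten array columns) can be sketched in plain Python. The helpers `to_float` and `assemble_features` are hypothetical and only mirror the description; the handling of Spark NLP annotations is left out.

```python
from typing import Any, Iterable, List


def to_float(value: Any) -> float:
    """Cast a value to float; return 0.0 when the cast fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def assemble_features(row_values: Iterable[Any]) -> List[float]:
    """Flatten single values and array values into one numeric vector."""
    features: List[float] = []
    for value in row_values:
        if isinstance(value, (list, tuple)):
            features.extend(to_float(item) for item in value)
        else:
            features.append(to_float(value))
    return features


row = [42, "3.5", "n/a", [0.1, 0.2]]
feature_vector = assemble_features(row)
print(feature_vector)  # [42.0, 3.5, 0.0, 0.1, 0.2]
```

Note how the non-numeric string "n/a" silently becomes 0.0, so unexpected zeros in the output vector may point to columns that could not be cast.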

Input Annotator Types: NONE

Output Annotator Type: FEATURE_VECTOR

Python API: FeaturesAssembler Scala API: FeaturesAssembler Notebook: FeatureAssemblerNotebook
Show Example

from johnsnowlabs import medical, nlp 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

embeddings_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
    ])

data_df = spark.createDataFrame(
    [
        [
            # Adjacent string literals are concatenated, so the note stays one string
            "PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active flare of Inflammatory Bowel Disease, not responsive to conventional therapy including sulfasalazine, cortisone, local therapy. PROCEDURE: Informed consent was obtained prior to the procedure with special attention to benefits, risks, alternatives. Risks explained as bleeding, infection, bowel perforation, aspiration pneumonia, or reaction to the medications. Vital signs were monitored by blood pressure, heart rate, and oxygen saturation. Supplemental O2 given. Specifics discussed. Preprocedure physical exam performed. Stable vital signs. Lungs clear. Cardiac exam showed regular rhythm. Abdomen soft. Her past history, her past workup, her past visitation with me for Inflammatory Bowel Disease, well responsive to sulfasalazine reviewed. She currently has a flare and is not responding, therefore, likely may require steroid taper. At the same token, her symptoms are mild. She has rectal bleeding, essentially only some rusty stools. There is not significant diarrhea, just some lower stools. No significant pain. Therefore, it is possible that we are just dealing with a hemorrhoidal bleed, therefore, colonoscopy now needed. Past history reviewed. Specifics of workup, need for followup, and similar discussed. All questions answered. A normal digital rectal examination was performed. The PCF-160 AL was inserted into the anus and advanced to the cecum without difficulty, as identified by the ileocecal valve, cecal stump, and appendical orifice. All mucosal aspects thoroughly inspected, including a retroflexed examination. Withdrawal time was greater than six minutes. Unfortunately, the terminal ileum could not be intubated despite multiple attempts. Findings were those of a normal cecum, right colon, transverse colon, descending colon. A small cecal polyp was noted, this was biopsy-removed, placed in bottle #1. "
            "Random biopsies from the cecum obtained, bottle #2; random biopsies from the transverse colon obtained, as well as descending colon obtained, bottle #3. There was an area of inflammation in the proximal sigmoid colon, which was biopsied, placed in bottle #4. There was an area of relative sparing, with normal sigmoid lining, placed in bottle #5, randomly biopsied, and then inflammation again in the distal sigmoid colon and rectum biopsied, bottle #6, suggesting that we may be dealing with Crohn disease, given the relative sparing of the sigmoid colon and junk lesion. Retroflexed showed hemorrhoidal disease. Scope was then withdrawn, patient left in good condition. IMPRESSION: Active flare of Inflammatory Bowel Disease, question of Crohn disease. PLAN: I will have the patient follow up with me, will follow up on histology, follow up on the polyps. She will be put on a steroid taper and make an appointment and hopefully steroids alone will do the job. If not, she may be started on immune suppressive medication, such as azathioprine, or similar. All of this has been reviewed with the patient. All questions answered."
        ],
    ]
).toDF("text")

result = embeddings_pipeline.fit(data_df).transform(data_df)
result.select("features").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features|

|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.00896873, 0.011731416, 0.12154201, 0.1149235, -0.14689414, 0.0103584975, 0.053073216, -0.056412186, -0.05143186, -0.0118978135, -0.12175384, -0.035894137, 0.11812756, 0.094671555, 0.15838866, 0.15260744, -0.004094441, -0.13675772, -0.07472433, -0.035856977, -0.026730005, -0.21840473, 0.029632289, -0.011515695, -0.20407394, -0.07848257, 0.040990185, 0.23028605, 0.077140555, 0.066990435, 0.015219222, -0.10295644, 0.038072545, 0.10786369, 0.121525764, -0.09569349, -0.06309264, 0.2778952, 0.06462455, -0.10851931, -0.14370486, -0.1466352, 0.08354363, -0.078758985, -0.08377953, 0.12384644, -0.23281692, -0.25607574, 0.16399069, -0.07780675, -0.18302177, -0.18325584, -0.12128636, -0.0010129504, 0.0070792097, 0.20506753, 0.034964647, 0.058425985, 0.19572404, -0.103953235, -0.20159312, -0.099047214, -0.07337802, -0.03713124, -0.055443633, 0.11107734, 0.048563413, -0.038048305, -0.020617828, 0.17082842, 0.069010496, 0.08457101, -0.038229663, 0.073144384, -0.092326105, -0.10054428, -4.3286112E-4, -0.046703782, -0.080231875, 0.02524295, 0.01368699, -0.19783853, -0.03501917, 0.13324805, 0.09053264, -0.0958231, -0.0032442473, 0.19218525, -0.027179888, 0.030672349, 0.12848215, -0.014700146, -0.089054875, 0.13839856, -0.15778734, 0.07103226, -0.060303356, 0.20854644, -0.008389737, -0.1473986]}]|

from johnsnowlabs import nlp, finance 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained()\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

features_asm = finance.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

embeddings_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
    ])

data_df = spark.createDataFrame(
    [
        [
            "Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."
        ],
    ]
).toDF("text")

result = embeddings_pipeline.fit(data_df).transform(data_df)
result.select("features").show(truncate=False)


|features|

|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.05989722, 0.10907035, 0.25595385, -0.21656203, 0.20777024, -0.17276664, -0.045803867, -0.14506632, -0.16928527, -0.10008922, 0.18800992, -0.36529806, 0.22592439, -0.118487455, 0.006129823, -0.2674002, 0.37149927, 0.12375746, -0.30488327, 0.2507765, -0.060471725, -0.22705032, 0.39436466, 0.40368417, 0.15569581, 0.083455965, 0.11193783, -0.2783573, -0.23566169, -0.12444999, 0.22503565, 0.43343276, -0.3165808, -0.057086047, 0.050554093, 0.3512633, 0.17572127, 0.19258633, -0.09170296, -0.25344467, 0.018219033, -0.117947415, -0.03234701, -0.1549039, -0.0147800855, 0.076972865, 0.08612865, -0.14120182, -0.18348631, -0.4500436, 0.038739346, 0.12991442, -0.032128494, 0.7483725, -0.09843177, -1.6700389, 0.0060545397, -0.1044135, 1.2469376, 0.32064447, -0.17263599, 0.31999183, 0.0077194544, 0.15370668, 0.59472036, -0.16953614, 0.3042488, 0.25355336, 0.60402286, 0.07441569, -0.12468894, 0.03140718, -0.2630037, -0.37703836, 0.034783553, -0.058904923, 0.022686867, 0.07962498, -0.7945683, -0.21051218, 0.6615892, -0.18747853, -0.25412843, 0.26003888, -1.0803214, -0.026889319, -0.11805089, -0.14200646, -0.019682527, -0.2372327, 0.0090960255, -0.071929, -0.115089305, 0.21781716, -0.3569975, 0.07799677, -0.096894525, -0.34368798, 0.66465, 0.14913023]}]|

from johnsnowlabs import nlp, legal 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained()\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

features_asm = legal.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

embeddings_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
    ])

data_df = spark.createDataFrame(
    [
        [
            "This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."
        ],
    ]
).toDF("text")

result = embeddings_pipeline.fit(data_df).transform(data_df)
result.select("features").show(truncate=False)


|features|

|[{feature_vector, 0, 0, , {sentence -> 0}, [0.02474357, -0.08310143, 0.4801927, -0.070223466, 0.33147717, -0.18737249, -0.048361354, -0.052325998, 0.053252153, -0.0067390013, 0.2836935, -0.25569317, 0.3415577, -0.19251995, 0.051623292, -0.25131556, 0.3472208, -0.036604006, -0.35653928, 0.13225944, 0.18795085, -0.09561886, 0.4695179, 0.22093144, 0.32058474, 0.057281215, 0.082858086, -0.3714214, -0.19219379, -0.26751986, -0.148075, 0.6410107, -0.07821157, -0.06398429, 6.32831E-5, 0.21222909, 0.33145514, 0.2575328, 0.009346781, -0.21482512, -0.22197871, -0.14005142, 0.04592571, -0.2919176, 0.011854073, -0.14047821, 0.22201888, -0.13500921, -0.101019345, -0.31175214, -0.0031539474, 0.07841865, 0.23760447, 0.8622971, -0.21095662, -1.9944092, -0.090888076, -0.45743433, 1.5815442, 0.4848822, -0.12528154, 0.33802572, -0.16203907, -0.09874586, 0.63106954, -0.21860953, 0.39005432, 0.25023165, 0.66769457, -0.13867687, 0.02832079, -0.17432508, -0.05764636, -0.44529453, 0.032839067, -0.2266792, -0.002856281, 0.007823931, -1.0165309, 0.08553613, 0.38090998, 0.011592574, -0.18031952, 0.37968582, -0.77948713, -0.068393, -0.029594865, -0.2165647, 0.1665183, -0.23963346, -0.017649503, -0.24768801, -0.2725593, 0.14533372, -0.36786577, 0.23388086, -0.20129707, -0.33582142, 0.5970527, 0.12596472]}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer() 
    .setInputCols("document")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("word_embeddings")

val sentence_embeddings = new SentenceEmbeddings() 
    .setInputCols(Array("document", "word_embeddings")) 
    .setOutputCol("sentence_embeddings") 
    .setPoolingStrategy("AVERAGE")

val features_asm = new FeaturesAssembler()
    .setInputCols("sentence_embeddings")
    .setOutputCol("features")

val nlpPipeline = new Pipeline().setStages(Array(
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
))

val data = Seq(
  """PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active flare of Inflammatory Bowel Disease, not responsive to conventional therapy including sulfasalazine, cortisone, local therapy. PROCEDURE: Informed consent was obtained prior to the procedure with special attention to benefits, risks, alternatives. Risks explained as bleeding, infection, bowel perforation, aspiration pneumonia, or reaction to the medications. Vital signs were monitored by blood pressure, heart rate, and oxygen saturation. Supplemental O2 given. Specifics discussed. Preprocedure physical exam performed. Stable vital signs. Lungs clear. Cardiac exam showed regular rhythm. Abdomen soft. Her past history, her past workup, her past visitation with me for Inflammatory Bowel Disease, well responsive to sulfasalazine reviewed. She currently has a flare and is not responding, therefore, likely may require steroid taper. At the same token, her symptoms are mild. She has rectal bleeding, essentially only some rusty stools. There is not significant diarrhea, just some lower stools. No significant pain. Therefore, it is possible that we are just dealing with a hemorrhoidal bleed, therefore, colonoscopy now needed. Past history reviewed. Specifics of workup, need for followup, and similar discussed. All questions answered. A normal digital rectal examination was performed. The PCF-160 AL was inserted into the anus and advanced to the cecum without difficulty, as identified by the ileocecal valve, cecal stump, and appendical orifice. All mucosal aspects thoroughly inspected, including a retroflexed examination. Withdrawal time was greater than six minutes. Unfortunately, the terminal ileum could not be intubated despite multiple attempts. Findings were those of a normal cecum, right colon, transverse colon, descending colon. A small cecal polyp was noted, this was biopsy-removed, placed in bottle #1. 
Random biopsies from the cecum obtained, bottle #2; random biopsies from the transverse colon obtained, as well as descending colon obtained, bottle #3. There was an area of inflammation in the proximal sigmoid colon, which was biopsied, placed in bottle #4. There was an area of relative sparing, with normal sigmoid lining, placed in bottle #5, randomly biopsied, and then inflammation again in the distal sigmoid colon and rectum biopsied, bottle #6, suggesting that we may be dealing with Crohn disease, given the relative sparing of the sigmoid colon and junk lesion. Retroflexed showed hemorrhoidal disease. Scope was then withdrawn, patient left in good condition. IMPRESSION: Active flare of Inflammatory Bowel Disease, question of Crohn disease. PLAN: I will have the patient follow up with me, will follow up on histology, follow up on the polyps. She will be put on a steroid taper and make an appointment and hopefully steroids alone will do the job. If not, she may be started on immune suppressive medication, such as azathioprine, or similar. All of this has been reviewed with the patient. All questions answered."""
).toDF("text")

val result = nlpPipeline.fit(data).transform(data)


|features|

|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.00896873, 0.011731416, 0.12154201, 0.1149235, -0.14689414, 0.0103584975, 0.053073216, -0.056412186, -0.05143186, -0.0118978135, -0.12175384, -0.035894137, 0.11812756, 0.094671555, 0.15838866, 0.15260744, -0.004094441, -0.13675772, -0.07472433, -0.035856977, -0.026730005, -0.21840473, 0.029632289, -0.011515695, -0.20407394, -0.07848257, 0.040990185, 0.23028605, 0.077140555, 0.066990435, 0.015219222, -0.10295644, 0.038072545, 0.10786369, 0.121525764, -0.09569349, -0.06309264, 0.2778952, 0.06462455, -0.10851931, -0.14370486, -0.1466352, 0.08354363, -0.078758985, -0.08377953, 0.12384644, -0.23281692, -0.25607574, 0.16399069, -0.07780675, -0.18302177, -0.18325584, -0.12128636, -0.0010129504, 0.0070792097, 0.20506753, 0.034964647, 0.058425985, 0.19572404, -0.103953235, -0.20159312, -0.099047214, -0.07337802, -0.03713124, -0.055443633, 0.11107734, 0.048563413, -0.038048305, -0.020617828, 0.17082842, 0.069010496, 0.08457101, -0.038229663, 0.073144384, -0.092326105, -0.10054428, -4.3286112E-4, -0.046703782, -0.080231875, 0.02524295, 0.01368699, -0.19783853, -0.03501917, 0.13324805, 0.09053264, -0.0958231, -0.0032442473, 0.19218525, -0.027179888, 0.030672349, 0.12848215, -0.014700146, -0.089054875, 0.13839856, -0.15778734, 0.07103226, -0.060303356, 0.20854644, -0.008389737, -0.1473986]}]|

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer() 
    .setInputCols("document")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained()
    .setInputCols(Array("document","token"))
    .setOutputCol("word_embeddings")

val sentence_embeddings = new SentenceEmbeddings() 
    .setInputCols(Array("document", "word_embeddings")) 
    .setOutputCol("sentence_embeddings") 
    .setPoolingStrategy("AVERAGE")

val features_asm = new FeaturesAssembler()
    .setInputCols("sentence_embeddings")
    .setOutputCol("features")

val nlpPipeline = new Pipeline().setStages(Array(
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
))

val data = Seq(
  "Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."
).toDF("text")

val result = nlpPipeline.fit(data).transform(data)


|features|

|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.05989722, 0.10907035, 0.25595385, -0.21656203, 0.20777024, -0.17276664, -0.045803867, -0.14506632, -0.16928527, -0.10008922, 0.18800992, -0.36529806, 0.22592439, -0.118487455, 0.006129823, -0.2674002, 0.37149927, 0.12375746, -0.30488327, 0.2507765, -0.060471725, -0.22705032, 0.39436466, 0.40368417, 0.15569581, 0.083455965, 0.11193783, -0.2783573, -0.23566169, -0.12444999, 0.22503565, 0.43343276, -0.3165808, -0.057086047, 0.050554093, 0.3512633, 0.17572127, 0.19258633, -0.09170296, -0.25344467, 0.018219033, -0.117947415, -0.03234701, -0.1549039, -0.0147800855, 0.076972865, 0.08612865, -0.14120182, -0.18348631, -0.4500436, 0.038739346, 0.12991442, -0.032128494, 0.7483725, -0.09843177, -1.6700389, 0.0060545397, -0.1044135, 1.2469376, 0.32064447, -0.17263599, 0.31999183, 0.0077194544, 0.15370668, 0.59472036, -0.16953614, 0.3042488, 0.25355336, 0.60402286, 0.07441569, -0.12468894, 0.03140718, -0.2630037, -0.37703836, 0.034783553, -0.058904923, 0.022686867, 0.07962498, -0.7945683, -0.21051218, 0.6615892, -0.18747853, -0.25412843, 0.26003888, -1.0803214, -0.026889319, -0.11805089, -0.14200646, -0.019682527, -0.2372327, 0.0090960255, -0.071929, -0.115089305, 0.21781716, -0.3569975, 0.07799677, -0.096894525, -0.34368798, 0.66465, 0.14913023]}]|

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer() 
    .setInputCols("document")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained()
    .setInputCols(Array("document","token"))
    .setOutputCol("word_embeddings")

val sentence_embeddings = new SentenceEmbeddings() 
    .setInputCols(Array("document", "word_embeddings")) 
    .setOutputCol("sentence_embeddings") 
    .setPoolingStrategy("AVERAGE")

val features_asm = new FeaturesAssembler()
    .setInputCols("sentence_embeddings")
    .setOutputCol("features")

val nlpPipeline = new Pipeline().setStages(Array(
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
))

val data = Seq(
  "This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."
).toDF("text")

val result = nlpPipeline.fit(data).transform(data)


|features|

|[{feature_vector, 0, 0, , {sentence -> 0}, [0.02474357, -0.08310143, 0.4801927, -0.070223466, 0.33147717, -0.18737249, -0.048361354, -0.052325998, 0.053252153, -0.0067390013, 0.2836935, -0.25569317, 0.3415577, -0.19251995, 0.051623292, -0.25131556, 0.3472208, -0.036604006, -0.35653928, 0.13225944, 0.18795085, -0.09561886, 0.4695179, 0.22093144, 0.32058474, 0.057281215, 0.082858086, -0.3714214, -0.19219379, -0.26751986, -0.148075, 0.6410107, -0.07821157, -0.06398429, 6.32831E-5, 0.21222909, 0.33145514, 0.2575328, 0.009346781, -0.21482512, -0.22197871, -0.14005142, 0.04592571, -0.2919176, 0.011854073, -0.14047821, 0.22201888, -0.13500921, -0.101019345, -0.31175214, -0.0031539474, 0.07841865, 0.23760447, 0.8622971, -0.21095662, -1.9944092, -0.090888076, -0.45743433, 1.5815442, 0.4848822, -0.12528154, 0.33802572, -0.16203907, -0.09874586, 0.63106954, -0.21860953, 0.39005432, 0.25023165, 0.66769457, -0.13867687, 0.02832079, -0.17432508, -0.05764636, -0.44529453, 0.032839067, -0.2266792, -0.002856281, 0.007823931, -1.0165309, 0.08553613, 0.38090998, 0.011592574, -0.18031952, 0.37968582, -0.77948713, -0.068393, -0.029594865, -0.2165647, 0.1665183, -0.23963346, -0.017649503, -0.24768801, -0.2725593, 0.14533372, -0.36786577, 0.23388086, -0.20129707, -0.33582142, 0.5970527, 0.12596472]}]|


FewShotClassifier

Model

FewShotClassifier annotators specifically target few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data.

These annotators provide a valuable capability for handling scenarios where labeled data is scarce or expensive to obtain. By effectively utilizing limited labeled examples, the few-shot classification approach enables the creation of models that can generalize and classify new instances accurately, even with minimal training data.

The FewShotClassifier is designed to process sentence embeddings as input. It generates category annotations, providing labels along with confidence scores that range from 0 to 1.
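As a rough illustration of how a label can be paired with a confidence score between 0 and 1, the sketch below applies a softmax to per-class scores and keeps the top class. This is an assumption made for illustration only: the source does not say how FewShotClassifier computes its confidence, the raw scores are hypothetical, and the `prediction` dictionary is not the library's annotation format. The label names are taken from the example output on this page.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into values in [0, 1] that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Label names as they appear in the example output on this page.
labels = ["Adult", "Child", "Unknown"]

# Hypothetical per-class scores for one document (not real model output).
raw_scores = [2.0, 0.5, -1.0]

confidences = softmax(raw_scores)

# Keep the label with the highest confidence.
best = max(range(len(labels)), key=lambda i: confidences[i])
prediction = {"result": labels[best], "confidence": confidences[best]}

print(prediction["result"], round(prediction["confidence"], 3))
```

Whatever the underlying model, a score bounded by 0 and 1 lets downstream code apply a simple threshold, for example routing low-confidence predictions to manual review.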

Input Annotator Types: SENTENCE EMBEDDINGS

Output Annotator Type: CATEGORY

Python API: FewShotClassifierModel Scala API: FewShotClassifierModel
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

bert_sent = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

few_shot_classifier = medical.FewShotClassifierModel.pretrained("few_shot_classifier_age_group_sbiobert_cased_mli", "en", "clinical/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    bert_sent,
    few_shot_classifier
])

data = spark.createDataFrame([
    ["""A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30 years, and was diagnosed with hypertension two years ago. After a detailed physical examination, the doctor found a noticeable wheeze on lung auscultation and prescribed a spirometry test, which showed irreversible airway obstruction. The patient was diagnosed with Chronic obstructive pulmonary disease (COPD) caused by smoking."""],
 ["""Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, low iron.  5 months and still no answers from drs. """],
 ["""Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am on reberprozole and librax.My question is whether chronic gastritis is curable or is it a lifetime condition?I am loosing hope because this dull ache is not going away.Please please reply"""]
    ]).toDF("text")

result = clf_Pipeline.fit(data).transform(data)


# Show results
result.select('prediction.result','text').show(truncate=150)

+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|   result|                                                                                                                                                  text|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|  [Adult]|A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30...|
|  [Child]|Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, l...|
|[Unknown]|Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am o...|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val bert_sent = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")

val few_shot_classifier = FewShotClassifierModel.pretrained("few_shot_classifier_age_group_sbiobert_cased_mli", "en", "clinical/models")
    .setInputCols("sentence_embeddings")
    .setOutputCol("prediction") 
    
val clf_Pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    bert_sent, 
    few_shot_classifier))

val data = Seq(
    ("""A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30 years, and was diagnosed with hypertension two years ago. After a detailed physical examination, the doctor found a noticeable wheeze on lung auscultation and prescribed a spirometry test, which showed irreversible airway obstruction. The patient was diagnosed with Chronic obstructive pulmonary disease (COPD) caused by smoking."""),
 ("""Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, low iron.  5 months and still no answers from drs. """),
 ("""Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am on reberprozole and librax.My question is whether chronic gastritis is curable or is it a lifetime condition?I am loosing hope because this dull ache is not going away.Please please reply""")).toDF("text")
  
val result = clf_Pipeline.fit(data).transform(data)

// Show results
result.select("prediction.result", "text").show(20, 150)


+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|   result|                                                                                                                                                  text|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|  [Adult]|A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30...|
|  [Child]|Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, l...|
|[Unknown]|Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am o...|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+

Approach

FewShotClassifier annotators specifically target few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data.

These annotators provide a valuable capability for handling scenarios where labeled data is scarce or expensive to obtain. By effectively utilizing limited labeled examples, the few-shot classification approach enables the creation of models that can generalize and classify new instances accurately, even with minimal training data.

The FewShotClassifier is designed to process sentence embeddings as input. It generates category annotations, providing labels along with confidence scores that range from 0 to 1.

Input Annotator Types: SENTENCE EMBEDDINGS

Output Annotator Type: CATEGORY

Python API: FewShotClassifierApproach Scala API: FewShotClassifierApproach
Show Example

from johnsnowlabs import nlp, medical 

document_asm = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("sentence")

sentence_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["sentence"])\
    .setOutputCol("sentence_embeddings")

graph_builder = medical.TFGraphBuilder()\
    .setModelName("fewshot_classifier")\
    .setInputCols(["sentence_embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder("/tmp")\
    .setGraphFile("log_reg_graph.pb")

few_shot_approach = medical.FewShotClassifierApproach()\
    .setLabelColumn("label")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")\
    .setModelFile("/tmp/log_reg_graph.pb")\
    .setEpochsNumber(10)\
    .setBatchSize(1)\
    .setLearningRate(0.001)

pipeline = nlp.Pipeline(
    stages=[
        document_asm,
        sentence_embeddings,
        graph_builder,
        few_shot_approach
    ])

# train_data is assumed to be a Spark DataFrame with "text" and "label" columns
model = pipeline.fit(train_data)

import spark.implicits._

val document_asm = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("sentence")

val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols("sentence")
    .setOutputCol("sentence_embeddings")

val few_shot_approach = new FewShotClassifierApproach()
    .setLabelColumn("label")
    .setInputCols("sentence_embeddings")
    .setOutputCol("prediction")
    .setModelFile("/tmp/log_reg_graph.pb")
    .setEpochsNumber(10)
    .setBatchSize(1)
    .setLearningRate(0.001) 

val pipeline = new Pipeline().setStages(Array(
    document_asm, 
    sentence_embeddings, 
    few_shot_approach ))

val result = pipeline.fit(train_data).transform(test_data).cache()

Flattener

Model

The Flattener converts annotation results into a format that is easier to use. This annotator produces a DataFrame with flattened and exploded columns containing annotation results, making the information easier to interpret and analyze. It is particularly useful for extracting and organizing the results obtained from Spark NLP pipelines.

Parameters:

  • inputCols: Input annotation columns.
  • cleanAnnotations: Whether to remove the annotation columns (default: True).
  • explodeSelectedFields: Dict mapping each input column to the fields selected from it.
  • flattenExplodedColumns: Whether to flatten exploded columns (default: True).
  • orderByColumn: The column by which the DataFrame should be ordered.
  • orderDescending: Whether to order the DataFrame in descending order (default: True).
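To make "exploded and flattened" concrete, here is a minimal pure-Python sketch of the idea. It is not the Flattener implementation and does not use Spark: the helper names `get_field` and `flatten` are hypothetical, and annotations are modelled as plain dictionaries. The field selectors use the "field as alias" form and the values that appear in the example on this page.

```python
# Annotation-like records; values are taken from the example output on this page.
annotations = [
    {"result": "distress", "begin": 49, "end": 56,
     "metadata": {"entity": "SYMPTOM"}},
    {"result": "arcus senilis", "begin": 196, "end": 208,
     "metadata": {"entity": "DISEASE_SYNDROME_DISORDER"}},
]

# Same "field as alias" selectors as in the setExplodeSelectedFields example.
selected_fields = [
    "result as ner_chunks",
    "begin as begins",
    "end as ends",
    "metadata.entity as entities",
]

def get_field(record, dotted_path):
    """Follow a dotted path such as 'metadata.entity' into nested dicts."""
    value = record
    for key in dotted_path.split("."):
        value = value[key]
    return value

def flatten(records, fields):
    """Return one flat row per annotation, with one column per selected field."""
    specs = [field.split(" as ") for field in fields]
    return [{alias: get_field(record, path) for path, alias in specs}
            for record in records]

rows = flatten(annotations, selected_fields)

# Analogue of orderByColumn="begins" with orderDescending=True.
ordered = sorted(rows, key=lambda row: row["begins"], reverse=True)

for row in ordered:
    print(row)
```

Each nested annotation becomes one row with plain scalar columns, which is the shape that makes the results straightforward to filter, sort, or export.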

See Spark NLP Workshop for more examples of usage.

Input Annotator Types: ANY

Output Annotator Type: NONE

Show Example

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setLabelCasing("upper")

ner_converter = medical.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])

flattener = medical.Flattener()\
    .setInputCols("ner_chunk") \
    .setExplodeSelectedFields({"ner_chunk": ["result as ner_chunks",
                                             "begin as begins",
                                             "end as ends",
                                             "metadata.entity as entities"]})

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    flattener
])

text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
"""

data = spark.createDataFrame([[text]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
result.show(truncate=False)

# result
+----------------------------------+------+----+-------------------------+
|ner_chunks                        |begins|ends|entities                 |
+----------------------------------+------+----+-------------------------+
|distress                          |49    |56  |SYMPTOM                  |
|arcus senilis                     |196   |208 |DISEASE_SYNDROME_DISORDER|
|jugular venous pressure distention|380   |413 |SYMPTOM                  |
|adenopathy                        |428   |437 |SYMPTOM                  |
|tender                            |514   |519 |SYMPTOM                  |
|fullness                          |540   |547 |SYMPTOM                  |
|edema                             |665   |669 |SYMPTOM                  |
|cyanosis                          |679   |686 |VS_FINDING               |
|clubbing                          |692   |699 |SYMPTOM                  |
+----------------------------------+------+----+-------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val clinicalNer = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
  .setLabelCasing("upper")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setWhiteList(Array("SYMPTOM", "VS_FINDING", "DISEASE_SYNDROME_DISORDER", "ADMISSION_DISCHARGE", "PROCEDURE"))

val flattener = new Flattener()
  .setInputCols(Array("ner_chunk"))
  .setExplodeSelectedFields(Map("ner_chunk" -> Array("result as ner_chunks", "begin as begins", "end as ends", "metadata.entity as entities")))

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  wordEmbeddings,
  clinicalNer,
  nerConverter,
  flattener
))

val text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
"""

val data = Seq(text).toDF("text")

val result = nlpPipeline.fit(data).transform(data)
result.show(truncate = false)

// result
+----------------------------------+------+----+-------------------------+
|ner_chunks                        |begins|ends|entities                 |
+----------------------------------+------+----+-------------------------+
|distress                          |49    |56  |SYMPTOM                  |
|arcus senilis                     |196   |208 |DISEASE_SYNDROME_DISORDER|
|jugular venous pressure distention|380   |413 |SYMPTOM                  |
|adenopathy                        |428   |437 |SYMPTOM                  |
|tender                            |514   |519 |SYMPTOM                  |
|fullness                          |540   |547 |SYMPTOM                  |
|edema                             |665   |669 |SYMPTOM                  |
|cyanosis                          |679   |686 |VS_FINDING               |
|clubbing                          |692   |699 |SYMPTOM                  |
+----------------------------------+------+----+-------------------------+

GenericClassifier

Creates a generic single-label classifier which uses pre-generated TensorFlow graphs. The model operates on FEATURE_VECTOR annotations, which must be produced beforehand by the FeaturesAssembler.

Parameters:

  • multiClass (Boolean): Whether to return all classes instead of only the one with the highest score (default: False).
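
The difference between the two output modes can be sketched in a few lines of plain Python. This is only an illustration of the behavior described above: `select_labels` is a hypothetical helper and the scores are made-up numbers, not the output of a real model.

```python
# Illustrative only: `select_labels` is a hypothetical helper, and the
# scores below are made-up numbers, not output of a real model.

def select_labels(scores, multi_class=False):
    """Return (label, score) pairs from a {label: score} mapping.

    multi_class=False -> only the highest-scoring label.
    multi_class=True  -> every label, ordered by decreasing score.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if multi_class else ranked[:1]


scores = {"True": 0.83, "False": 0.17}
print(select_labels(scores))                    # single best label
print(select_labels(scores, multi_class=True))  # all labels with scores
```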

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericClassifierModel Scala API: GenericClassifierModel Notebook: GenericClassifierModelNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

generic_classifier = medical.GenericClassifierModel.pretrained("genericclassifier_sdoh_economics_binary_sbiobert_cased_mli", 'en', 'clinical/models')\
    .setInputCols(["features"])\
    .setOutputCol("classes")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        features_asm,
        generic_classifier
])

text = """Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker, 1PPD for 25years. He drinks to beers/night, but has not had any alcohol in past 4 days. No IVDU."""

df = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(df).transform(df)
result.select("text", "classes.result").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|text                                                                                                                                                                                                  |result|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker, 1PPD for 25years. He drinks to beers/night, but has not had any alcohol in past 4 days. No IVDU.|[True]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document")

val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
  .setInputCols("document") 
  .setOutputCol("sentence_embeddings") 

val features_asm = new FeaturesAssembler()
  .setInputCols("sentence_embeddings")
  .setOutputCol("features") 

val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_economics_binary_sbiobert_cased_mli","en","clinical/models")
  .setInputCols(Array("features")) 
  .setOutputCol("classes") 

val pipeline = new Pipeline().setStages(Array( 
                                            document_assembler, 
                                            sentence_embeddings, 
                                            features_asm, 
                                            generic_classifier )) 

val text = "Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker, 1PPD for 25years. He drinks to beers/night, but has not had any alcohol in past 4 days. No IVDU."

val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)
result.select("text", "classes.result").show(false)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|text                                                                                                                                                                                                  |result|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker, 1PPD for 25years. He drinks to beers/night, but has not had any alcohol in past 4 days. No IVDU.|[True]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+

Trains a TensorFlow model for generic classification of feature vectors. It takes FEATURE_VECTOR annotations from FeaturesAssembler as input, classifies them and outputs CATEGORY annotations. Please see the Parameters section for required training parameters.

Parameters:

  • batchSize: (int) Batch size

  • dropout: (float) Dropout coefficient

  • epochsNumber: (int) Maximum number of epochs to train

  • featureScaling: (str) Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling)

  • fixImbalance: (boolean) Fix the imbalance in the training set by replicating examples of underrepresented categories

  • labelColumn: (str) Column with label per each document

  • learningRate: (float) Learning rate

  • modelFile: (str) Location of file of the model used for classification

  • multiClass: (boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.

  • outputLogsPath: (str) Folder path to save training logs. If no path is specified, the logs won’t be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).

  • validationSplit: (float) The proportion of the training dataset to be used as a validation set. The model is validated against this set after each epoch, and the set is not used for training. The value should be between 0.0 and 1.0.
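
As a rough illustration of the two featureScaling options, the sketch below applies the textbook z-score and min-max formulas to a single feature column in plain Python. `scale_feature` is a hypothetical helper written for this page; the library's internal implementation may differ in details such as how the standard deviation is estimated.

```python
# Illustrative only: textbook z-score and min-max scaling of one feature
# column. `scale_feature` is a hypothetical helper; the library's internal
# implementation may differ in details.
from statistics import mean, pstdev


def scale_feature(values, method=""):
    """Scale a list of numbers with 'zscore', 'minmax', or '' (no scaling)."""
    if method == "zscore":
        mu, sigma = mean(values), pstdev(values)
        # Guard against a constant column, where the deviation is zero.
        return [(v - mu) / sigma if sigma else 0.0 for v in values]
    if method == "minmax":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
    if method == "":
        return list(values)
    raise ValueError(f"unknown scaling method: {method!r}")


feature = [2.0, 4.0, 6.0, 8.0]
print(scale_feature(feature, "zscore"))  # centered on 0, unit variance
print(scale_feature(feature, "minmax"))  # rescaled to the range [0, 1]
print(scale_feature(feature))            # unchanged
```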

For a more extensive example please see the Spark NLP Workshop.

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericClassifierApproach Scala API: GenericClassifierApproach Notebook: GenericClassifierApproachNotebook
Show Example

from johnsnowlabs import nlp, medical

features_asm = medical.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
    .setOutputCol("features")

gen_clf = medical.GenericClassifierApproach() \
    .setLabelColumn("target") \
    .setInputCols(["features"]) \
    .setOutputCol("prediction") \
    .setModelFile("/path/to/graph_file.pb") \
    .setEpochsNumber(50) \
    .setBatchSize(100) \
    .setFeatureScaling("zscore") \
    .setLearningRate(0.001) \
    .setFixImbalance(True) \
    .setOutputLogsPath("logs") \
    .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = nlp.Pipeline().setStages([
    features_asm,
    gen_clf
])

clf_model = pipeline.fit(data)

from johnsnowlabs import nlp, finance

features_asm = finance.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
    .setOutputCol("features")

gen_clf = finance.GenericClassifierApproach() \
    .setLabelColumn("target") \
    .setInputCols(["features"]) \
    .setOutputCol("prediction") \
    .setModelFile("/path/to/graph_file.pb") \
    .setEpochsNumber(50) \
    .setBatchSize(100) \
    .setFeatureScaling("zscore") \
    .setLearningRate(0.001) \
    .setFixImbalance(True) \
    .setOutputLogsPath("logs") \
    .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = nlp.Pipeline().setStages([
    features_asm,
    gen_clf
])

clf_model = pipeline.fit(data)

from johnsnowlabs import nlp, legal

features_asm = legal.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
    .setOutputCol("features")

gen_clf = legal.GenericClassifierApproach() \
    .setLabelColumn("target") \
    .setInputCols(["features"]) \
    .setOutputCol("prediction") \
    .setModelFile("/path/to/graph_file.pb") \
    .setEpochsNumber(50) \
    .setBatchSize(100) \
    .setFeatureScaling("zscore") \
    .setLearningRate(0.001) \
    .setFixImbalance(True) \
    .setOutputLogsPath("logs") \
    .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = nlp.Pipeline().setStages([
    features_asm,
    gen_clf
])

clf_model = pipeline.fit(data)

import spark.implicits._

val features_asm = new FeaturesAssembler()
  .setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
  .setOutputCol("features")

val gen_clf = new GenericClassifierApproach()
  .setLabelColumn("target")
  .setInputCols("features")
  .setOutputCol("prediction")
  .setModelFile("/path/to/graph_file.pb")
  .setEpochsNumber(50)
  .setBatchSize(100)
  .setFeatureScaling("zscore")
  .setlearningRate(0.001f)
  .setFixImbalance(true)
  .setOutputLogsPath("logs")
  .setValidationSplit(0.2f) // keep 20% of the data for validation purposes

val pipeline = new Pipeline().setStages(Array(
  features_asm,
  gen_clf
))

val clf_model = pipeline.fit(data)

GenericLogRegClassifier

Approach

GenericLogRegClassifier is a derivative of GenericClassifier which implements a multinomial logistic regression. This is a single-layer neural network with the logistic function at the output. The model takes FEATURE_VECTOR annotations as input and outputs CATEGORY annotations with labels and corresponding confidence scores between 0 and 1.
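
A minimal sketch of what such a single-layer model computes at prediction time is shown below. It uses softmax, the usual multi-class generalization of the logistic function; whether the library applies softmax or per-class logistic functions is an assumption here, and the weights, biases, labels, and the helper `predict_proba` are made up for illustration.

```python
# Illustrative only: forward pass of a single-layer multinomial logistic
# regression. Weights, biases, and labels are made-up values; `predict_proba`
# is a hypothetical helper, not the library's implementation.
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, weights, biases, labels):
    """One linear layer followed by softmax; returns {label: confidence}."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    return dict(zip(labels, softmax(logits)))


labels = ["neg", "pos"]
weights = [[0.5, -1.0], [-0.5, 1.0]]  # one row of weights per label
biases = [0.0, 0.0]

confidences = predict_proba([1.0, 2.0], weights, biases, labels)
print(confidences)
```

Every confidence lies between 0 and 1, and the label with the largest value would be the predicted category.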

Parameters:

  • LabelColumn: This parameter sets the name of the column in your input data that contains the labels (categories) for the classification task. The classifier will use this column to learn from the data and make predictions.

  • ModelFile: This parameter specifies the path to the model file used by the logistic regression classifier. It should be a protobuf (.pb) file containing the model graph, such as the one generated by TFGraphBuilder in the example below.

  • EpochsNumber: This parameter sets the number of epochs (iterations) the classifier will go through during the training process. An epoch represents one complete pass through the entire training dataset.

  • BatchSize: This parameter sets the batch size used during training. The training data is divided into batches, and the model’s weights are updated after processing each batch. A larger batch size may speed up training, but it requires more memory.

  • LearningRate: This parameter sets the learning rate for the optimization algorithm used during training. The learning rate determines how much the model’s weights are updated based on the computed gradients. A higher learning rate may lead to faster convergence but risks overshooting the optimal solution.

  • OutputLogsPath: This parameter specifies the path where the logs related to the training process will be stored. These logs can include information such as training loss, accuracy, and other metrics.

  • Dropout: Dropout is a regularization technique used to prevent overfitting in neural networks. This parameter sets the dropout rate, which determines the probability that each neuron’s output will be temporarily ignored during training.

  • FixImbalance: Imbalance refers to the situation when some classes have significantly more training examples than others. Setting this parameter to True indicates that the classifier will handle class imbalance during training to help ensure that the model doesn’t become biased towards the majority class.

  • ValidationSplit: This parameter sets the fraction of the training data, between 0 and 1, that is held out for validation during training. The remaining data is used for actual training. It is commented out in the example below.
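
How EpochsNumber, BatchSize, and LearningRate interact can be sketched with a tiny mini-batch gradient descent loop. The code below is a plain-Python toy for a binary logistic regression on made-up one-dimensional data; the `train` function is hypothetical and is not how the annotator is implemented.

```python
# Illustrative only: mini-batch gradient descent for a tiny binary logistic
# regression. Shows how epochs, batch size, and learning rate interact.
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(xs, ys, epochs_number, batch_size, learning_rate):
    """Fit weight w and bias b; returns (w, b, number_of_updates)."""
    w, b, updates = 0.0, 0.0, 0
    for _ in range(epochs_number):               # one epoch = one full pass
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Average gradient of the log loss over the batch.
            errors = [sigmoid(w * x + b) - y for x, y in zip(batch_x, batch_y)]
            grad_w = sum(e * x for e, x in zip(errors, batch_x)) / len(batch_x)
            grad_b = sum(errors) / len(batch_x)
            w -= learning_rate * grad_w           # one update per batch
            b -= learning_rate * grad_b
            updates += 1
    return w, b, updates


xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, b, updates = train(xs, ys, epochs_number=20, batch_size=2, learning_rate=0.5)
print(f"updates={updates}, w={w:.3f}, b={b:.3f}")
```

With 6 examples and a batch size of 2, each epoch performs 3 weight updates, so 20 epochs give 60 updates in total. A larger batch size means fewer, smoother updates per epoch, while the learning rate scales the size of each step.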

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericLogRegClassifierApproach Scala API: GenericLogRegClassifierApproach Notebook: GenericLogRegClassifierApproachNotebook
Show Example

from johnsnowlabs import nlp, medical

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

graph_folder = "gc_graph"
log_folder = "logs"  # assumed folder for training logs; adjust as needed

gc_logreg_graph_builder = medical.TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

gen_clf = medical.GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_logreg_graph_builder,
    gen_clf])

import spark.implicits._
  
val features_asm = new FeaturesAssembler()
  .setInputCols("sentence_embeddings")
  .setOutputCol("feature_vector")

val gc_logreg_graph_builder = new TFGraphBuilder()
  .setModelName("logreg_classifier")
  .setInputCols("feature_vector")
  .setLabelColumn("category")
  .setGraphFolder("gc_graph")
  .setGraphFile("log_reg_graph.pb")

val log_folder = "logs" // assumed folder for training logs; adjust as needed

val gen_clf = new GenericLogRegClassifierApproach()
  .setLabelColumn("category")
  .setInputCols("feature_vector")
  .setOutputCol("prediction")
  .setModelFile("gc_graph/log_reg_graph.pb")
  .setEpochsNumber(20)
  .setBatchSize(128)
  .setLearningRate(0.01)
  .setOutputLogsPath(log_folder)
  .setDropout(0.1)
  .setFixImbalance(true) // .setValidationSplit(0.1)

val clf_Pipeline = new Pipeline().setStages(Array(features_asm, gc_logreg_graph_builder, gen_clf))

GenericSVMClassifier

Creates a generic single-label classifier which uses pre-generated TensorFlow graphs. The model operates on FEATURE_VECTOR annotations, which must be produced beforehand by the FeaturesAssembler.

Parameters:

  • featureScaling: Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling) (default: ‘’).

  • multiClass: Whether to return all labels with their scores instead of only the label with the highest confidence score (default: False).

  • inputCols: Names of the input annotation columns, if renamed (default: [‘features’]).

  • outputCol: Name of the output annotation column; can be left at its default (default: class).

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericSVMClassifier Scala API: GenericSVMClassifier Notebook: GenericSVMClassifierNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

generic_classifier = medical.GenericSVMClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("class")

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    features_asm,
    generic_classifier])

data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""], 
 ["""I feel a bit drowsy & have a little blurred vision after taking an insulin"""]]).toDF("text")

result = clf_Pipeline.fit(data).transform(data)
# The table below shows predictions on a larger sample DataFrame, not on the two rows above.

+----------------------------------------------------------------------------------------------------+-------+
|                                                                                                text| result|
+----------------------------------------------------------------------------------------------------+-------+
|                       Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False]|
|                             Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...|[False]|
|However, given the clinically significant result to the interaction between tolazoline and cimeti...| [True]|
|                                              How much do novel antipsychotics benefit the patients?|[False]|
| We hypothesize that during interferon therapy, melanocytes may produce more melanin pigment in t...|[False]|
|They seemed to involve multiple aetiological factors, such as autoimmune thyroid disease, the tox...|[False]|
|               Two days after completing this regimen, the patient developed a rash with blistering.| [True]|
|A diagnosis of masked theophylline poisoning should be considered in similar situations involving...|[False]|
| The overall response rate of these 24 refractory lymphomas to gemcitabine-containing regimens wa...|[False]|
|Development of sarcoidosis during interferon alpha 2b and ribavirin combination therapy for chron...| [True]|
|A patient with coccidioidal meningitis was treated with intrathecally administered amphotericin B...|[False]|
|                                                Renal failure associated with the use of dextran-40.|[False]|
| However, with increased experience in applying BCG, the side effects now appear to be less promi...|[False]|
|                        Hepatotoxicity after high-dose methylprednisolone for demyelinating disease.| [True]|
| Histopathological findings included signs of orthokeratotic hyperkeratosis, moderate follicular ...| [True]|
| Acute spontaneous TLS is rare, and it has been described in leukemia and lymphoma and in some pa...|[False]|
|We present a fatal case of subacute methanol toxicity with associated diffuse brain involvement, ...| [True]|
| The reaction was thought to be triggered by the combination of radiation and epidermal growth fa...|[False]|
+----------------------------------------------------------------------------------------------------+-------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")

val sentenceEmbeddings = new SentenceEmbeddings()
  .setInputCols(Array("document", "word_embeddings"))
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

val featuresAssembler = new FeaturesAssembler()
  .setInputCols(Array("sentence_embeddings"))
  .setOutputCol("features")

val genericClassifier = GenericSVMClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")
  .setInputCols("features")
  .setOutputCol("class")

val pipeline = new Pipeline()
  .setStages(Array(
  documentAssembler,
  tokenizer,
  wordEmbeddings,
  sentenceEmbeddings,
  featuresAssembler,
  genericClassifier))

val data = Seq(
  ("""None of the patients required treatment for the overdose."""),
  ("""I feel a bit drowsy & have a little blurred vision after taking an insulin""")
)

val df = data.toDF("text")

val result = pipeline.fit(df).transform(df)


+----------------------------------------------------------------------------------------------------+-------+
|                                                                                                text| result|
+----------------------------------------------------------------------------------------------------+-------+
|                       Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False]|
|                             Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...|[False]|
|However, given the clinically significant result to the interaction between tolazoline and cimeti...| [True]|
|                                              How much do novel antipsychotics benefit the patients?|[False]|
| We hypothesize that during interferon therapy, melanocytes may produce more melanin pigment in t...|[False]|
|They seemed to involve multiple aetiological factors, such as autoimmune thyroid disease, the tox...|[False]|
|               Two days after completing this regimen, the patient developed a rash with blistering.| [True]|
|A diagnosis of masked theophylline poisoning should be considered in similar situations involving...|[False]|
| The overall response rate of these 24 refractory lymphomas to gemcitabine-containing regimens wa...|[False]|
|Development of sarcoidosis during interferon alpha 2b and ribavirin combination therapy for chron...| [True]|
|A patient with coccidioidal meningitis was treated with intrathecally administered amphotericin B...|[False]|
|                                                Renal failure associated with the use of dextran-40.|[False]|
| However, with increased experience in applying BCG, the side effects now appear to be less promi...|[False]|
|                        Hepatotoxicity after high-dose methylprednisolone for demyelinating disease.| [True]|
| Histopathological findings included signs of orthokeratotic hyperkeratosis, moderate follicular ...| [True]|
| Acute spontaneous TLS is rare, and it has been described in leukemia and lymphoma and in some pa...|[False]|
|We present a fatal case of subacute methanol toxicity with associated diffuse brain involvement, ...| [True]|
| The reaction was thought to be triggered by the combination of radiation and epidermal growth fa...|[False]|
+----------------------------------------------------------------------------------------------------+-------+

GenericSVMClassifier is a derivative of GenericClassifier which implements SVM (Support Vector Machine) classification. The input to the model is FeatureVector and the output is category annotations with labels and corresponding confidence scores. The scores are standardized using the logistic function so that they vary between 0 and 1.
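
The standardization step mentioned above can be illustrated with the logistic function itself. In this sketch the raw scores are made-up numbers standing in for SVM decision values; how the library computes them internally is not shown here.

```python
# Illustrative only: squashing raw SVM decision scores into (0, 1) with the
# logistic function. The raw scores are made-up values.
import math


def logistic(score):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


raw_scores = {"neg": -1.2, "pos": 2.3}  # signed decision values per label
confidences = {label: logistic(s) for label, s in raw_scores.items()}
print(confidences)
```

A raw score of 0 maps to 0.5, large positive scores approach 1, and large negative scores approach 0, so the ordering of the labels is preserved.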

Parameters:

  • batchSize: (int) Batch size

  • dropout: (float) Dropout coefficient

  • epochsNumber: (int) Maximum number of epochs to train

  • featureScaling: (str) Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling)

  • fixImbalance: (boolean) Fix the imbalance in the training set by replicating examples of underrepresented categories

  • labelColumn: (str) Column with label per each document

  • learningRate: (float) Learning rate

  • modelFile: (str) Location of file of the model used for classification

  • multiClass: (boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.

  • outputLogsPath: (str) Folder path to save training logs. If no path is specified, the logs won’t be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).

  • validationSplit: (float) The proportion of the training dataset to be used as a validation set. The model is validated against this set after each epoch, and the set is not used for training. The value should be between 0.0 and 1.0.
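
The ideas behind fixImbalance and validationSplit can be sketched in plain Python as follows. Both helpers are hypothetical and written only for illustration; in particular, splitting before balancing and the use of a fixed shuffle seed are assumptions, not documented behavior of the annotator.

```python
# Illustrative only: hypothetical helpers showing (1) balancing by
# replicating minority-class examples and (2) holding out a validation split.
import random
from collections import Counter


def fix_imbalance(rows, label_key="category"):
    """Replicate rows of smaller classes until every class matches the largest."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Cycle through the group's rows until it reaches the target size.
        balanced.extend(group[i % len(group)] for i in range(target))
    return balanced


def split_validation(rows, validation_split, seed=0):
    """Shuffle, then hold out `validation_split` of the rows for validation."""
    if not 0.0 <= validation_split <= 1.0:
        raise ValueError("validation_split must be between 0.0 and 1.0")
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)


rows = [{"text": f"doc {i}", "category": "neg"} for i in range(8)]
rows += [{"text": f"ade {i}", "category": "pos"} for i in range(2)]

train_rows, val_rows = split_validation(rows, validation_split=0.2)
balanced = fix_imbalance(train_rows)
print(len(train_rows), len(val_rows))
print(Counter(row["category"] for row in balanced))
```

Only the training portion is balanced in this sketch, so the held-out rows keep their original class distribution.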

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericSVMClassifier Scala API: GenericSVMClassifier Notebook: GenericSVMClassifierNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = nlp.Pipeline(stages = [
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
])

trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")

graph_folder = "graph_folder"
log_folder = "training_logs"  # assumed folder for training logs; adjust as needed

gc_svm_graph_builder = medical.TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(2)\
    .setBatchSize(128)\
    .setLearningRate(0.015)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)
    # .setValidationSplit(0.1)  # optional: hold out 10% of the training data for validation

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])

model = clf_Pipeline.fit(trainingData_with_embeddings)
model.stages[-1].write().overwrite().save('/model_path/model_name')

#sample training data
    text	                                            category
0	Clioquinol intoxication occurring in the trea...	neg
1	"Retinoic acid syndrome" was prevented with s...	neg
2	BACKGROUND: External beam radiation therapy o...	neg
3	Although the enuresis ceased, she developed t...	neg
4	A 42-year-old woman had uneventful bilateral ...	neg

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  
val sentenceEmbeddings = BertSentenceEmbeddings
  .pretrained()
  .setInputCols(Array("document"))
  .setOutputCol("sentence_embedding")

val featuresAssembler = new FeaturesAssembler()
  .setInputCols(Array("sentence_embedding"))
  .setOutputCol("feature_vector")

val svmClassifier = new GenericSVMClassifierApproach()
  .setInputCols("feature_vector")
  .setOutputCol("prediction")
  .setLabelColumn("label")
  .setModelFile("src/test/resources/classification/svm_graph.pb") 
  .setEpochsNumber(10)
  .setBatchSize(1)
  .setMultiClass(false)
  .setLearningRate(0.01f)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceEmbeddings,
  featuresAssembler,
  svmClassifier,
))

val model = pipeline.fit(trainingData)

IOBTagger

Model

IOBTagger takes chunk-based outputs, such as those of NerConverter and ChunkMerger, and converts them back into token-level Named Entity Recognition (NER) tags in IOB format. Each token receives a B- (beginning) or I- (inside) tag carrying the entity label of the chunk that contains it, or an outside tag if it belongs to no chunk. For example, the output columns of NerConverter and Tokenizer can be used as its inputs.
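The conversion can be illustrated without Spark. The following is a minimal sketch, not the annotator's implementation: iob_tags is a hypothetical helper, tokens and chunks are assumed to be plain tuples with inclusive character offsets, and the outside label is a parameter because the convention may differ between outputs. The offsets are taken from the medical example on this page.

```python
def iob_tags(tokens, chunks, outside="O"):
    """Assign an IOB tag to each token from chunk spans.

    tokens: list of (begin, end, word); chunks: list of (begin, end, label).
    All offsets are inclusive character positions.
    Illustrative only: a hypothetical helper, not the IOBTagger implementation.
    """
    tagged = []
    for begin, end, word in tokens:
        tag = outside
        for chunk_begin, chunk_end, label in chunks:
            if chunk_begin <= begin and end <= chunk_end:
                # The first token of a chunk gets B-, the following ones I-.
                prefix = "B" if begin == chunk_begin else "I"
                tag = f"{prefix}-{label}"
                break
        tagged.append((word, tag))
    return tagged


tokens = [
    (16, 25, "prescribed"), (27, 27, "1"), (29, 35, "capsule"), (37, 38, "of"),
    (40, 44, "Advil"), (46, 47, "10"), (49, 50, "mg"), (52, 54, "for"),
    (56, 56, "5"), (58, 61, "days"), (63, 65, "and"),
]
chunks = [(27, 50, "DRUG"), (52, 61, "DURATION")]

tags = iob_tags(tokens, chunks)
for word, tag in tags:
    print(f"{word:<12}{tag}")
```

The sketch matches tokens to chunks purely by character offsets, which is why the annotator needs both the TOKEN and the CHUNK columns as inputs.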

Input Annotator Types: TOKEN, CHUNK

Output Annotator Type: NAMED_ENTITY

Python API: IOBTagger Scala API: IOBTagger Notebook: IOBTaggerNotebook

from johnsnowlabs import nlp, medical 
# Pipeline stages are defined where NER is done. NER is converted to chunks.

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embs")

nerModel = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
  .setInputCols(["sentence", "token", "embs"])\
  .setOutputCol("ner")

nerConverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")

# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = medical.IOBTagger()\
  .setInputCols(["token", "ner_chunk"])\
  .setOutputCol("ner_label")

pipeline = nlp.Pipeline(stages=[docAssembler,
                            sentenceDetector,
                            tokenizer,
                            embeddings,
                            nerModel,
                            nerConverter,
                            iobTagger])

text = "The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO."
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# chunk level result
result.selectExpr("explode(ner_chunk) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.result as ner_chunk",
              "a.metadata.entity as ner_label").show(50, False)

+-----+---+-------------------------------------------+---------+
|begin|end|ner_chunk                                  |ner_label|
+-----+---+-------------------------------------------+---------+
|27   |50 |1 capsule of Advil 10 mg                   |DRUG     |
|52   |61 |for 5 days                                 |DURATION |
|67   |109|magnesium hydroxide 100mg/1ml suspension PO|DRUG     |
+-----+---+-------------------------------------------+---------+

# token level result
result.selectExpr("explode(ner_label) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.metadata.word as word",
              "a.result as chunk").show(50, False)

+-----+---+----------+----------+
|begin|end|word      |chunk     |
+-----+---+----------+----------+
|0    |2  |The       |0         |
|4    |10 |patient   |0         |
|12   |14 |was       |0         |
|16   |25 |prescribed|0         |
|27   |27 |1         |B-DRUG    |
|29   |35 |capsule   |I-DRUG    |
|37   |38 |of        |I-DRUG    |
|40   |44 |Advil     |I-DRUG    |
|46   |47 |10        |I-DRUG    |
|49   |50 |mg        |I-DRUG    |
|52   |54 |for       |B-DURATION|
|56   |56 |5         |I-DURATION|
|58   |61 |days      |I-DURATION|
|63   |65 |and       |0         |
|67   |75 |magnesium |B-DRUG    |
|77   |85 |hydroxide |I-DRUG    |
|87   |95 |100mg/1ml |I-DRUG    |
|97   |106|suspension|I-DRUG    |
|108  |109|PO        |I-DRUG    |
|110  |110|.         |0         |
+-----+---+----------+----------+

from johnsnowlabs import nlp, finance
# Pipeline stages are defined where NER is done. NER is converted to chunks.

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embs")

nerModel = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
  .setInputCols(["sentence", "token", "embs"])\
  .setOutputCol("ner")

nerConverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")

# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = finance.IOBTagger()\
  .setInputCols(["token", "ner_chunk"])\
  .setOutputCol("ner_label")

pipeline = nlp.Pipeline(stages=[docAssembler,
                            sentenceDetector,
                            tokenizer,
                            embeddings,
                            nerModel,
                            nerConverter,
                            iobTagger])

text = """In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)")."""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)  

# chunk level result
result.selectExpr("explode(ner_chunk) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.result as ner_chunk",
              "a.metadata.entity as ner_label").show(50, False)

+-----+---+------------------------------+---------+
|begin|end|ner_chunk                     |ner_label|
+-----+---+------------------------------+---------+
|39   |68 |Spell Security Private Limited|ORG      |
|86   |99 |Spell Security                |ALIAS    |
|129  |138|Compliance                    |PRODUCT  |
|150  |166|Policy Compliance             |PRODUCT  |
|169  |170|PC                            |ALIAS    |
+-----+---+------------------------------+---------+

# token level result
result.selectExpr("explode(ner_label) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.metadata.word as word",
              "a.result as chunk").show(50, False)

+-----+---+------------+---------+
|begin|end|word        |chunk    |
+-----+---+------------+---------+
|0    |1  |In          |0        |
|3    |6  |2020        |0        |
|7    |7  |,           |0        |
|9    |10 |we          |0        |
|12   |19 |acquired    |0        |
|21   |27 |certain     |0        |
|29   |34 |assets      |0        |
|36   |37 |of          |0        |
|39   |43 |Spell       |B-ORG    |
|45   |52 |Security    |I-ORG    |
|54   |60 |Private     |I-ORG    |
|62   |68 |Limited     |I-ORG    |
|70   |70 |(           |0        |
|71   |74 |also        |0        |
|76   |80 |known       |0        |
|82   |83 |as          |0        |
|85   |85 |"           |0        |
|86   |90 |Spell       |B-ALIAS  |
|92   |99 |Security    |I-ALIAS  |
|100  |102|").         |0        |
|104  |107|More        |0        |
|109  |120|specifically|0        |
|121  |121|,           |0        |
|123  |127|their       |0        |
|129  |138|Compliance  |B-PRODUCT|
|140  |146|product     |0        |
|148  |148|-           |0        |
|150  |155|Policy      |B-PRODUCT|
|157  |166|Compliance  |I-PRODUCT|
|168  |168|(           |0        |
|169  |170|PC          |B-ALIAS  |
|171  |174|)").        |0        |
+-----+---+------------+---------+

from johnsnowlabs import nlp, legal
# Pipeline stages are defined where NER is done. NER is converted to chunks.

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embs")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
  .setInputCols(["sentence", "token", "embs"])\
  .setOutputCol("ner")

nerConverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")

# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = legal.IOBTagger()\
  .setInputCols(["token", "ner_chunk"])\
  .setOutputCol("ner_label")

pipeline = nlp.Pipeline(stages=[docAssembler, 
                            sentenceDetector, 
                            tokenizer, 
                            embeddings, 
                            ner_model, 
                            nerConverter, 
                            iobTagger])

text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
"""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df) 

# chunk level result
result.selectExpr("explode(ner_chunk) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.result as ner_chunk",
              "a.metadata.entity as ner_label").show(50, False)

+-----+---+-----------------------------------+---------+
|begin|end|ner_chunk                          |ner_label|
+-----+---+-----------------------------------+---------+
|141  |165|Armstrong Flooring, Inc.,          |ORG      |
|192  |197|Seller                             |ALIAS    |
|205  |221|AFI Licensing LLC                  |ORG      |
|263  |271|Licensing                          |ALIAS    |
|292  |297|Seller                             |ALIAS    |
|301  |307|Arizona                            |ALIAS    |
|315  |330|AHF Holding, Inc                   |ORG      |
|399  |403|Buyer                              |ALIAS    |
|411  |445|Armstrong Hardwood Flooring Company|ORG      |
|478  |484|Company                            |ALIAS    |
|505  |509|Buyer                              |ALIAS    |
|516  |529|Buyer Entities                     |ALIAS    |
|542  |548|Arizona                            |ALIAS    |
|574  |587|Buyer Entities                     |ALIAS    |
|611  |615|Party                              |ALIAS    |
|641  |647|Parties                            |ALIAS    |
+-----+---+-----------------------------------+---------+

# token level result
result.selectExpr("explode(ner_label) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.metadata.word as word",
              "a.result as chunk").show(50, False)

+-----+---+------------+-------+
|begin|end|word        |chunk  |
+-----+---+------------+-------+
|0    |3  |This        |0      |
|5    |16 |INTELLECTUAL|0      |
|18   |25 |PROPERTY    |0      |
|27   |35 |AGREEMENT   |0      |
|37   |37 |(           |0      |
|38   |41 |this        |0      |
|43   |43 |"           |0      |
|44   |52 |Agreement   |0      |
|53   |55 |"),         |0      |
|57   |61 |dated       |0      |
|63   |64 |as          |0      |
|66   |67 |of          |0      |
|69   |76 |December    |0      |
|78   |79 |31          |0      |
|80   |80 |,           |0      |
|82   |85 |2018        |0      |
|87   |87 |(           |0      |
|88   |90 |the         |0      |
|92   |92 |"           |0      |
|93   |101|Effective   |0      |
|103  |106|Date        |0      |
|107  |108|")          |0      |
|110  |111|is          |0      |
|113  |119|entered     |0      |
|121  |124|into        |0      |
|126  |127|by          |0      |
|129  |131|and         |0      |
|133  |139|between     |0      |
|141  |149|Armstrong   |B-ORG  |
|151  |158|Flooring    |I-ORG  |
|159  |159|,           |I-ORG  |
|161  |163|Inc         |I-ORG  |
|164  |165|.,          |I-ORG  |
|167  |167|a           |0      |
|169  |176|Delaware    |0      |
|178  |188|corporation |0      |
|190  |191|("          |0      |
|192  |197|Seller      |B-ALIAS|
|198  |199|")          |0      |
|201  |203|and         |0      |
|205  |207|AFI         |B-ORG  |
|209  |217|Licensing   |I-ORG  |
|219  |221|LLC         |I-ORG  |
|222  |222|,           |0      |
|224  |224|a           |0      |
|226  |233|Delaware    |0      |
|235  |241|limited     |0      |
|243  |251|liability   |0      |
|253  |259|company     |0      |
|261  |262|("          |0      |
+-----+---+------------+-------+
only showing top 50 rows

import spark.implicits._

// Pipeline stages are defined where NER is done. NER is converted to chunks. 
val docAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
  .setInputCols("document") 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols("sentence") 
  .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embs") 

val nerModel = MedicalNerModel.pretrained("ner_posology_greedy","en","clinical/models")
  .setInputCols(Array("sentence","token","embs")) 
  .setOutputCol("ner") 

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 
  
// Define the IOB tagger, which needs tokens and chunks as input. Show results.
val iobTagger = new IOBTagger()
  .setInputCols(Array("token","ner_chunk")) 
  .setOutputCol("ner_label") 

val pipeline = new Pipeline().setStages(Array(
                                              docAssembler,
                                              sentenceDetector,
                                              tokenizer,
                                              embeddings, 
                                              nerModel,
                                              nerConverter,
                                              iobTagger)) 

val text = "The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO." 
val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)

// chunk level result
+-----+---+-------------------------------------------+---------+
|begin|end|ner_chunk                                  |ner_label|
+-----+---+-------------------------------------------+---------+
|27   |50 |1 capsule of Advil 10 mg                   |DRUG     |
|52   |61 |for 5 days                                 |DURATION |
|67   |109|magnesium hydroxide 100mg/1ml suspension PO|DRUG     |
+-----+---+-------------------------------------------+---------+

// token level result
+-----+---+----------+----------+
|begin|end|word      |chunk     |
+-----+---+----------+----------+
|0    |2  |The       |0         |
|4    |10 |patient   |0         |
|12   |14 |was       |0         |
|16   |25 |prescribed|0         |
|27   |27 |1         |B-DRUG    |
|29   |35 |capsule   |I-DRUG    |
|37   |38 |of        |I-DRUG    |
|40   |44 |Advil     |I-DRUG    |
|46   |47 |10        |I-DRUG    |
|49   |50 |mg        |I-DRUG    |
|52   |54 |for       |B-DURATION|
|56   |56 |5         |I-DURATION|
|58   |61 |days      |I-DURATION|
|63   |65 |and       |0         |
|67   |75 |magnesium |B-DRUG    |
|77   |85 |hydroxide |I-DRUG    |
|87   |95 |100mg/1ml |I-DRUG    |
|97   |106|suspension|I-DRUG    |
|108  |109|PO        |I-DRUG    |
|110  |110|.         |0         |
+-----+---+----------+----------+

import spark.implicits._

// Pipeline stages are defined where NER is done. NER is converted to chunks. 
val docAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
  .setInputCols("document") 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embs") 

val nerModel = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
  .setInputCols(Array("sentence","token","embs")) 
  .setOutputCol("ner") 

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence","token","ner"))
  .setOutputCol("ner_chunk") 

// Define the IOB tagger, which needs tokens and chunks as input. Show results.
val iobTagger = new IOBTagger()
  .setInputCols(Array("token","ner_chunk")) 
  .setOutputCol("ner_label") 

val pipeline = new Pipeline().setStages(Array(
                                              docAssembler,
                                              sentenceDetector,
                                              tokenizer,
                                              embeddings, 
                                              nerModel, 
                                              nerConverter,
                                              iobTagger)) 

val text = """In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)")."""
val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)

// chunk level result
+-----+---+------------------------------+---------+
|begin|end|ner_chunk                     |ner_label|
+-----+---+------------------------------+---------+
|39   |68 |Spell Security Private Limited|ORG      |
|86   |99 |Spell Security                |ALIAS    |
|129  |138|Compliance                    |PRODUCT  |
|150  |166|Policy Compliance             |PRODUCT  |
|169  |170|PC                            |ALIAS    |
+-----+---+------------------------------+---------+

// token level result
+-----+---+------------+---------+
|begin|end|word        |chunk    |
+-----+---+------------+---------+
|0    |1  |In          |0        |
|3    |6  |2020        |0        |
|7    |7  |,           |0        |
|9    |10 |we          |0        |
|12   |19 |acquired    |0        |
|21   |27 |certain     |0        |
|29   |34 |assets      |0        |
|36   |37 |of          |0        |
|39   |43 |Spell       |B-ORG    |
|45   |52 |Security    |I-ORG    |
|54   |60 |Private     |I-ORG    |
|62   |68 |Limited     |I-ORG    |
|70   |70 |(           |0        |
|71   |74 |also        |0        |
|76   |80 |known       |0        |
|82   |83 |as          |0        |
|85   |85 |"           |0        |
|86   |90 |Spell       |B-ALIAS  |
|92   |99 |Security    |I-ALIAS  |
|100  |102|").         |0        |
|104  |107|More        |0        |
|109  |120|specifically|0        |
|121  |121|,           |0        |
|123  |127|their       |0        |
|129  |138|Compliance  |B-PRODUCT|
|140  |146|product     |0        |
|148  |148|-           |0        |
|150  |155|Policy      |B-PRODUCT|
|157  |166|Compliance  |I-PRODUCT|
|168  |168|(           |0        |
|169  |170|PC          |B-ALIAS  |
|171  |174|)").        |0        |
+-----+---+------------+---------+

import spark.implicits._

// Pipeline stages are defined where NER is done. NER is converted to chunks. 
val docAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
  .setInputCols("document") 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols("sentence") 
  .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embs") 

val ner_model = LegalNerModel.pretrained("legner_orgs_prods_alias","en","legal/models")
  .setInputCols(Array("sentence","token","embs")) 
  .setOutputCol("ner") 

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 
  
// Define the IOB tagger, which needs tokens and chunks as input. Show results.
val iobTagger = new IOBTagger()
  .setInputCols(Array("token","ner_chunk")) 
  .setOutputCol("ner_label") 

val pipeline = new Pipeline().setStages(Array(
                                              docAssembler,
                                              sentenceDetector,
                                              tokenizer,
                                              embeddings, 
                                              ner_model,
                                              nerConverter,
                                              iobTagger)) 

val text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""
val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)

// chunk level result
+-----+---+-----------------------------------+---------+
|begin|end|ner_chunk                          |ner_label|
+-----+---+-----------------------------------+---------+
|141  |165|Armstrong Flooring, Inc.,          |ORG      |
|192  |197|Seller                             |ALIAS    |
|205  |221|AFI Licensing LLC                  |ORG      |
|263  |271|Licensing                          |ALIAS    |
|292  |297|Seller                             |ALIAS    |
|301  |307|Arizona                            |ALIAS    |
|315  |330|AHF Holding, Inc                   |ORG      |
|399  |403|Buyer                              |ALIAS    |
|411  |445|Armstrong Hardwood Flooring Company|ORG      |
|478  |484|Company                            |ALIAS    |
|505  |509|Buyer                              |ALIAS    |
|516  |529|Buyer Entities                     |ALIAS    |
|542  |548|Arizona                            |ALIAS    |
|574  |587|Buyer Entities                     |ALIAS    |
|611  |615|Party                              |ALIAS    |
|641  |647|Parties                            |ALIAS    |
+-----+---+-----------------------------------+---------+

// token level result
+-----+---+------------+-------+
|begin|end|word        |chunk  |
+-----+---+------------+-------+
|0    |3  |This        |0      |
|5    |16 |INTELLECTUAL|0      |
|18   |25 |PROPERTY    |0      |
|27   |35 |AGREEMENT   |0      |
|37   |37 |(           |0      |
|38   |41 |this        |0      |
|43   |43 |"           |0      |
|44   |52 |Agreement   |0      |
|53   |55 |"),         |0      |
|57   |61 |dated       |0      |
|63   |64 |as          |0      |
|66   |67 |of          |0      |
|69   |76 |December    |0      |
|78   |79 |31          |0      |
|80   |80 |,           |0      |
|82   |85 |2018        |0      |
|87   |87 |(           |0      |
|88   |90 |the         |0      |
|92   |92 |"           |0      |
|93   |101|Effective   |0      |
|103  |106|Date        |0      |
|107  |108|")          |0      |
|110  |111|is          |0      |
|113  |119|entered     |0      |
|121  |124|into        |0      |
|126  |127|by          |0      |
|129  |131|and         |0      |
|133  |139|between     |0      |
|141  |149|Armstrong   |B-ORG  |
|151  |158|Flooring    |I-ORG  |
|159  |159|,           |I-ORG  |
|161  |163|Inc         |I-ORG  |
|164  |165|.,          |I-ORG  |
|167  |167|a           |0      |
|169  |176|Delaware    |0      |
|178  |188|corporation |0      |
|190  |191|("          |0      |
|192  |197|Seller      |B-ALIAS|
|198  |199|")          |0      |
|201  |203|and         |0      |
|205  |207|AFI         |B-ORG  |
|209  |217|Licensing   |I-ORG  |
|219  |221|LLC         |I-ORG  |
|222  |222|,           |0      |
|224  |224|a           |0      |
|226  |233|Delaware    |0      |
|235  |241|limited     |0      |
|243  |251|liability   |0      |
|253  |259|company     |0      |
|261  |262|("          |0      |
+-----+---+------------+-------+
only showing top 50 rows

InternalDocumentSplitter

Model

InternalDocumentSplitter splits large documents into smaller documents. Use the setSplitMode method to choose how documents are split.

If splitMode is recursive, it takes the separators in order and splits subtexts that are over the chunk length, considering the optional overlap of the chunks.
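The recursive idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the annotator's algorithm: recursive_split is a hypothetical helper, separators are treated as literal strings (as with setPatternsAreRegex(False)), whitespace is trimmed, separators are dropped at chunk borders, and chunk overlap is left out to keep the sketch short.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried in order; a part that is still too long is split
    again with the remaining separators.
    Illustrative only: a hypothetical helper, without chunk overlap.
    """
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-size character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= chunk_size:
            # The part still fits into the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: split it with the next separators.
            chunks.extend(recursive_split(part, remaining, chunk_size))
    if current:
        chunks.append(current)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is a little longer than the limit allows.\n"
    "Third line."
)
chunks = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Trying the coarsest separator first keeps paragraphs and lines together whenever they fit, and falls back to word-level splits only for text that is otherwise too long.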

Additionally, you can set

  • custom patterns with setSplitPatterns
  • whether patterns should be interpreted as regex with setPatternsAreRegex
  • whether to keep the separators with setKeepSeparators
  • whether to trim whitespaces with setTrimWhitespace
  • whether to explode the splits to individual rows with setExplodeSplits

Parameters:

  • chunkSize: Size of each chunk of text. This param is applicable only for “recursive” splitMode.
  • chunkOverlap: Length of the overlap between text chunks, by default 0. This param is applicable only for recursive splitMode.
  • splitPatterns: Patterns to split the document.
  • patternsAreRegex: Whether to interpret the split patterns as regular expressions, by default True.
  • keepSeparators: Whether to keep the separators in the final result, by default True. This param is applicable only for “recursive” splitMode.
  • explodeSplits: Whether to explode split chunks to separate rows, by default False.
  • trimWhitespace: Whether to trim whitespaces of extracted chunks, by default True.
  • splitMode: The split mode to determine how text should be segmented. Default: ‘regex’. It should be one of the following values:
    • “char”: Split text based on individual characters.
    • “token”: Split text based on tokens. You should supply tokens from inputCols.
    • “sentence”: Split text based on sentences. You should supply sentences from inputCols.
    • “recursive”: Split text recursively using a specific algorithm.
    • “regex”: Split text based on a regular expression pattern.
  • sentenceAwareness: Whether to respect sentence boundaries when splitting the document, if possible.
    • If true, a split can end before maxLength is reached.
    • If true, you should supply sentences from inputCols. Default: False.
    • This param is not applicable for regex and recursive splitMode.
  • maxLength: The maximum length allowed for splitting. The unit and the default value depend on the splitMode:
    • “char”: Maximum length is measured in characters. Default: 512
    • “token”: Maximum length is measured in tokens. Default: 128
    • “sentence”: Maximum length is measured in sentences. Default: 8
  • customBoundsStrategy: The custom bounds strategy for text splitting using regular expressions. This param is applicable only for regex splitMode.
  • caseSensitive: Whether regex matching is case sensitive, by default False. This param is applicable only for regex splitMode.
  • metaDataFields: Names of the columns whose values should be added to the metadata of the split documents.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Notebook: InternalDocumentSplitterNotebook

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("recursive")\
    .setChunkSize(100)\
    .setChunkOverlap(3)\
    .setExplodeSplits(True)\
    .setPatternsAreRegex(False)\
    .setSplitPatterns(["\n\n", "\n", " "])\
    .setKeepSeparators(False)\
    .setTrimWhitespace(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

df = spark.createDataFrame([[(
    "The patient is a 28-year-old, who is status post gastric bypass surgery"
    " nearly one year ago. \nHe has lost about 200 pounds and was otherwise doing well"
    " until yesterday evening around 7:00-8:00 when he developed nausea and right upper quadrant pain,"
    " which apparently wrapped around toward his right side and back. He feels like he was on it"
    " but has not done so. He has overall malaise and a low-grade temperature of 100.3."
    " \n\nHe denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday."
    " He denies any outright chills or blood per rectum."
)]]).toDF("text")


pipeline_df = pipeline.fit(df).transform(df)
pipeline_df.select("splits").show(truncate=False)

## Result

+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 92, The patient is a 28-year-old, who is status post gastric bypass surgery nearly one year ago., {sentence -> 0, document -> 0}, []}]          |
|[{document, 94, 192, He has lost about 200 pounds and was otherwise doing well until yesterday evening around 7:00-8:00, {sentence -> 0, document -> 1}, []}]  |
|[{document, 193, 291, when he developed nausea and right upper quadrant pain, which apparently wrapped around toward his, {sentence -> 0, document -> 2}, []}] |
|[{document, 288, 387, his right side and back. He feels like he was on it but has not done so. He has overall malaise and, {sentence -> 0, document -> 3}, []}]|
|[{document, 384, 421, and a low-grade temperature of 100.3., {sentence -> 0, document -> 4}, []}]                                                              |
|[{document, 424, 520, He denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday. He, {sentence -> 0, document -> 5}, []}]   |
|[{document, 518, 568, He denies any outright chills or blood per rectum., {sentence -> 0, document -> 6}, []}]                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
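
The settings above request recursive splitting: the split patterns are tried in order ("\n\n", then "\n", then " ") until each piece fits within the chunk size. The following is a minimal, library-independent sketch of that idea in plain Python. The function name `recursive_split` is hypothetical, overlap handling is deliberately left out, and this is not the actual InternalDocumentSplitter implementation, so its chunks will not match the table above exactly.

```python
def recursive_split(text, separators, chunk_size):
    """Toy recursive splitter: try separators in order until pieces fit."""
    # Small enough already, or nothing left to split on: keep as one chunk.
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep growing the current chunk while it still fits.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too long: fall back to the next separator.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


for chunk in recursive_split("para one is here\n\npara two", ["\n\n", "\n", " "], 10):
    print(repr(chunk))
```

A piece that is longer than the chunk size and contains none of the separators is returned unsplit in this sketch.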

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val document_splitter = new InternalDocumentSplitter()
    .setInputCols("document")
    .setOutputCol("splits")
    .setSplitMode("recursive")
    .setChunkSize(100)
    .setChunkOverlap(3)
    .setExplodeSplits(true)
    .setPatternsAreRegex(false)
    .setSplitPatterns(Array("\n\n", "\n", " "))
    .setKeepSeparators(false)
    .setTrimWhitespace(true)

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    document_splitter ))


// Scala does not concatenate adjacent string literals, so join them with +
val test_data = Seq(
    "The patient is a 28-year-old, who is status post gastric bypass surgery" +
    " nearly one year ago. \nHe has lost about 200 pounds and was otherwise doing well" +
    " until yesterday evening around 7:00-8:00 when he developed nausea and right upper quadrant pain," +
    " which apparently wrapped around toward his right side and back. He feels like he was on it" +
    " but has not done so. He has overall malaise and a low-grade temperature of 100.3." +
    " \n\nHe denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday." +
    " He denies any outright chills or blood per rectum."
).toDF("text")

val res = pipeline.fit(test_data).transform(test_data)

// Show results

+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 92, The patient is a 28-year-old, who is status post gastric bypass surgery nearly one year ago., {sentence -> 0, document -> 0}, []}]          |
|[{document, 94, 192, He has lost about 200 pounds and was otherwise doing well until yesterday evening around 7:00-8:00, {sentence -> 0, document -> 1}, []}]  |
|[{document, 193, 291, when he developed nausea and right upper quadrant pain, which apparently wrapped around toward his, {sentence -> 0, document -> 2}, []}] |
|[{document, 288, 387, his right side and back. He feels like he was on it but has not done so. He has overall malaise and, {sentence -> 0, document -> 3}, []}]|
|[{document, 384, 421, and a low-grade temperature of 100.3., {sentence -> 0, document -> 4}, []}]                                                              |
|[{document, 424, 520, He denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday. He, {sentence -> 0, document -> 5}, []}]   |
|[{document, 518, 568, He denies any outright chills or blood per rectum., {sentence -> 0, document -> 6}, []}]                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

NameChunkObfuscator

The NameChunkObfuscator annotator transforms a dataset with an input annotation of type CHUNK into its obfuscated version by obfuscating the given chunks. It replaces name entities with consistent fake names and leaves all other entities unchanged.

Obfuscation refers to the process of de-identifying or removing sensitive patient information, known as protected health information (PHI), from clinical notes or other healthcare documents. The purpose of PHI obfuscation is to protect patient privacy and comply with regulations such as the Health Insurance Portability and Accountability Act (HIPAA).

It is important to note that the obfuscation should be done carefully to ensure that the de-identified data cannot be re-identified. Organizations must follow best practices and adhere to applicable regulations to protect patient privacy and maintain data security.

Parameters:

  • seed: The seed used to select names during obfuscation. With the same seed, you can replay an execution several times and get the same output.

  • obfuscateRefSource: The source of the replacement names: ‘both’, ‘faker’, or ‘file’. Default: ‘both’.

  • language: The language used to select the faker names. Supported values: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), or ‘ro’ (Romanian). Default: ‘en’.

  • sameLength: Whether to select replacement names of the same length as the original ones during obfuscation. Example: ‘John’ -> ‘Mike’. Default: true.

  • nameEntities: The entities to obfuscate. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’.

  • genderAwareness: Whether to use gender-aware names during obfuscation. This parameter affects only names. Default: False.
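
To make the interplay of `seed`, `sameLength`, and consistent replacement more concrete, here is a small plain-Python sketch. It is an illustration only, not how the annotator is implemented: the `FAKE_NAMES` pool and the `make_obfuscator` helper are hypothetical, and "consistent" is assumed to mean that the same original name always maps to the same fake name within a run.

```python
import random

# Hypothetical pool of replacement names (the real annotator uses its own sources).
FAKE_NAMES = ["Mike", "Anna", "Clifford", "Jeremiah", "Bowman", "Hansen"]


def make_obfuscator(seed, same_length=True):
    """Return a function that maps each original name to a stable fake name."""
    rng = random.Random(seed)   # fixed seed -> reproducible choices
    mapping = {}                # remembers earlier replacements for consistency

    def obfuscate(name):
        if name not in mapping:
            pool = [n for n in FAKE_NAMES if n != name]
            if same_length:
                same = [n for n in pool if len(n) == len(name)]
                # Fall back to the full pool when no same-length name exists.
                pool = same or pool
            mapping[name] = rng.choice(pool)
        return mapping[name]

    return obfuscate


obfuscate = make_obfuscator(seed=42)
first = obfuscate("John")
print(first, obfuscate("John") == first)   # the same input is replaced the same way
```

Creating a second obfuscator with the same seed and feeding it the same names in the same order reproduces the same replacements, which is the behavior the `seed` parameter describes.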

Input Annotator Types: CHUNK

Output Annotator Type: CHUNK

Python API: NameChunkObfuscator Scala API: NameChunkObfuscator Notebook: NameChunkObfuscatorNotebook

from johnsnowlabs import medical, nlp
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nameChunkObfuscator = medical.NameChunkObfuscator()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefSource("faker")\
  .setNameEntities(["DOCTOR", "PATIENT"])\
  .setGenderAwareness(True)

replacer_name = medical.Replacer()\
  .setInputCols("replacement","sentence")\
  .setOutputCol("obfuscated_sentence_name")\
  .setUseReplacement(True)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

## sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), 
              F.expr("cols['1']").alias("obfuscated_sentence_name"))


| sentence                                          | obfuscated_sentence_name                                  |
| ------------------------------------------------- | --------------------------------------------------------- |
| Record date : 2093-01-13 , David Hale , M.D .     | Record date : 2093-01-13 , Richardson , M.D .             |
| , Patient name : Michael , MR # 7194334 Date ...  | , Patient name : Thaxter , MR # 7194334 Date ...          |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Adelaida , 25 years-old , Record date : ...         |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str...         |
| Analyzed by Dr. Jennifer .                        | Analyzed by Dr. Morganne .                                |

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val nameChunkObfuscator = new NameChunkObfuscator()
    .setInputCols("ner_chunk")
    .setOutputCol("replacement")
    .setObfuscateRefSource("faker")
    .setNameEntities(Array("DOCTOR", "PATIENT"))
    .setGenderAwareness(true)

val replacer_name = new Replacer()
    .setInputCols(Array("replacement","sentence"))
    .setOutputCol("obfuscated_sentence_name")
    .setUseReplacement(true)

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name))

val data = Seq("Record date : 2093-01-13 , David Hale , M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .").toDF("text")

val result = nlpPipeline.fit(data).transform(data)


| sentence                                          | obfuscated_sentence_name                                  |
| ------------------------------------------------- | --------------------------------------------------------- |
| Record date : 2093-01-13 , David Hale , M.D .     | Record date : 2093-01-13 , Richardson , M.D .             |
| , Patient name : Michael , MR # 7194334 Date ...  | , Patient name : Thaxter , MR # 7194334 Date ...          |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Adelaida , 25 years-old , Record date : ...         |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str...         |
| Analyzed by Dr. Jennifer .                        | Analyzed by Dr. Morganne .                                |

NameChunkObfuscatorApproach is an annotator used in de-identification tasks to replace doctor and patient names with fake names taken from a reference document.

Input Annotator Types: CHUNK

Output Annotator Type: CHUNK

Python API: NameChunkObfuscatorApproach Scala API: NameChunkObfuscatorApproach Notebook: NameChunkObfuscatorApproachNotebook

from johnsnowlabs import medical, nlp
from pyspark.sql import functions as F

names = """Mitchell-NAME
Clifford-NAME
Jeremiah-NAME
Lawrence-NAME
Brittany-NAME
Patricia-NAME
Jennifer-NAME
Jackson-NAME
Leonard-NAME
Randall-NAME
Camacho-NAME
Ferrell-NAME
Mueller-NAME
Bowman-NAME
Hansen-NAME
"""

with open('names_test2.txt', 'w') as file:
    file.write(names)

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefFile("names_test2.txt")\
  .setObfuscateRefSource("file")\
  .setRefFileFormat("csv")\
  .setRefSep("-")

replacer_name = medical.Replacer()\
  .setInputCols("replacement","sentence")\
  .setOutputCol("obfuscated_sentence_name")\
  .setUseReplacement(True)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

## sample data
text ='''
M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name"))

| sentence                                          | obfuscated_sentence_name                          | 
| ------------------------------------------------- | ------------------------------------------------- |
| M.D .                                             | M.D .                                             |
| , Patient name : Michael , MR # 7194334 Date ...  | , Patient name : Ferrell , MR # 7194334 Date ...  |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Clifford , 25 years-old , Record date : ... |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| Analyzed by Dr. Jennifer .                        | Analyzed by Dr. Jennifer .                        |
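
The reference file in the example above holds one `name-ENTITY` pair per line, with `-` as the separator passed to `setRefSep`. As a rough illustration of that layout (not the library's own parser), the sketch below reads such text into a dictionary keyed by entity label. The helper name `parse_reference_lines` is hypothetical, and splitting on the last separator is an assumption made here so that hyphenated names would stay intact.

```python
def parse_reference_lines(text, sep="-"):
    """Group reference names by entity label from 'name<sep>ENTITY' lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and lines without a separator.
        if not line or sep not in line:
            continue
        # Split on the LAST separator (assumption: the label never contains it).
        name, _, entity = line.rpartition(sep)
        entries.setdefault(entity, []).append(name)
    return entries


sample = """Mitchell-NAME
Clifford-NAME
Jeremiah-NAME
"""
print(parse_reference_lines(sample))
```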

import spark.implicits._

val names = """Mitchell-NAME
Clifford-NAME
Jeremiah-NAME
Lawrence-NAME
Brittany-NAME
Patricia-NAME
Jennifer-NAME
Jackson-NAME
Leonard-NAME
Randall-NAME
Camacho-NAME
Ferrell-NAME
Mueller-NAME
Bowman-NAME
Hansen-NAME
"""
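// Write the reference names to the file passed to setObfuscateRefFile
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

Files.write(Paths.get("names_test2.txt"), names.getBytes(StandardCharsets.UTF_8))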

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val nameChunkObfuscator = new NameChunkObfuscatorApproach()
    .setInputCols("ner_chunk")
    .setOutputCol("replacement")
    .setObfuscateRefFile("names_test2.txt")
    .setObfuscateRefSource("file")
    .setRefFileFormat("csv")
    .setRefSep("-")

val replacer_name = new Replacer()
    .setInputCols(Array("replacement","sentence"))
    .setOutputCol("obfuscated_sentence_name")
    .setUseReplacement(true)

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name))

val data = Seq("M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .").toDF("text")

val res = nlpPipeline.fit(data).transform(data)

| sentence                                          | obfuscated_sentence_name                          | 
| ------------------------------------------------- | ------------------------------------------------- |
| M.D .                                             | M.D .                                             |
| , Patient name : Michael , MR # 7194334 Date ...  | , Patient name : Ferrell , MR # 7194334 Date ...  |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Clifford , 25 years-old , Record date : ... |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| Analyzed by Dr. Jennifer .                        | Analyzed by Dr. Jennifer .                        |

NerChunker

Model

Extracts phrases that fit a known pattern using the NER tags. This is useful for grouping entities together with their neighboring tokens when no pretrained NER model covers the case. A regex must be provided to extract the tokens between entities.

Parameter:

  • setRegexParsers: Array of grammar-based chunk parsers.
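
As a rough mental model of how a pattern such as `<ImagingFindings>.*<BodyPart>` selects a span, the plain-Python sketch below writes the NER labels as a string of `<Label>` tags, runs the regex over that string, and maps the match back to tokens. This is a simplified illustration under stated assumptions, not the NerChunker implementation: the helper `chunk_by_tag_pattern` is hypothetical, IOB prefixes are simply stripped, and tokens are rejoined with single spaces instead of using the original text offsets.

```python
import re


def chunk_by_tag_pattern(tokens, iob_tags, pattern):
    """Return token spans whose label sequence matches `pattern`."""
    # Strip IOB prefixes: "B-BodyPart" -> "BodyPart"; "O" stays "O".
    labels = [t.split("-", 1)[1] if "-" in t else t for t in iob_tags]

    # Build "<L1><L2>..." and record where each token's tag starts and ends.
    starts, ends, pieces, pos = {}, {}, [], 0
    for i, label in enumerate(labels):
        piece = f"<{label}>"
        starts[pos] = i
        pos += len(piece)
        ends[pos] = i
        pieces.append(piece)
    tag_string = "".join(pieces)

    chunks = []
    for match in re.finditer(pattern, tag_string):
        # Keep only matches that begin and end on tag boundaries.
        if match.start() in starts and match.end() in ends:
            first, last = starts[match.start()], ends[match.end()]
            chunks.append(" ".join(tokens[first:last + 1]))
    return chunks


tokens = ["She", "has", "cystic", "cyst", "on", "her", "kidney", "."]
tags = ["O", "O", "B-ImagingFindings", "I-ImagingFindings", "O", "O", "B-BodyPart", "O"]
print(chunk_by_tag_pattern(tokens, tags, "<ImagingFindings>.*<BodyPart>"))
```

Because `.*` is greedy, the match runs from the first `ImagingFindings` tag to the last `BodyPart` tag in the sentence, which is why untagged tokens such as "on her" end up inside the chunk.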

Input Annotator Types: DOCUMENT, NAMED_ENTITY

Output Annotator Type: CHUNK

Python API: NerChunker Scala API: NerChunker Notebook: NerChunkerNotebook

from johnsnowlabs import nlp, medical
# Defining pipeline stages for NER

documentAssembler= nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector= nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence") \
  .setUseAbbreviations(False)

tokenizer= nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence","token"]) \
  .setOutputCol("embeddings") \
  .setCaseSensitive(False)

ner = medical.NerModel.pretrained("ner_radiology", "en", "clinical/models") \
  .setInputCols(["sentence","token","embeddings"]) \
  .setOutputCol("ner") \
  .setIncludeConfidence(True)

# Define the NerChunker to combine the tagged tokens into chunks
chunker = medical.NerChunker() \
  .setInputCols(["sentence","ner"]) \
  .setOutputCol("ner_chunk") \
  .setRegexParsers(["<ImagingFindings>.*<BodyPart>"])

pipeline= nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner,
  chunker
])

data= spark.createDataFrame([["She has cystic cyst on her kidney."]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")\
      .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)
+------+-----------------+
|word  |ner              |
+------+-----------------+
|She   |O                |
|has   |O                |
|cystic|B-ImagingFindings|
|cyst  |I-ImagingFindings|
|on    |O                |
|her   |O                |
|kidney|B-BodyPart       |
|.     |O                |
+------+-----------------+

result.select("ner_chunk.result").show(truncate=False)
+---------------------------+
|result                     |
+---------------------------+
|[cystic cyst on her kidney]|
+---------------------------+

from johnsnowlabs import nlp, finance
# Defining pipeline stages for NER

documentAssembler= nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector= nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence") 

tokenizer= nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")\
  .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'", '%', '&'])

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
  .setInputCols("sentence", "token") \
  .setOutputCol("embeddings")\
  .setMaxSentenceLength(512)\
  .setCaseSensitive(True)

ner_model = finance.NerModel.pretrained("finner_responsibility_reports_md", "en", "finance/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

# Define the NerChunker to combine the tagged tokens into chunks
chunker = finance.NerChunker() \
  .setInputCols(["sentence","ner"]) \
  .setOutputCol("ner_chunk") \
  .setRegexParsers(["<ENVIRONMENTAL_KPI>.*<AMOUNT>"])

pipeline= nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
])

data= spark.createDataFrame([["""The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021."""]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")\
      .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)

+---------+--------------------+
|word     |ner                 |
+---------+--------------------+
|The      |O                   |
|company  |O                   |
|has      |O                   |
|reduced  |O                   |
|its      |O                   |
|direct   |B-ENVIRONMENTAL_KPI |
|GHG      |I-ENVIRONMENTAL_KPI |
|emissions|I-ENVIRONMENTAL_KPI |
|from     |O                   |
|12,135   |B-AMOUNT            |
|million  |I-AMOUNT            |
|tonnes   |B-ENVIRONMENTAL_UNIT|
|of       |I-ENVIRONMENTAL_UNIT|
|CO2e     |I-ENVIRONMENTAL_UNIT|
|in       |O                   |
|2017     |B-DATE_PERIOD       |
|to       |O                   |
|4        |B-AMOUNT            |
|million  |I-AMOUNT            |
|tonnes   |B-ENVIRONMENTAL_UNIT|
+---------+--------------------+

result.select("ner_chunk.result").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million, indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million, scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

from johnsnowlabs import nlp, legal
# Defining pipeline stages for NER

documentAssembler= nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector= nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer= nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

# Define the NerChunker to combine the tagged tokens into chunks
chunker = legal.NerChunker() \
  .setInputCols(["sentence","ner"]) \
  .setOutputCol("ner_chunk") \
  .setRegexParsers(["<PERSON>.*<ROLE>"])

pipeline= nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
])

data= spark.createDataFrame([["""Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"""]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")\
      .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)

+------------+--------+
|word        |ner     |
+------------+--------+
|Jeffrey     |B-PERSON|
|Preston     |I-PERSON|
|Bezos       |I-PERSON|
|is          |O       |
|an          |O       |
|American    |O       |
|entrepreneur|O       |
|,           |O       |
|founder     |B-ROLE  |
|and         |O       |
|CEO         |B-ROLE  |
|of          |O       |
|Amazon      |B-ORG   |
+------------+--------+

result.select("ner_chunk.result").show(truncate=False)

+--------------------------------------------------------------------+
|result                                                              |
+--------------------------------------------------------------------+
|[Jeffrey Preston Bezos is an American entrepreneur, founder and CEO]|
+--------------------------------------------------------------------+

import spark.implicits._

// Defining pipeline stages for NER

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setUseAbbreviations(false)

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence","token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models")
  .setInputCols(Array("sentence","token","embeddings"))
  .setOutputCol("ner")
  .setIncludeConfidence(true)

// Define the NerChunker to combine the tagged tokens into chunks
val chunker = new NerChunker()
  .setInputCols(Array("sentence","ner"))
  .setOutputCol("ner_chunk")
  .setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner,
  chunker
))

val data = Seq(
  """She has cystic cyst on her kidney."""
).toDF("text")
val result = pipeline.fit(data).transform(data)

// Show results:

+------+-----------------+
|word  |ner              |
+------+-----------------+
|She   |O                |
|has   |O                |
|cystic|B-ImagingFindings|
|cyst  |I-ImagingFindings|
|on    |O                |
|her   |O                |
|kidney|B-BodyPart       |
|.     |O                |
+------+-----------------+

+---------------------------+
|result                     |
+---------------------------+
|[cystic cyst on her kidney]|
+---------------------------+

import spark.implicits._

// Defining pipeline stages for NER
val documentAssembler= new DocumentAssembler() 
  .setInputCol("text") 
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer= new Tokenizer() 
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") 
  .setInputCols(Array("sentence", "token")) 
  .setOutputCol("embeddings")

val ner_model = FinanceNerModel.pretrained("finner_responsibility_reports_md", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

// Define the NerChunker to combine the tagged tokens into chunks
val chunker = new NerChunker() 
  .setInputCols(Array("sentence","ner")) 
  .setOutputCol("ner_chunk") 
  .setRegexParsers(Array("<ENVIRONMENTAL_KPI>.*<AMOUNT>"))

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
))

val data = Seq(
  """The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021."""
).toDF("text")
val result = pipeline.fit(data).transform(data)

// Show results:

+---------+--------------------+
|word     |ner                 |
+---------+--------------------+
|The      |O                   |
|company  |O                   |
|has      |O                   |
|reduced  |O                   |
|its      |O                   |
|direct   |B-ENVIRONMENTAL_KPI |
|GHG      |I-ENVIRONMENTAL_KPI |
|emissions|I-ENVIRONMENTAL_KPI |
|from     |O                   |
|12,135   |B-AMOUNT            |
|million  |I-AMOUNT            |
|tonnes   |B-ENVIRONMENTAL_UNIT|
|of       |I-ENVIRONMENTAL_UNIT|
|CO2e     |I-ENVIRONMENTAL_UNIT|
|in       |O                   |
|2017     |B-DATE_PERIOD       |
|to       |O                   |
|4        |B-AMOUNT            |
|million  |I-AMOUNT            |
|tonnes   |B-ENVIRONMENTAL_UNIT|
+---------+--------------------+


+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million, indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million, scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
// Defining pipeline stages for NER

val documentAssembler= new DocumentAssembler() 
  .setInputCol("text") 
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer= new Tokenizer() 
  .setInputCols("sentence") 
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") 
  .setInputCols(Array("sentence", "token")) 
  .setOutputCol("embeddings")

val ner_model = LegalNerModel.pretrained("legner_org_per_role_date", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

// Define the NerChunker to combine NER tags into chunks
val chunker = new NerChunker() 
  .setInputCols(Array("sentence","ner")) 
  .setOutputCol("ner_chunk") 
  .setRegexParsers(Array("<PERSON>.*<ROLE>"))

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
))

val data = Seq(
  """Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"""
).toDF("text")
val result = pipeline.fit(data).transform(data)

// Show results:

+------------+--------+
|word        |ner     |
+------------+--------+
|Jeffrey     |B-PERSON|
|Preston     |I-PERSON|
|Bezos       |I-PERSON|
|is          |O       |
|an          |O       |
|American    |O       |
|entrepreneur|O       |
|,           |O       |
|founder     |B-ROLE  |
|and         |O       |
|CEO         |B-ROLE  |
|of          |O       |
|Amazon      |B-ORG   |
+------------+--------+


+--------------------------------------------------------------------+
|result                                                              |
+--------------------------------------------------------------------+
|[Jeffrey Preston Bezos is an American entrepreneur, founder and CEO]|
+--------------------------------------------------------------------+
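
The chunk above runs from the first PERSON token to the last ROLE token because the pattern is matched against the sequence of entity labels, not against the text itself. The sketch below illustrates that idea in plain Python. It is not the NerChunker implementation: the helper name `chunk_by_tag_regex` and the `<LABEL>` encoding of each token are assumptions made for illustration only.

```python
import re

def chunk_by_tag_regex(tokens, tags, pattern):
    """Return token spans whose label sequence matches `pattern` (illustrative only)."""
    # Drop the B-/I- prefix and encode every token as "<LABEL>", e.g. "<PERSON>" or "<O>".
    encoded = ["<%s>" % (tag.split("-", 1)[1] if "-" in tag else tag) for tag in tags]

    # Record where each token's encoded label starts and ends in the joined string.
    starts, ends, pos = [], [], 0
    for item in encoded:
        starts.append(pos)
        pos += len(item)
        ends.append(pos)
    tag_string = "".join(encoded)

    chunks = []
    for match in re.finditer(pattern, tag_string):
        # Keep every token whose encoded label overlaps the matched region.
        covered = [i for i in range(len(tokens))
                   if starts[i] < match.end() and ends[i] > match.start()]
        if covered:
            chunks.append(" ".join(tokens[covered[0]:covered[-1] + 1]))
    return chunks

tokens = ["Jeffrey", "Preston", "Bezos", "is", "an", "American", "entrepreneur",
          ",", "founder", "and", "CEO", "of", "Amazon"]
tags = ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O", "O", "O",
        "O", "B-ROLE", "O", "B-ROLE", "O", "B-ORG"]

print(chunk_by_tag_regex(tokens, tags, "<PERSON>.*<ROLE>"))
# ['Jeffrey Preston Bezos is an American entrepreneur , founder and CEO']
```

Because `.*` is greedy, the match extends to the last ROLE label in the sentence, which is why "founder and CEO" ends up inside a single chunk.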

NerConverterInternal

Model

Converts an IOB or IOB2 representation of NER into a user-friendly one by associating the tokens of recognized entities with their labels. Chunks with no associated entity (tagged “O”) are filtered out.

Parameters:

  • setThreshold: Confidence threshold.

  • setWhiteList: If defined, list of entities to process.

  • setBlackList: If defined, list of entities to ignore.

  • setReplaceLabels: If defined, contains a dictionary for entity replacement.

  • setPreservePosition: Whether to preserve the original position of the tokens in the original document or use the modified tokens.

  • setReplaceDictResource: If defined, path to the file containing a dictionary for entity replacement.

  • setIgnoreStopWords: If defined, list of stop words to ignore.

  • setGreedyMode: (Boolean) Whether to ignore B tags for contiguous tokens of the same entity.

This licensed annotator extends the open-source NerConverter annotator with the following parameters, which are not available there: blackList, greedyMode, threshold, and ignoreStopWords.

See also Inside–outside–beginning (tagging) for more information.
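
To make the conversion concrete, here is a minimal plain-Python sketch of grouping IOB-tagged tokens into labeled chunks. It is an illustration, not the annotator's implementation: the function name `iob_to_chunks` and its keyword arguments are hypothetical and only mimic the whiteList, blackList, and greedyMode parameters described above.

```python
def iob_to_chunks(tokens, tags, white_list=None, black_list=None, greedy_mode=False):
    """Group IOB/IOB2-tagged tokens into (chunk_text, label) pairs (illustrative only)."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Close the chunk being built, if any.
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()  # tokens tagged "O" never belong to a chunk
            continue
        prefix, label = tag.split("-", 1)
        # An I- tag continues the open chunk of the same label; in greedy mode
        # a B- tag of the same label is treated as a continuation as well.
        continues = label == current_label and (prefix == "I" or greedy_mode)
        if not continues:
            flush()
            current_label = label
        current_tokens.append(token)
    flush()

    # Optional label filters, applied after grouping.
    if white_list is not None:
        chunks = [c for c in chunks if c[1] in white_list]
    if black_list is not None:
        chunks = [c for c in chunks if c[1] not in black_list]
    return chunks

tokens = ["Jeffrey", "Preston", "Bezos", "is", "an", "American", "entrepreneur",
          ",", "founder", "and", "CEO", "of", "Amazon"]
tags = ["B-PERSON", "I-PERSON", "I-PERSON", "O", "O", "O", "O",
        "O", "B-ROLE", "O", "B-ROLE", "O", "B-ORG"]

print(iob_to_chunks(tokens, tags))
# [('Jeffrey Preston Bezos', 'PERSON'), ('founder', 'ROLE'), ('CEO', 'ROLE'), ('Amazon', 'ORG')]
print(iob_to_chunks(tokens, tags, white_list=["PERSON"]))
# [('Jeffrey Preston Bezos', 'PERSON')]
```

With `greedy_mode=True`, two adjacent tokens tagged `B-DOSAGE` would be merged into one chunk instead of producing two.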

Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotator Type: CHUNK

Python API: NerConverterInternal Scala API: NerConverterInternal Notebook: Notebook

from johnsnowlabs import * 
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
embeddings  = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model
nerModel = medical.NerModel.pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

# NER Converter
nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sample_text = """The patient was prescribed 1 capsule of Advil for 5 days.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, metformin 1000 mg two times a day.
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = model.transform(data)
result.select('text', 'ner.result', 'ner_chunk.result').show(truncate = 50)

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                            result|                                            result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|[O, O, O, O, B-DOSAGE, B-FORM, O, B-DRUG, B-DUR...|[1, capsule, Advil, for 5 days, 40 units, insul...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
from johnsnowlabs import * 
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'", '%', '&'])

embeddings  = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# NER model
nerModel = finance.NerModel.pretrained("finner_responsibility_reports_md", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

# NER Converter
nerConverter = finance.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sample_text = """The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021.
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = model.transform(data)
result.select('text', 'ner.result', 'ner_chunk.result').show(truncate = 50)

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                            result|                                            result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The company has reduced its direct GHG emission...|[O, O, O, O, O, B-ENVIRONMENTAL_KPI, I-ENVIRONM...|[direct GHG emissions, 12,135 million, tonnes o...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
from johnsnowlabs import * 
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings  = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# NER model
nerModel = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

# NER Converter
nerConverter = legal.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sample_text = """Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = model.transform(data)
result.select('text', 'ner.result', 'ner_chunk.result').show(truncate = 50)

+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
|                                              text|                                            result|                                       result|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
|Jeffrey Preston Bezos is an American entreprene...|[B-PERSON, I-PERSON, I-PERSON, O, O, O, O, O, B...|[Jeffrey Preston Bezos, founder, CEO, Amazon]|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+

import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP 
val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document")

val sentence_detector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 
 
// Tokenizer splits words in a relevant format for NLP 
val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 
 
// Clinical word embeddings trained on PubMED dataset 
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 
 
// NER model 
val nerModel = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 
 
// NER Converter 
val nerConverter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 

val nlpPipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector, 
    tokenizer, 
    embeddings, 
    nerModel, 
    nerConverter )) 

val empty_data = Seq("") .toDF("text") 
val model = nlpPipeline.fit(empty_data) 

val sample_text = "The patient was prescribed 1 capsule of Advil for 5 days. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, metformin 1000 mg two times a day."

val data = Seq(sample_text) .toDF("text") 
val result = model.transform(data)
result.select("text","ner.result","ner_chunk.result")

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                            result|                                            result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|[O, O, O, O, B-DOSAGE, B-FORM, O, B-DRUG, B-DUR...|[1, capsule, Advil, for 5 days, 40 units, insul...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP 
val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

// Tokenizer splits words in a relevant format for NLP 
val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 
 .setContextChars(Array(".", ",", ";", ":", "!", "?", "*", "-", "(", ")", "\"", "'", "%", "&"))

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 

// NER model 
val nerModel = FinanceNerModel.pretrained("finner_responsibility_reports_md","en","finance/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 
 
// NER Converter 
val nerConverter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 

val nlpPipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer, 
    embeddings,
    nerModel, 
    nerConverter )) 

val empty_data = Seq("") .toDF("text") 
val model = nlpPipeline.fit(empty_data) 

val sample_text = "The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021."
val data = Seq(sample_text) .toDF("text") 

val result = model.transform(data)
result.select("text","ner.result","ner_chunk.result")

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                            result|                                            result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The company has reduced its direct GHG emission...|[O, O, O, O, O, B-ENVIRONMENTAL_KPI, I-ENVIRONM...|[direct GHG emissions, 12,135 million, tonnes o...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP 
val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

// Tokenizer splits words in a relevant format for NLP 
val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 
 
// NER model 
val nerModel = LegalNerModel.pretrained("legner_org_per_role_date","en","legal/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 
 
// NER Converter 
val nerConverter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 

val nlpPipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector, 
    tokenizer, 
    embeddings, 
    nerModel, 
    nerConverter )) 

val empty_data = Seq("") .toDF("text") 
val model = nlpPipeline.fit(empty_data) 

val sample_text = "Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"

val data = Seq(sample_text) .toDF("text") 
val result = model.transform(data)
result.select("text","ner.result","ner_chunk.result")

+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
|                                              text|                                            result|                                       result|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
|Jeffrey Preston Bezos is an American entreprene...|[B-PERSON, I-PERSON, I-PERSON, O, O, O, O, O, B...|[Jeffrey Preston Bezos, founder, CEO, Amazon]|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+

NerDisambiguator

Instantiated / pretrained model of the NerDisambiguator. Links words of interest, such as names of persons, locations, and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms.

Parameters:

  • embeddingTypeParam: (String) ‘bow’ for word embeddings or ‘sentence’ for sentences.

  • numFirstChars: (Int) number of characters to be considered for initial prefix search in the knowledge base.

  • tokenSearch: (BooleanParam) search mechanism in the knowledge base, by token or by chunk (token search is recommended; default value: True).

  • narrowWithApproximateMatching: (BooleanParam) narrow down the prefix search results with Levenshtein distance based matching (True is recommended).

  • levenshteinDistanceThresholdParam: (Float) value of the Levenshtein distance threshold to narrow results from prefix search (default value: 0.1).

  • nearMatchingGapParam: (Int) defines a limit on the string length (by trimming the candidate chunks) during Levenshtein distance-based narrowing, {len(candidate) - len(entity chunk) > nearMatchingGap} (default value: 4).

  • predictionsLimit: (BooleanParam) limits the number of predictions to the top N.

  • s3KnowledgeBaseName: (String) the name of the Knowledge Base in S3.
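
The prefix search and Levenshtein narrowing described by numFirstChars, levenshteinDistanceThresholdParam, and nearMatchingGapParam can be pictured with the plain-Python sketch below. It rests on several assumptions that the text above does not confirm: that the threshold applies to the edit distance divided by the longer string's length, that over-long candidates are trimmed to `len(mention) + gap` characters before comparison, and that matching is case-insensitive. The function names, the default of 4 for `num_first_chars`, and the toy knowledge base are all hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def narrow_candidates(mention, kb_names, num_first_chars=4,
                      threshold=0.1, near_matching_gap=4):
    """Prefix search followed by normalized-Levenshtein narrowing (assumed semantics)."""
    mention_l = mention.lower()
    prefix = mention_l[:num_first_chars]
    kept = []
    for name in kb_names:
        name_l = name.lower()
        # Step 1: prefix search on the first `num_first_chars` characters.
        if not name_l.startswith(prefix):
            continue
        # Step 2: trim candidates that are much longer than the mention (assumption).
        if len(name_l) - len(mention_l) > near_matching_gap:
            name_l = name_l[:len(mention_l) + near_matching_gap]
        # Step 3: keep candidates whose normalized distance is within the threshold.
        distance = levenshtein(mention_l, name_l) / max(len(mention_l), len(name_l))
        if distance <= threshold:
            kept.append(name)
    return kept

# Toy knowledge base, invented for this example.
kb = ["Donald Trump", "Donald Trump Jr.", "Donald Duck", "Donna Summer"]
print(narrow_candidates("Donald Trump", kb))
# ['Donald Trump']
```

In this toy run, "Donna Summer" is rejected by the prefix search, while "Donald Trump Jr." and "Donald Duck" pass the prefix search but exceed the 0.1 threshold.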

Input Annotator Types: CHUNK, SENTENCE_EMBEDDINGS

Output Annotator Type: DISAMBIGUATION

Python API: NerDisambiguatorModel Scala API: NerDisambiguatorModel Notebook: NerDisambiguatorModelNotebook

from johnsnowlabs import nlp,  medical

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["sentence","embeddings"]) \
    .setOutputCol("sentence_embeddings")

ner_model = medical.NerModel.pretrained() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["PER"])

disambiguator = medical.NerDisambiguator() \
    .setS3KnowledgeBaseName("i-per") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("disambiguation") \
    .setTokenSearch(False)

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    ner_model,
    ner_converter,
    disambiguator])

text = """The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)


# Result
result.selectExpr("explode(disambiguation)") \
    .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, truncate=False)

+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk             |result                                                                                                                  |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump      |http://en.wikipedia.org/?curid=55907961, http://en.wikipedia.org/?curid=31698421, http://en.wikipedia.org/?curid=4848272|
|Christina Aguilera|http://en.wikipedia.org/?curid=6636454, http://en.wikipedia.org/?curid=144171                                           |
+------------------+------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence") 
  .setOutputCol("token") 

val word_embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embeddings") 

val sentence_embeddings = new SentenceEmbeddings()
  .setInputCols(Array("sentence","embeddings")) 
  .setOutputCol("sentence_embeddings") 

val ner_model = MedicalNerModel.pretrained()
  .setInputCols(Array("sentence","token","embeddings")) 
  .setOutputCol("ner") 

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 
  .setWhiteList(Array("PER")) 

val disambiguator = new NerDisambiguator()
  .setS3KnowledgeBaseName("i-per") 
  .setInputCols(Array("ner_chunk","sentence_embeddings")) 
  .setOutputCol("disambiguation") 
  .setTokenSearch(false)

val pipeline = new Pipeline().setStages(Array( 
                                              documentAssembler, 
                                              sentenceDetector, 
                                              tokenizer, 
                                              word_embeddings, 
                                              sentence_embeddings, 
                                              ner_model, 
                                              ner_converter, 
                                              disambiguator))
 
val text = "The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..." 

val df = Seq(text) .toDF("text") 
val result = pipeline.fit(df) .transform(df) 

// Result 

+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk             |result                                                                                                                  |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump      |http://en.wikipedia.org/?curid=55907961, http://en.wikipedia.org/?curid=31698421, http://en.wikipedia.org/?curid=4848272|
|Christina Aguilera|http://en.wikipedia.org/?curid=6636454, http://en.wikipedia.org/?curid=144171                                           |
+------------------+------------------------------------------------------------------------------------------------------------------------+

Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. The model needs extracted CHUNKS and SENTENCE_EMBEDDINGS type input from e.g. SentenceEmbeddings and NerConverter.

Input Annotator Types: CHUNK, SENTENCE_EMBEDDINGS

Output Annotator Type: DISAMBIGUATION

Python API: NerDisambiguator Scala API: NerDisambiguator

NerModel

NerModel is the Named Entity Recognition (NER) annotator that allows training generic NER models based on neural networks. The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets.

Note that some pretrained models require specific types of embeddings, depending on the embeddings they were trained with.

Parameters:

  • setBatchSize: (int) number of samples used in one iteration of training (Default: 32).

  • setIncludeConfidence: (Boolean) whether to include confidence scores in annotation metadata (Default: False).

  • setConfigProtoBytes: (int) ConfigProto from tensorflow, serialized into byte array.

  • setIncludeAllConfidenceScores: (Boolean) whether to include confidence scores for all tags rather than just for the predicted one.

  • setMinProbability: (Float) defines the minimum probability value.

For available pretrained models, please see the Models Hub. Additionally, pretrained pipelines are available for this module; see the Pipelines. For extended examples of usage, see the Spark NLP Workshop.

Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotator Type: NAMED_ENTITY

Python API: MedicalNerModel Scala API: MedicalNerModel Notebook: MedicalNerModelNotebook

from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F  # used below for explode/arrays_zip/expr

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("ner_chunk")

jsl_ner_pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    jsl_ner_converter])

text = '''
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl,  creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27.
'''
data = spark.createDataFrame([[text]]).toDF("text")

result = jsl_ner_pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)

+-----------------------------+----------------------------+
|chunk                        |ner_label                   |
+-----------------------------+----------------------------+
|28-year-old                  |Age                         |
|female                       |Gender                      |
|gestational diabetes mellitus|Diabetes                    |
|eight years prior            |RelativeDate                |
|type two diabetes mellitus   |Diabetes                    |
|T2DM                         |Diabetes                    |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder   |
|three years prior            |RelativeDate                |
|acute                        |Modifier                    |
|hepatitis                    |Disease_Syndrome_Disorder   |
|one-week                     |Duration                    |
|polyuria                     |Symptom                     |
|poor appetite                |Symptom                     |
|vomiting                     |Symptom                     |
|She                          |Gender                      |
|metformin                    |Drug_Ingredient             |
|glipizide                    |Drug_Ingredient             |
|dapagliflozin                |Drug_Ingredient             |
|T2DM                         |Diabetes                    |
|atorvastatin                 |Drug_Ingredient             |
|gemfibrozil                  |Drug_Ingredient             |
|HTG                          |Hyperlipidemia              |
|She                          |Gender                      |
|dapagliflozin                |Drug_Ingredient             |
|for six months               |Duration                    |
|dry oral mucosa              |Symptom                     |
|her                          |Gender                      |
|abdominal                    |External_body_part_or_region|
|tenderness                   |Symptom                     |
|guarding                     |Symptom                     |
|rigidity                     |Symptom                     |
|admission                    |Admission_Discharge         |
|serum glucose                |Test                        |
|111 mg/dl                    |Test_Result                 |
|creatinine                   |Test                        |
|0.4 mg/dL                    |Test_Result                 |
|triglycerides                |Triglycerides               |
|508 mg/dL                    |Test_Result                 |
|total cholesterol 122 mg/dL  |Total_Cholesterol           |
|venous pH                    |Test                        |
|7.27                         |Test_Result                 |
+-----------------------------+----------------------------+
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_sec_conll", "en", "finance/models") \
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

text = '''December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender" '''

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)
+--------------------------+---------+
|chunk                     |ner_label|
+--------------------------+---------+
|SILICIUM DE PROVENCE S.A.S|ORG      |
|France                    |LOC      |
|Usine de Saint Auban      |LOC      |
|France                    |LOC      |
|Mr.Frank Wouters          |PER      |
|Borrower                  |PER      |
|EVERGREEN SOLAR INC       |ORG      |
|Delaware                  |LOC      |
|U.S.A                     |LOC      |
|Bartlett Street           |LOC      |
|Marlboro                  |LOC      |
|Massachusetts             |LOC      |
|U.S.A                     |LOC      |
|Richard Chleboski         |PER      |
|Lender                    |PER      |
+--------------------------+---------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])
  
text = """EXCLUSIVE DISTRIBUTOR AGREEMENT (" Agreement ") dated as April 15, 1994 by and between IMRS OPERATIONS INC., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as " Developer ") and Delteq Pte Ltd, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD ) with its principal place of business at 215 Henderson Road , #101-03 Henderson Industrial Park , Singapore 0315 ( hereinafter referred to as " Distributor ")."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)

+-------------------------------+---------+
|chunk                          |ner_label|
+-------------------------------+---------+
|EXCLUSIVE DISTRIBUTOR AGREEMENT|DOC      |
|April 15, 1994                 |EFFDATE  |
|IMRS OPERATIONS INC            |PARTY    |
|Developer                      |ALIAS    |
|Delteq Pte Ltd                 |PARTY    |
|Distributor                    |ALIAS    |
+-------------------------------+---------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val jslNer = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("jsl_ner")

val jslNerConverter = new NerConverter()
  .setInputCols(Array("sentence", "token", "jsl_ner"))
  .setOutputCol("ner_chunk")

val jslNerPipeline = new Pipeline()
  .setStages(Array(documentAssembler, 
                   sentenceDetector, 
                   tokenizer, 
                   wordEmbeddings, 
                   jslNer, 
                   jslNerConverter))

val text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl,  creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27."""

val data = Seq(text).toDF("text")

val result = jslNerPipeline.fit(data).transform(data)

+-----------------------------+----------------------------+
|chunk                        |ner_label                   |
+-----------------------------+----------------------------+
|28-year-old                  |Age                         |
|female                       |Gender                      |
|gestational diabetes mellitus|Diabetes                    |
|eight years prior            |RelativeDate                |
|type two diabetes mellitus   |Diabetes                    |
|T2DM                         |Diabetes                    |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder   |
|three years prior            |RelativeDate                |
|acute                        |Modifier                    |
|hepatitis                    |Disease_Syndrome_Disorder   |
|one-week                     |Duration                    |
|polyuria                     |Symptom                     |
|poor appetite                |Symptom                     |
|vomiting                     |Symptom                     |
|She                          |Gender                      |
|metformin                    |Drug_Ingredient             |
|glipizide                    |Drug_Ingredient             |
|dapagliflozin                |Drug_Ingredient             |
|T2DM                         |Diabetes                    |
|atorvastatin                 |Drug_Ingredient             |
|gemfibrozil                  |Drug_Ingredient             |
|HTG                          |Hyperlipidemia              |
|She                          |Gender                      |
|dapagliflozin                |Drug_Ingredient             |
|for six months               |Duration                    |
|dry oral mucosa              |Symptom                     |
|her                          |Gender                      |
|abdominal                    |External_body_part_or_region|
|tenderness                   |Symptom                     |
|guarding                     |Symptom                     |
|rigidity                     |Symptom                     |
|admission                    |Admission_Discharge         |
|serum glucose                |Test                        |
|111 mg/dl                    |Test_Result                 |
|creatinine                   |Test                        |
|0.4 mg/dL                    |Test_Result                 |
|triglycerides                |Triglycerides               |
|508 mg/dL                    |Test_Result                 |
|total cholesterol 122 mg/dL  |Total_Cholesterol           |
|venous pH                    |Test                        |
|7.27                         |Test_Result                 |
+-----------------------------+----------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val nerModel = FinanceNerModel.pretrained("finner_sec_conll", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter))

val text = """December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender" """

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------------------------+---------+
|chunk                     |ner_label|
+--------------------------+---------+
|SILICIUM DE PROVENCE S.A.S|ORG      |
|France                    |LOC      |
|Usine de Saint Auban      |LOC      |
|France                    |LOC      |
|Mr.Frank Wouters          |PER      |
|Borrower                  |PER      |
|EVERGREEN SOLAR INC       |ORG      |
|Delaware                  |LOC      |
|U.S.A                     |LOC      |
|Bartlett Street           |LOC      |
|Marlboro                  |LOC      |
|Massachusetts             |LOC      |
|U.S.A                     |LOC      |
|Richard Chleboski         |PER      |
|Lender                    |PER      |
+--------------------------+---------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val nerModel = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter))

val text = """EXCLUSIVE DISTRIBUTOR AGREEMENT ("Agreement") dated as April 15, 1994 by and between IMRS OPERATIONS INC., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as "Developer") and Delteq Pte Ltd, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD) with its principal place of business at 215 Henderson Road, #101-03 Henderson Industrial Park, Singapore 0315 (hereinafter referred to as "Distributor")."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+-------------------------------+---------+
|chunk                          |ner_label|
+-------------------------------+---------+
|EXCLUSIVE DISTRIBUTOR AGREEMENT|DOC      |
|April 15, 1994                 |EFFDATE  |
|IMRS OPERATIONS INC            |PARTY    |
|Developer                      |ALIAS    |
|Delteq Pte Ltd                 |PARTY    |
|Distributor                    |ALIAS    |
+-------------------------------+---------+

This named entity recognition annotator trains a generic NER model based on neural networks.

The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets.

For instantiated/pretrained models, see NerDLModel.

The training data should be a labeled Spark Dataset in CoNLL 2003 IOB format with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, and WORD_EMBEDDINGS, plus an additional label column of annotator type NAMED_ENTITY. Excluding the label, these columns can be produced with, for example, a SentenceDetector, a Tokenizer, and a WordEmbeddingsModel.

For extended examples of usage, see the Spark NLP Workshop (sections starting with Training a Clinical NER).
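To make the CoNLL 2003 IOB layout mentioned above more concrete, here is a minimal pure-Python sketch that reads a small hand-written sample and groups the B-/I- tags into chunks. It does not use Spark NLP; the helper names `read_conll` and `iob_chunks` and the sample labels are illustrative assumptions, not part of the library.

```python
from typing import List, Tuple

# Hand-written sample in CoNLL 2003 layout: token, POS tag, chunk tag, NER tag.
# The entity labels are illustrative only.
SAMPLE = """\
-DOCSTART- -X- -X- O

She PRP B-NP O
has VBZ B-VP O
gestational JJ B-NP B-Diabetes
diabetes NN I-NP I-Diabetes
mellitus NN I-NP I-Diabetes
. . O O

She PRP B-NP O
takes VBZ B-VP O
metformin NN B-NP B-Drug_Ingredient
. . O O
"""


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Split CoNLL text into sentences of (token, ner_label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        # Blank lines and -DOCSTART- markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def iob_chunks(sentence: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (chunk_text, entity) pairs."""
    chunks, tokens, entity = [], [], None
    for token, label in sentence:
        # An I- tag continues the open chunk only if the entity matches.
        if label.startswith("I-") and entity == label[2:]:
            tokens.append(token)
            continue
        if tokens:
            chunks.append((" ".join(tokens), entity))
            tokens, entity = [], None
        if label.startswith("B-"):
            tokens, entity = [token], label[2:]
    if tokens:
        chunks.append((" ".join(tokens), entity))
    return chunks


sentences = read_conll(SAMPLE)
for sentence in sentences:
    print(iob_chunks(sentence))
```

In a real training run the label column comes from the CoNLL reader used in the examples on this page; the sketch only shows how the B- and I- prefixes delimit multi-token entities.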

Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotator Type: NAMED_ENTITY

Python API: MedicalNerApproach Scala API: MedicalNerApproach Notebook: MedicalNerApproachNotebook

from johnsnowlabs import nlp, medical 

# First extract the prerequisites for the NerDLApproach
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")

clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")

# Then the training can start
nerTagger = medical.NerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(2)\
.setBatchSize(64)\
.setRandomSeed(0)\
.setVerbose(1)\
.setValidationSplit(0.2)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setOutputLogsPath('ner_logs')\
.setGraphFolder('medical_ner_graphs')\
.setEnableMemoryOptimizer(True)  # with limited memory and a large CoNLL file, set this to True to train batch by batch

pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
clinical_embeddings,
nerTagger
])

# We use the text and labels from the CoNLL dataset
conll = nlp.CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

pipelineModel = pipeline.fit(trainingData)
from johnsnowlabs import nlp, finance

# First extract the prerequisites for the NerDLApproach
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")

clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")

# Then the training can start
nerTagger = finance.NerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(2)\
.setBatchSize(64)\
.setRandomSeed(0)\
.setVerbose(1)\
.setValidationSplit(0.2)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setOutputLogsPath('ner_logs')\
.setGraphFolder('medical_ner_graphs')\
.setEnableMemoryOptimizer(True)  # with limited memory and a large CoNLL file, set this to True to train batch by batch

pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
clinical_embeddings,
nerTagger
])
from johnsnowlabs import nlp, legal

# First extract the prerequisites for the NerDLApproach
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")

clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")

# Then the training can start
nerTagger = legal.NerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(2)\
.setBatchSize(64)\
.setRandomSeed(0)\
.setVerbose(1)\
.setValidationSplit(0.2)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setOutputLogsPath('ner_logs')\
.setGraphFolder('medical_ner_graphs')\
.setEnableMemoryOptimizer(True)  # with limited memory and a large CoNLL file, set this to True to train batch by batch

pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
clinical_embeddings,
nerTagger
])

import spark.implicits._

// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

// Then the training can start
val nerTagger = new MedicalNerApproach()
.setInputCols(Array("sentence", "token", "embeddings"))
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(5)
.setLr(0.003f)
.setBatchSize(8)
.setRandomSeed(0)
.setVerbose(1)
.setEvaluationLogExtended(false)
.setEnableOutputLogs(false)
.setIncludeConfidence(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))

// We use the text and labels from the CoNLL dataset
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

val pipelineModel = pipeline.fit(trainingData)
import spark.implicits._

// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

// Then the training can start
val nerTagger = new FinanceNerApproach()
.setInputCols(Array("sentence", "token", "embeddings"))
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(5)
.setLr(0.003f)
.setBatchSize(8)
.setRandomSeed(0)
.setVerbose(1)
.setEvaluationLogExtended(false)
.setEnableOutputLogs(false)
.setIncludeConfidence(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))
import spark.implicits._

// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

// Then the training can start
val nerTagger = new LegalNerApproach()
.setInputCols(Array("sentence", "token", "embeddings"))
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(5)
.setLr(0.003f)
.setBatchSize(8)
.setRandomSeed(0)
.setVerbose(1)
.setEvaluationLogExtended(false)
.setEnableOutputLogs(false)
.setIncludeConfidence(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))

NerQuestionGenerator

Model

NerQuestionGenerator takes an NER chunk (obtained by, e.g., NerConverterInternal) and generates a question based on two entity types, a pronoun, and a strategy.

The question is generated in the form of [QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]. The generated question can be used by QuestionAnswerer or ZeroShotNer annotators to answer the question or find NER entities.

Parameters:

  • questionPronoun: Pronoun to be used in the question. E.g., ‘When’, ‘Where’, ‘Why’, ‘How’, ‘Who’, ‘What’.

  • strategyType: Strategy for the process, either Paired (default) or Combined.

  • questionMark: Whether to add a question mark at the end of the question.

  • entities1: List with the entity types of entities that appear first in the question.

  • entities2: List with the entity types of entities that appear second in the question.

All the parameters can be set using the corresponding set method in camel case. For example, .setQuestionMark(True).
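The question template described above can be sketched in plain Python. This only mirrors the [QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK] layout; the helper `format_question` is a hypothetical illustration and says nothing about how the annotator pairs entities under the Paired or Combined strategies.

```python
def format_question(pronoun: str, entity1: str, entity2: str,
                    question_mark: bool = True) -> str:
    """Join the template parts with single spaces (hypothetical helper)."""
    parts = [pronoun, entity1, entity2]
    if question_mark:
        parts.append("?")
    return " ".join(parts)


question = format_question("How is", "John's", "vital signs")
plain = format_question("How is", "John's", "heartbeat", question_mark=False)
print(question)
print(plain)
```

With the pronoun "How is", a Person chunk "John's", and a Condition chunk "vital signs", the sketch yields the same text as the first question in the example output on this page.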

Input Annotator Types: CHUNK

Output Annotator Type: DOCUMENT

Python API: NerQuestionGenerator Scala API: NerQuestionGenerator Notebook: NerQuestionGeneratorNotebook

from johnsnowlabs import nlp, medical
import json

entities = [
    {
    "label": "Person",
    "patterns": ["Jon", "John", "John's"]
    },
    {
    "label": "Organization",
    "patterns": ["St. Mary's Hospital", "St. Mary's"]
    },
    {
        "label": "Condition",
        "patterns": ["vital signs", "heartbeat", "oxygen saturation levels"]
    }
]

with open('./entities.json', 'w') as jsonfile:
    json.dump(entities, jsonfile)


document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

entity_ruler = nlp.EntityRulerApproach() \
    .setInputCols(["document"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./entities.json")\
    .setCaseSensitive(False)

qagenerator = medical.NerQuestionGenerator()\
    .setInputCols(["entity"])\
    .setOutputCol("question")\
    .setQuestionPronoun("How is")\
    .setEntities1(["Person"])\
    .setEntities2(["Condition"])\
    .setStrategyType("Paired")\
    .setQuestionMark(True)

prep_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    entity_ruler,
    qagenerator
])

example_text = """At St. Mary's Hospital, the healthcare team closely monitored John's vital signs with unwavering attention. They recorded his heartbeat and oxygen saturation levels, promptly addressing any deviations from normal. Their dedication and expertise at St. Mary's played a vital role in ensuring John's stability and fostering a swift recovery."""

df = spark.createDataFrame([[example_text]]).toDF("text")

result = prep_pipeline.fit(df).transform(df)

result.select("question").show(truncate=False)

## Result

+--------------------------------------------------------------------------------------------------------------------------------------------+
|question                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 62, 79, How is John's vital signs ?, {sentence -> 0}, []}, {document, 291, 134, How is John's heartbeat ?, {sentence -> 0}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

/* entities.json file
entities = [
    {
    "label": "Person",
    "patterns": ["Jon", "John", "John's"]
    },
    {
    "label": "Organization",
    "patterns": ["St. Mary's Hospital", "St. Mary's"]
    },
    {
        "label": "Condition",
        "patterns": ["vital signs", "heartbeat", "oxygen saturation levels"]
    }
]
*/

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val entity_ruler = new EntityRulerApproach()
    .setInputCols("document")
    .setOutputCol("entity")
    .setPatternsResource("./entities.json")
    .setCaseSensitive(false)

val qagenerator = new NerQuestionGenerator()
    .setInputCols("entity")
    .setOutputCol("question")
    .setQuestionPronoun("How is")
    .setEntities1("Person")
    .setEntities2("Condition")
    .setStrategyType("Paired")
    .setQuestionMark(true)

val prep_pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    entity_ruler, 
    qagenerator )) 

val test_data = Seq("""At St. Mary's Hospital, the healthcare team closely monitored John's vital signs with unwavering attention. They recorded his heartbeat and oxygen saturation levels, promptly addressing any deviations from normal. Their dedication and expertise at St. Mary's played a vital role in ensuring John's stability and fostering a swift recovery.""").toDF("text")

val res = prep_pipeline.fit(test_data).transform(test_data)

// Show results

+--------------------------------------------------------------------------------------------------------------------------------------------+
|question                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 62, 79, How is John's vital signs ?, {sentence -> 0}, []}, {document, 291, 134, How is John's heartbeat ?, {sentence -> 0}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------+

QuestionAnswering

Model

QuestionAnswering is a GPT-based model for answering questions given a context. Unlike span-based models, it generates the answers to the questions rather than selecting phrases from the given context. The model can answer various types of questions, including yes-no and full-text ones. Two types of questions are supported: "short" (producing yes/no/maybe answers) and "long" (producing full answers).

Parameters:

  • questionType: Question type, e.g. “short” or “long”. The question types depend on the model.

  • maxNewTokens: Maximum number of new tokens to generate, by default 30.

  • maxContextLength: Maximum length of context text

  • configProtoBytes: ConfigProto from tensorflow, serialized into byte array.

  • doSample: Whether or not to use sampling; use greedy decoding otherwise, by default False

  • topK: The number of highest probability vocabulary tokens to consider, by default 1

  • noRepeatNgramSize: The number of tokens that can’t be repeated in the same order. Useful for preventing loops. The default is 0.

  • ignoreTokenIds: A list of token ids which are ignored in the decoder’s output, by default []

  • randomSeed: Set to positive integer to get reproducible results, by default None.

  • customPrompt: Custom prompt template. Available variables: {QUESTION} and {CONTEXT}.
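As a rough illustration of the customPrompt parameter, the sketch below fills the two template variables with plain string replacement. The helper name render_prompt is hypothetical and is not part of the library, and how the annotator performs the substitution internally is an assumption; the template string is the one used in the Finance and Legal examples on this page.

```python
def render_prompt(template: str, question: str, context: str) -> str:
    """Fill the {QUESTION} and {CONTEXT} placeholders of a prompt template.

    Hypothetical helper for illustration only; not a Spark NLP API.
    """
    # Plain replacement is used instead of str.format so that curly braces
    # inside the context text do not raise formatting errors.
    return template.replace("{QUESTION}", question).replace("{CONTEXT}", context)


template = "question: {QUESTION} context: {CONTEXT}"
prompt = render_prompt(
    template,
    "Which company is referred to as 'Principal'?",
    "Co-Diagnostics, Inc. (herein referred to as \"Principal\")",
)
print(prompt)
```

Seeing the rendered prompt this way can help when checking that a template matches the format a given pretrained model expects.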

Available models can be found at the Models Hub

For more extended examples on document pre-processing, see the Spark NLP Workshop

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: MedicalQuestionAnswering Scala API: MedicalQuestionAnswering

from johnsnowlabs import nlp, medical

document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

med_qa = medical.MedicalQuestionAnswering.pretrained("medical_qa_biogpt","en","clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setMaxNewTokens(30)\
    .setTopK(1)\
    .setQuestionType("long") # "short"

pipeline = nlp.Pipeline(stages=[document_assembler, med_qa])

paper_abstract = [
    "In patients with Los Angeles (LA) grade C or D oesophagitis, a positive relationship has been established between the duration of intragastric acid suppression and healing.AIM: To determine whether there is an apparent optimal time of intragastric acid suppression for maximal healing of reflux oesophagitis. Post hoc analysis of data from a proof-of-concept, double-blind, randomized study of 134 adult patients treated with esomeprazole (10 or 40 mg od for 4 weeks) for LA grade C or D oesophagitis. A curve was fitted to pooled 24-h intragastric pH (day 5) and endoscopically assessed healing (4 weeks) data using piecewise quadratic logistic regression. Maximal reflux oesophagitis healing rates were achieved when intragastric pH>4 was achieved for approximately 50-70% (12-17 h) of the 24-h period. Acid suppression above this threshold did not yield further increases in healing rates."
]

question = ["Is there an optimal time of acid suppression for maximal healing?"]

data = spark.createDataFrame([ [paper_abstract[0],  question[0]] ]).toDF("context","question")

data.show(truncate = 60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                     context|                                                    question|
+------------------------------------------------------------+------------------------------------------------------------+
|In patients with Los Angeles (LA) grade C or D oesophagit...|Is there an optimal time of acid suppression for maximal ...|
+------------------------------------------------------------+------------------------------------------------------------+

result = pipeline.fit(data).transform(data)

result.selectExpr("document_question.result as Question", "answer.result as Long_Answer").show(truncate=False)

+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|Question                                                           |Long_Answer                                                                                                                                          |
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|[Is there an optimal time of acid suppression for maximal healing?]|[in patients with reflux oesophagitis, maximal healing rates are obtained when intragastric pH is achieved for approximately 50 - 70 % ( 12 - 17 h )]|
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

fin_qa = finance.QuestionAnswering.pretrained("finqa_flant5_finetuned","en","finance/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt("question: {QUESTION} context: {CONTEXT}")\
    .setMaxNewTokens(100)\
    .setOutputCol("answer")

pipeline = nlp.Pipeline(stages=[document_assembler, fin_qa])

context = """EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""

questions = ["""Which company is referred to as 'Principal' in the Distributor Agreement?""",
             """What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?""",
             """What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"""]

data = spark.createDataFrame(
    [
        [context,  questions[0]],
        [context,  questions[1]],
        [context,  questions[2]],
    ]
).toDF("context","question")

data.show(truncate = 80)

+------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                          question|                                                                         context|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+
|   What are the key components of the business strategy described?|Our business strategy has been to develop data processing and product technol...|
|What is the immediate strategy for scaling the IntentKey platform?|Our business strategy has been to develop data processing and product technol...|
|How does the company aim to provide differentiation in the market?|Our business strategy has been to develop data processing and product technol...|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+

result = pipeline.fit(data).transform(data)

result.select('question', 'answer.result').show(truncate=False)

+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question                                                          |result                                                                                                                                                                                                                                                                                                            |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|What are the key components of the business strategy described?   |[The key components of the business strategy described are proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. . . ]|
|What is the immediate strategy for scaling the IntentKey platform?|[The immediate strategy for scaling the IntentKey platform is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. ]                                               |
|How does the company aim to provide differentiation in the market?|[The company aims to provide differentiation through the AI analytics and data products they own and protect through patents. ]                                                                                                                                                                                   |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

context = ["""EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""]

questions = ["""Which company is referred to as 'Principal' in the Distributor Agreement?""",
             """What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?""",
             """What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"""]

data = spark.createDataFrame(
    [
        [context[0],  questions[0]],
        [context[0],  questions[1]],
        [context[0],  questions[2]],
    ]
).toDF("context","question")

data.show(truncate = 80)

+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                                         context|                                                                        question|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|       Which company is referred to as 'Principal' in the Distributor Agreement?|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the date of the distributor agreement between Co-Diagnostics, Inc. an...|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the Territory in which the Distributor has non-exclusive rights to se...|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+

document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

leg_qa = legal.QuestionAnswering.pretrained("legqa_flant5_finetuned","en","legal/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt("question: {QUESTION} context: {CONTEXT}")\
    .setMaxNewTokens(40)\
    .setTopK(3)\
    .setOutputCol("answer")

pipeline = nlp.Pipeline(stages=[document_assembler, leg_qa])

result = pipeline.fit(data).transform(data)

result.selectExpr("document_question.result as Question", "answer.result as Answer").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|Question                                                                                                                          |Answer                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|[Which company is referred to as 'Principal' in the Distributor Agreement?]                                                       |[Co-Diagnostics, Inc. is referred to as 'Principal' in the Distributor Agreement. ]                                                     |
|[What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?]                  |[The date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc. is the 19th day of March, 2020. ]|
|[What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?]|[The Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement is Romania. ]  |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val medQA = MedicalQuestionAnswering.pretrained("medical_qa_biogpt", "en", "clinical/models")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setMaxNewTokens(30)
  .setTopK(1)
  .setQuestionType("long") // "short"

val pipeline = new Pipeline().setStages(Array(
                                              documentAssembler, 
                                              medQA))

val paperAbstract = "In patients with Los Angeles (LA) grade C or D oesophagitis, a positive relationship has been established between the duration of intragastric acid suppression and healing.AIM: To determine whether there is an apparent optimal time of intragastric acid suppression for maximal healing of reflux oesophagitis. Post hoc analysis of data from a proof-of-concept, double-blind, randomized study of 134 adult patients treated with esomeprazole (10 or 40 mg od for 4 weeks) for LA grade C or D oesophagitis. A curve was fitted to pooled 24-h intragastric pH (day 5) and endoscopically assessed healing (4 weeks) data using piecewise quadratic logistic regression. Maximal reflux oesophagitis healing rates were achieved when intragastric pH>4 was achieved for approximately 50-70% (12-17 h) of the 24-h period. Acid suppression above this threshold did not yield further increases in healing rates."

val question = "Is there an optimal time of acid suppression for maximal healing?"

val data = Seq((paperAbstract, question)).toDF("context", "question")

+------------------------------------------------------------+------------------------------------------------------------+
|                                                     context|                                                    question|
+------------------------------------------------------------+------------------------------------------------------------+
|In patients with Los Angeles (LA) grade C or D oesophagit...|Is there an optimal time of acid suppression for maximal ...|
+------------------------------------------------------------+------------------------------------------------------------+

val result = pipeline.fit(data).transform(data)


+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|Question                                                           |Long_Answer                                                                                                                                          |
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|[Is there an optimal time of acid suppression for maximal healing?]|[in patients with reflux oesophagitis, maximal healing rates are obtained when intragastric pH is achieved for approximately 50 - 70 % ( 12 - 17 h )]|
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val finQa = FinanceQuestionAnswering.pretrained("finqa_flant5_finetuned", "en", "finance/models")
  .setInputCols(Array("document_question", "document_context"))
  .setCustomPrompt("question: {QUESTION} context: {CONTEXT}")
  .setMaxNewTokens(100)
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(documentAssembler, finQa))

val context = """EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""

val questions = Seq(
  "Which company is referred to as 'Principal' in the Distributor Agreement?",
  "What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?",
  "What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"
)

val data = questions.map(q => (context, q)).toDF("context", "question")

+------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                          question|                                                                         context|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+
|   What are the key components of the business strategy described?|Our business strategy has been to develop data processing and product technol...|
|What is the immediate strategy for scaling the IntentKey platform?|Our business strategy has been to develop data processing and product technol...|
|How does the company aim to provide differentiation in the market?|Our business strategy has been to develop data processing and product technol...|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+

val result = pipeline.fit(data).transform(data)

+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question                                                          |result                                                                                                                                                                                                                                                                                                            |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|What are the key components of the business strategy described?   |[The key components of the business strategy described are proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. . . ]|
|What is the immediate strategy for scaling the IntentKey platform?|[The immediate strategy for scaling the IntentKey platform is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. ]                                               |
|How does the company aim to provide differentiation in the market?|[The company aims to provide differentiation through the AI analytics and data products they own and protect through patents. ]                                                                                                                                                                                   |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
val context = Seq("""EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""
)

val questions = Seq(
  """Which company is referred to as 'Principal' in the Distributor Agreement?""",
  """What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?""",
  """What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"""
)

// Build one (context, question) row per question; requires spark.implicits._ in scope
val data = context.flatMap(c => questions.map(q => (c, q))).toDF("context", "question")


+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                                         context|                                                                        question|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|       Which company is referred to as 'Principal' in the Distributor Agreement?|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the date of the distributor agreement between Co-Diagnostics, Inc. an...|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the Territory in which the Distributor has non-exclusive rights to se...|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val legQA = LegalQuestionAnswering.pretrained("legqa_flant5_finetuned", "en", "legal/models")
  .setInputCols(Array("document_question", "document_context"))
  .setCustomPrompt("question: {QUESTION} context: {CONTEXT}")
  .setMaxNewTokens(40)
  .setTopK(3)
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(documentAssembler, legQA))

val result = pipeline.fit(data).transform(data)

+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|Question                                                                                                                          |Answer                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|[Which company is referred to as 'Principal' in the Distributor Agreement?]                                                       |[Co-Diagnostics, Inc. is referred to as 'Principal' in the Distributor Agreement. ]                                                     |
|[What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?]                  |[The date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc. is the 19th day of March, 2020. ]|
|[What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?]|[The Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement is Romania. ]  |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+

RENerChunksFilter

Model

The RENerChunksFilter annotator filters the desired relation pairs (defined by the parameter relationPairs) and stores them in the output column. Filtering the possible relations is useful for focusing the analysis on a specific use case (e.g., checking adverse drug reactions and drug relations); the filtered pairs can then serve as input for further analysis with a pretrained RelationExtractionDLModel.

Parameters:

  • maxSyntacticDistance (Int): Maximum syntactic distance between a pair of named entities to consider them as a relation. Increasing this value will increase recall, but also increase the number of false positives.

  • relationPairs (List[Str]): List of dash-separated pairs of named entities. For example, [“Biomarker-RelativeDay”] will process all relations between entities of type “Biomarker” and “RelativeDay”.

  • relationPairsCaseSensitive (Boolean): Determines whether relation pairs are case sensitive.
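To make maxSyntacticDistance more concrete, the sketch below measures the distance between two tokens as the number of edges on the path that connects them in a dependency tree. This is an assumption about what "syntactic distance" means, used for illustration only; it is not the library's implementation. The function name syntactic_distance, the head-index encoding, and the toy parse are all hypothetical.

```python
from collections import deque


def syntactic_distance(heads, i, j):
    """Number of dependency edges on the shortest path between tokens i and j.

    heads[k] is the index of token k's head, or -1 for the root.
    (Hypothetical encoding, chosen for this sketch.)
    """
    # Build an undirected adjacency list from the head array.
    neighbours = {k: set() for k in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            neighbours[child].add(head)
            neighbours[head].add(child)

    # Breadth-first search from i until j is reached.
    queue = deque([(i, 0)])
    seen = {i}
    while queue:
        node, dist = queue.popleft()
        if node == j:
            return dist
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two tokens are not connected


# Toy parse (hand-written heads, not produced by a real parser).
tokens = ["man", "taking", "naproxen", "presented", "with", "bullae"]
heads = [3, 0, 1, -1, 5, 3]

distance = syntactic_distance(heads, tokens.index("naproxen"), tokens.index("bullae"))
print(distance)
```

Under this reading, a pair of entities is kept only when its distance does not exceed the threshold, which is why a larger value admits more candidate pairs and with them more false positives.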

For example, the ner_clinical NER model can identify PROBLEM, TEST, and TREATMENT entities. With RENerChunksFilter, one can keep only the pairs of PROBLEM and TREATMENT entities and drop pairs involving other entity types, so that downstream analysis focuses on the associations between clinical problems and treatments.
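The PROBLEM/TREATMENT example can be sketched in plain Python. The code below is a simplified illustration of filtering by label pairs, not the annotator itself: the function filter_chunk_pairs, the sample chunks, and the choice to treat each pair as ordered (entity1-entity2) are assumptions made for this sketch, as is the default value of case_sensitive.

```python
from itertools import permutations


def filter_chunk_pairs(chunks, relation_pairs, case_sensitive=False):
    """Keep ordered chunk pairs whose labels match one of relation_pairs.

    chunks: list of (text, label) tuples.
    relation_pairs: list of dash-separated label pairs, e.g. "problem-treatment".
    """
    # Normalise labels unless the comparison is case sensitive.
    norm = (lambda s: s) if case_sensitive else str.lower
    allowed = {tuple(norm(pair).split("-", 1)) for pair in relation_pairs}

    kept = []
    for (text1, label1), (text2, label2) in permutations(chunks, 2):
        if (norm(label1), norm(label2)) in allowed:
            kept.append((text1, text2))
    return kept


# Hypothetical chunks with ner_clinical-style labels.
chunks = [("chest pain", "PROBLEM"), ("ECG", "TEST"), ("aspirin", "TREATMENT")]

pairs = filter_chunk_pairs(chunks, ["problem-treatment", "treatment-problem"])
print(pairs)

# With case-sensitive matching, lowercase pairs no longer match uppercase labels.
strict_pairs = filter_chunk_pairs(
    chunks, ["problem-treatment", "treatment-problem"], case_sensitive=True
)
print(strict_pairs)
```

Every pair involving the TEST chunk is dropped, so only the problem and treatment combinations would reach a downstream relation extraction model.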

Input Annotator Types: CHUNK, DEPENDENCY

Output Annotator Type: CHUNK

Python API: RENerChunksFilter Scala API: RENerChunksFilter Notebook: RENerChunksFilter

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")

ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade", "ade-drug"])

ade_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentence"]) \
    .setPredictionThreshold(0.5)\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)

+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end|   chunk1|entity1|entity2_begin|entity2_end|                                                   chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|       0|           25|         32| naproxen|   DRUG|          137|        148|                                             tense bullae|    ADE|       1| 0.9989047|
|       0|           25|         32| naproxen|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1| 0.9989704|
|       0|           87|         95|oxaprozin|   DRUG|          137|        148|                                             tense bullae|    ADE|       1|0.99895453|
|       0|           87|         95|oxaprozin|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = finance.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_dates")

ner_converter_date = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_dates"])\
    .setOutputCol("ner_chunk_date")

ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_orgs")

ner_converter_org = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_orgs"])\
    .setOutputCol("ner_chunk_org")

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols('ner_chunk_org', "ner_chunk_date")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE"])\
    .setMaxSyntacticDistance(10)

reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relation")\
    .setPredictionThreshold(0.1)

pipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_date,
        ner_converter_date,
        ner_model_org,
        ner_converter_org,
        chunk_merger,
        pos,
        dependency_parser,
        re_filter,
        reDL])

text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relation.metadata, result.relation.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)

+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                 chunk1|entity1|entity2_begin|entity2_end|         chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|       0|           16|         22|                Cadence|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|0.99687237|
|       0|           66|         80|        AWR Corporation|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|  0.993112|
|       0|           94|        116|Integrand Software, Inc|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date| 0.9741451|
|       0|           66|         80|        AWR Corporation|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|  0.997124|
|       0|           94|        116|Integrand Software, Inc|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|0.99910504|
|       0|           94|        116|Integrand Software, Inc|    ORG|           66|         80|AWR Corporation|    ORG|     was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

text_splitter = legal.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pos_tagger = nlp.PerceptronModel().pretrained() \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

dependency_parser = nlp.DependencyParserModel() \
    .pretrained("dependency_conllu", "en") \
    .setInputCols(["sentence", "pos_tags", "token"]) \
    .setOutputCol("dependencies")

re_filter = legal.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(['PARTY-ALIAS', 'DOC-PARTY', 'DOC-EFFDATE'])

re_model = legal.RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")\
    .setPredictionThreshold(0.1)\
    .setInputCols(["re_ner_chunks", "sentence"])\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        pos_tagger,
        dependency_parser,
        re_filter,
        re_model
        ])

text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)

+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
|sentence|entity1_begin|entity1_end|                         chunk1|entity1|entity2_begin|entity2_end|           chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
|       0|            5|         35|INTELLECTUAL PROPERTY AGREEMENT|    DOC|           69|         85|December 31, 2018|EFFDATE| dated_as| 0.9856822|
|       0|          141|        163|        Armstrong Flooring, Inc|  PARTY|          192|        197|           Seller|  ALIAS|has_alias|0.89620054|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+

import spark.implicits._

val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val wordsEmbedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val posTagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos_tags")

val nerTagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_tags")

val nerChunker = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_tags"))
  .setOutputCol("ner_chunks")

val dependencyParser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos_tags", "token"))
  .setOutputCol("dependencies")

val adeReNerChunkFilter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunks", "dependencies"))
  .setOutputCol("re_ner_chunks")
  .setMaxSyntacticDistance(10)
  .setRelationPairs(Array("drug-ade", "ade-drug"))

val adeReModel = RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models")
  .setInputCols(Array("re_ner_chunks", "sentence"))
  .setPredictionThreshold(0.5)
  .setOutputCol("relations")

val pipeline = new Pipeline()
  .setStages(Array(
    documenter,
    sentencer,
    tokenizer,
    wordsEmbedder,
    posTagger,
    nerTagger,
    nerChunker,
    dependencyParser,
    adeReNerChunkFilter,
    adeReModel
  ))

val text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end|   chunk1|entity1|entity2_begin|entity2_end|                                                   chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|       0|           25|         32| naproxen|   DRUG|          137|        148|                                             tense bullae|    ADE|       1| 0.9989047|
|       0|           25|         32| naproxen|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1| 0.9989704|
|       0|           87|         95|oxaprozin|   DRUG|          137|        148|                                             tense bullae|    ADE|       1|0.99895453|
|       0|           87|         95|oxaprozin|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter() 
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model_date = FinanceNerModel.pretrained("finner_sec_dates", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_dates")

val ner_converter_date = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_dates"))
  .setOutputCol("ner_chunk_date")

val ner_model_org = FinanceNerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_orgs")

val ner_converter_org = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_orgs"))
  .setOutputCol("ner_chunk_org")

val chunk_merger = new ChunkMergeApproach()
  .setInputCols(Array("ner_chunk_org", "ner_chunk_date"))
  .setOutputCol("ner_chunk")

val pos = PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos", "token"))
  .setOutputCol("dependencies")

val re_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunk")
  .setRelationPairs(Array("ORG-ORG", "ORG-DATE"))
  .setMaxSyntacticDistance(10)

val reDL = RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relation")
  .setPredictionThreshold(0.1)

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    text_splitter,
    tokenizer,
    embeddings,
    ner_model_date,
    ner_converter_date,
    ner_model_org,
    ner_converter_org,
    chunk_merger,
    pos,
    dependency_parser,
    re_filter,
    reDL
  ))

val text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                 chunk1|entity1|entity2_begin|entity2_end|         chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|       0|           16|         22|                Cadence|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|0.99687237|
|       0|           66|         80|        AWR Corporation|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|  0.993112|
|       0|           94|        116|Integrand Software, Inc|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date| 0.9741451|
|       0|           66|         80|        AWR Corporation|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|  0.997124|
|       0|           94|        116|Integrand Software, Inc|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|0.99910504|
|       0|           94|        116|Integrand Software, Inc|    ORG|           66|         80|AWR Corporation|    ORG|     was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
  .setMaxSentenceLength(512)

val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pos_tagger = PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos_tags")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos_tags", "token"))
  .setOutputCol("dependencies")

val re_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunks")
  .setMaxSyntacticDistance(10)
  .setRelationPairs(Array("PARTY-ALIAS", "DOC-PARTY", "DOC-EFFDATE"))

val re_model = RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")
  .setPredictionThreshold(0.1)
  .setInputCols(Array("re_ner_chunks", "sentence"))
  .setOutputCol("relations")

val pipeline = new Pipeline()
  .setStages(Array(
    document_assembler,
    text_splitter,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    pos_tagger,
    dependency_parser,
    re_filter,
    re_model
  ))

val text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
|sentence|entity1_begin|entity1_end|                         chunk1|entity1|entity2_begin|entity2_end|           chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
|       0|            5|         35|INTELLECTUAL PROPERTY AGREEMENT|    DOC|           69|         85|December 31, 2018|EFFDATE| dated_as| 0.9856822|
|       0|          141|        163|        Armstrong Flooring, Inc|  PARTY|          192|        197|           Seller|  ALIAS|has_alias|0.89620054|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+

ReIdentification

Model

This annotator re-identifies entities that were obfuscated by DeIdentification. It requires the outputs of the de-identification step as input: the de-identified document and the de-identification mappings column, whose name is set with DeIdentification.setMappingsColumn.
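The round trip described above can be illustrated without Spark. The following is a minimal sketch under stated assumptions, not the library's algorithm: the helper names span, mask, and reidentify, the (start, end, original) mapping layout, and the end-exclusive offsets are all hypothetical choices made for the example.

```python
def span(text, value, label):
    """Locate value in text and return a (start, end, label) tuple (end exclusive)."""
    start = text.index(value)
    return (start, start + len(value), label)


def mask(text, entities):
    """Replace each entity span with <LABEL> and record how to undo it."""
    pieces, mappings, cursor, out_len = [], [], 0, 0
    for start, end, label in sorted(entities):
        before = text[cursor:start]
        placeholder = f"<{label}>"
        masked_start = out_len + len(before)
        pieces += [before, placeholder]
        # Remember where the placeholder sits in the masked text
        # and which original string it replaced.
        mappings.append((masked_start, masked_start + len(placeholder), text[start:end]))
        out_len = masked_start + len(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mappings


def reidentify(masked, mappings):
    """Restore the original text from the masked text and its mappings."""
    pieces, cursor = [], 0
    for start, end, original in mappings:
        pieces += [masked[cursor:start], original]
        cursor = end
    pieces.append(masked[cursor:])
    return "".join(pieces)


text = "Record date : 2093-01-13 , David Hale , M.D ."
entities = [span(text, "2093-01-13", "DATE"), span(text, "David Hale", "NAME")]

masked, mappings = mask(text, entities)
restored = reidentify(masked, mappings)
print(masked)
print(restored)
```

The sketch shows why the mappings are needed: the masked text alone does not carry enough information to recover the original values.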

Input Annotator Types: DOCUMENT,CHUNK

Output Annotator Type: DOCUMENT

Python API: ReIdentification Scala API: ReIdentification Notebook: ReIdentificationNotebook

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

deidentification = medical.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setReturnEntityMappings(True) #  return a new column to save the mappings between the mask/obfuscated entities and original entities.
    #.setMappingsColumn("MappingCol") # change the name of the column, 'aux' is default

pipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      deidentification])

text = """
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora ,
MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
"""
data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.deidentified.result)).alias("cols"))\
      .select(F.expr("cols['0']").alias("sentence"), 
              F.expr("cols['1']").alias("deidentified")).show(truncate = False)

+-----------------------------------------------------------------------+-------------------------------------------------------+
|sentence                                                               |deidentified                                           |
+-----------------------------------------------------------------------+-------------------------------------------------------+
|Record date : 2093-01-13 , David Hale , M.D .                          |Record date : <DATE> , <NAME> , M.D .                  |
|, Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .              |, Name : <NAME> , MR # <ID> Date : <DATE> .            |
|PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .             |PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .|<LOCATION> , <LOCATION> , Phone <CONTACT> .            |
+-----------------------------------------------------------------------+-------------------------------------------------------+

reIdentification = medical.ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

reid_result = reIdentification.transform(result)

reid_result.select('original.result').show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Record date : 2093-01-13 , David Hale , M.D ., , Name : Hendrickson Ora ,MR # 7194334 Date : 01/13/93 ., PCP : Oliveira , 25 years-old , Record date : 2079-11-09 ., Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained('finner_sec_10k_summary', 'en', 'finance/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

deidentification = finance.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setReturnEntityMappings(True) #  return a new column to save the mappings between the mask/obfuscated entities and original entities. Required for "ReIdentification"
    #.setMappingsColumn("MappingCol") # change the name of the column, 'aux' is default

pipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      deidentification])

text= """
Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("deidentified.result").show(truncate = False)

+-------------------------------------------------------------------------------------------------------------+
|result                                                                                                       |
+-------------------------------------------------------------------------------------------------------------+
|[Commission file number <CFN> 
_____________________________________
 
<ORG>., (Exact name of registrant as specified in its charter)
____________________________________ 
<STATE>
 
<IRS>
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S., EmployerIdentification No., )
<ADDRESS>
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
<PHONE> 
(Registrants Telephone Number, including Area Code) 
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
<TITLE_CLASS>, <TITLE_CLASS_VALUE> par value per share
<TICKER>
<STOCK_EXCHANGE>
Securities registered pursuant to Section 12, (g) of the Act:]|
+-------------------------------------------------------------------------------------------------------------+

reIdentification = finance.ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

reid_result = reIdentification.transform(result)

reid_result.select('original.result').show(truncate=False)

+---------------------------------------------------------------------------------------------------+
|result                                                                                             |
+---------------------------------------------------------------------------------------------------+
|[Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC., (Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S., EmployerIdentification No., )
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)<(408)
-943-1234
(Registrants Telephone Number, including Area Code) 
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12, (g) of the Act:]|
+---------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties_lg", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"}) # "ALIAS" are secondary names of companies, so let's extract them also as PARTY

deidentification = legal.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setReturnEntityMappings(True)  # return a new column that stores the mappings between the masked/obfuscated entities and the original entities; required for ReIdentification
    #.setMappingsColumn("MappingCol") # change the name of the mappings column; the default is 'aux'

pipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      deidentification])

text = """THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware"""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("deidentified.result").show(truncate = False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS <DOC> ("Agreement") is made and entered into as of <EFFDATE> , by and between <PARTY>. a limited liability company organized and existing under the laws of the State of Delaware]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

reIdentification = legal.ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

reid_result = reIdentification.transform(result)

reid_result.select('original.result').show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val clinicalNer = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val deidentification = new DeIdentification()
  .setInputCols(Array("sentence", "token", "ner_chunk"))
  .setOutputCol("deidentified")
  .setMode("mask")
  .setReturnEntityMappings(true)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    wordEmbeddings,
    clinicalNer,
    nerConverter,
    deidentification
  ))

val text = """
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora ,
MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
"""

val data = Seq((text)).toDF("text")

val result = pipeline.fit(data).transform(data)

+-----------------------------------------------------------------------+-------------------------------------------------------+
|sentence                                                               |deidentified                                           |
+-----------------------------------------------------------------------+-------------------------------------------------------+
|Record date : 2093-01-13 , David Hale , M.D .                          |Record date : <DATE> , <NAME> , M.D .                  |
|, Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .              |, Name : <NAME> , MR # <ID> Date : <DATE> .            |
|PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .             |PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .|<LOCATION> , <LOCATION> , Phone <CONTACT> .            |
+-----------------------------------------------------------------------+-------------------------------------------------------+

val reIdentification = new ReIdentification()
  .setInputCols(Array("aux", "deidentified"))
  .setOutputCol("original")

val reidResult = reIdentification.transform(result)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Record date : 2093-01-13 , David Hale , M.D ., , Name : Hendrickson Ora ,MR # 7194334 Date : 01/13/93 ., PCP : Oliveira , 25 years-old , Record date : 2079-11-09 ., Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val nerModel = FinanceNerModel.pretrained("finner_sec_10k_summary", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val deidentification = new DeIdentification()
  .setInputCols(Array("sentence", "token", "ner_chunk"))
  .setOutputCol("deidentified")
  .setMode("mask")
  .setReturnEntityMappings(true)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    nerModel,
    nerConverter,
    deidentification
  ))

val text = """Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+-------------------------------------------------------------------------------------------------------------+
|result                                                                                                       |
+-------------------------------------------------------------------------------------------------------------+
|[Commission file number <CFN> 
_____________________________________
 
<ORG>., (Exact name of registrant as specified in its charter)
____________________________________ 
<STATE>
 
<IRS>
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S., EmployerIdentification No., )
<ADDRESS>
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
<PHONE> 
(Registrants Telephone Number, including Area Code) 
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
<TITLE_CLASS>, <TITLE_CLASS_VALUE> par value per share
<TICKER>
<STOCK_EXCHANGE>
Securities registered pursuant to Section 12, (g) of the Act:]|
+-------------------------------------------------------------------------------------------------------------+

val reIdentification = new ReIdentification()
  .setInputCols(Array("aux", "deidentified"))
  .setOutputCol("original")

val reidResult = reIdentification.transform(result)

+---------------------------------------------------------------------------------------------------+
|result                                                                                             |
+---------------------------------------------------------------------------------------------------+
|[Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC., (Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S., EmployerIdentification No., )
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)<(408)
-943-1234
(Registrants Telephone Number, including Area Code) 
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12, (g) of the Act:]|
+---------------------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val legalNer = LegalNerModel.pretrained("legner_contract_doc_parties_lg", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setReplaceLabels(Map("ALIAS" -> "PARTY"))

val deidentification = new DeIdentification()
  .setInputCols(Array("sentence", "token", "ner_chunk"))
  .setOutputCol("deidentified")
  .setMode("mask")
  .setReturnEntityMappings(true)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    legalNer,
    nerConverter,
    deidentification
  ))

val text = "THIS STRATEGIC ALLIANCE AGREEMENT (\"Agreement\") is made and entered into as of December 14, 2016, by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware"

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS <DOC> ("Agreement") is made and entered into as of <EFFDATE> , by and between <PARTY>. a limited liability company organized and existing under the laws of the State of Delaware]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

val reIdentification = new ReIdentification()
  .setInputCols(Array("aux", "deidentified"))
  .setOutputCol("original")

val reidResult = reIdentification.transform(result)



+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
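
Conceptually, the round trip in these examples works because masking records where each placeholder was written and which text it replaced. The plain-Python sketch below illustrates that idea only. It is not how DeIdentification or ReIdentification are implemented, and `mask_chunks`, `restore_chunks`, `chunk`, and the mapping layout are hypothetical names chosen for this illustration.

```python
def mask_chunks(text, chunks):
    """Replace each (begin, end_inclusive, label) chunk with <label> and record a mapping."""
    pieces, mappings, cursor, masked_len = [], [], 0, 0
    for begin, end, label in sorted(chunks):
        prefix = text[cursor:begin]
        placeholder = f"<{label}>"
        pieces += [prefix, placeholder]
        masked_len += len(prefix)
        # Remember where the placeholder sits in the masked text and what it replaced.
        mappings.append({
            "masked_begin": masked_len,
            "placeholder": placeholder,
            "original": text[begin:end + 1],
        })
        masked_len += len(placeholder)
        cursor = end + 1
    pieces.append(text[cursor:])
    return "".join(pieces), mappings


def restore_chunks(masked, mappings):
    """Put the original text back using the mappings produced by mask_chunks."""
    pieces, cursor = [], 0
    for m in mappings:
        pieces += [masked[cursor:m["masked_begin"]], m["original"]]
        cursor = m["masked_begin"] + len(m["placeholder"])
    pieces.append(masked[cursor:])
    return "".join(pieces)


def chunk(text, value, label):
    """Locate `value` in `text` and return an inclusive (begin, end, label) span."""
    begin = text.index(value)
    return (begin, begin + len(value) - 1, label)


text = "Record date : 2093-01-13 , David Hale , M.D ."
chunks = [chunk(text, "2093-01-13", "DATE"), chunk(text, "David Hale", "NAME")]

masked, mappings = mask_chunks(text, chunks)
restored = restore_chunks(masked, mappings)
print(masked)    # Record date : <DATE> , <NAME> , M.D .
print(restored)  # Record date : 2093-01-13 , David Hale , M.D .
```

Without the stored mappings the masked text cannot be reversed, which is why the pipelines above enable setReturnEntityMappings before ReIdentification is applied.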

RegexMatcherInternal

Model

The RegexMatcherInternal annotator matches a set of regular expressions against the input text and tags each match with the entity associated with the matching rule. It is typically used to link well-defined patterns, such as dates, to predefined entity labels.

Rules pair a regular expression with an entity label, so they are easy to customize. They can be set directly with the setRules method together with a delimiter, or loaded from an external file with the setExternalRules method.

The matching strategy (MATCH_FIRST, MATCH_ALL, or MATCH_COMPLETE) controls how matches are handled. The annotator takes DOCUMENT annotations as input and produces CHUNK annotations as output.

A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be "\\d{4}\\/\\d\\d\\/\\d\\d,date" which will match strings like "1970/01/01" to the identifier "date".
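
As a rough illustration of the rule format, the sketch below splits a rule into its pattern and identifier and applies it with Python's built-in `re` module. It is not the annotator's implementation: `parse_rule` is a hypothetical helper, and splitting on the last delimiter is an assumption made so that the pattern itself may contain the delimiter character.

```python
import re


def parse_rule(rule, delimiter=","):
    """Split 'pattern<delimiter>identifier' into a compiled regex and its identifier."""
    # Assumption: the identifier follows the LAST delimiter in the rule.
    pattern, identifier = rule.rsplit(delimiter, 1)
    return re.compile(pattern), identifier


regex, identifier = parse_rule(r"\d{4}\/\d\d\/\d\d,date")
match = regex.search("Born on 1970/01/01 in Lisbon.")
print(match.group(0), "->", identifier)  # 1970/01/01 -> date
```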

Parameters:

  • strategy: Matching strategy, one of MATCH_FIRST, MATCH_ALL, or MATCH_COMPLETE (default: MATCH_ALL).
  • rules: Regex rules, each pairing a pattern with an identifier.
  • delimiter: Delimiter that separates the pattern from the identifier in rules provided with setRules.
  • externalRules: External resource (file) that contains the rules; requires a delimiter in its options.
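
The three strategies can be pictured with Python's `re` module. This is only a sketch: it assumes the strategies behave as in the open-source RegexMatcher (MATCH_FIRST keeps the first match of each rule, MATCH_ALL keeps every match, MATCH_COMPLETE keeps a match only when the whole input matches the pattern), and `apply_rules` is a hypothetical helper, not part of the library.

```python
import re


def apply_rules(text, rules, strategy="MATCH_ALL"):
    """Return (matched_text, identifier) pairs for each rule under the given strategy."""
    results = []
    for pattern, identifier in rules:
        regex = re.compile(pattern)
        if strategy == "MATCH_ALL":
            found = [m.group(0) for m in regex.finditer(text)]
        elif strategy == "MATCH_FIRST":
            first = regex.search(text)
            found = [first.group(0)] if first else []
        elif strategy == "MATCH_COMPLETE":
            # Assumed semantics: the pattern must cover the entire input.
            full = regex.fullmatch(text)
            found = [full.group(0)] if full else []
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
        results.extend((value, identifier) for value in found)
    return results


rules = [(r"\d{4}-\d{2}-\d{2}", "DATE")]
text = "Admitted 2093-01-13, discharged 2093-01-20."

print(apply_rules(text, rules, "MATCH_ALL"))               # both dates
print(apply_rules(text, rules, "MATCH_FIRST"))             # only the first date
print(apply_rules(text, rules, "MATCH_COMPLETE"))          # empty: the text is more than a date
print(apply_rules("2093-01-13", rules, "MATCH_COMPLETE"))  # the single date
```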

See Spark NLP Workshop for more examples of usage.

Input Annotator Types: DOCUMENT

Output Annotator Type: CHUNK

Show Example

from johnsnowlabs import nlp, medical

text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

data = spark.createDataFrame([[text]]).toDF("text")

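import os

# Raw string, so the regex backslashes are not treated as escape sequences
rules = r'''
(\d{1,3}\.){3}\d{1,3}~IPADDR
\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}~DATE
'''

# Make sure the target directory exists before writing the rules file
os.makedirs('./rules', exist_ok=True)
with open('./rules/regex_rules.txt', 'w') as f:
    f.write(rules)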

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

regex_matcher_internal = medical.RegexMatcherInternal()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='./rules/regex_rules.txt', delimiter='~')

nlpPipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        regex_matcher_internal
])

result = nlpPipeline.fit(data).transform(data)

# result
+--------------+-----+---+---------+
|  regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
|    2093-01-13|   38| 47|     DATE|
|203.120.223.13|   97|110|   IPADDR|
|      01/13/93|  188|195|     DATE|
+--------------+-----+---+---------+
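
The offsets in the table can be cross-checked without Spark. The sketch below applies the same two patterns with Python's `re` module. It only illustrates how `begin` and `end` relate to the text (both are character offsets, and `end` is inclusive); it is not how the annotator is implemented.

```python
import re

text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

rules = [
    (r"(\d{1,3}\.){3}\d{1,3}", "IPADDR"),
    (r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}", "DATE"),
]

matches = []
for pattern, label in rules:
    for m in re.finditer(pattern, text):
        # m.end() is exclusive in Python; subtract 1 to get an inclusive end offset.
        matches.append((m.group(0), m.start(), m.end() - 1, label))

rows = sorted(matches, key=lambda row: row[1])
for row in rows:
    print(row)
# ('2093-01-13', 38, 47, 'DATE')
# ('203.120.223.13', 97, 110, 'IPADDR')
# ('01/13/93', 188, 195, 'DATE')
```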

//rules = '''
//(\d{1,3}\.){3}\d{1,3}~IPADDR
//\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}|\d{2}/\d{2}/\d{2}~DATE
//'''
//
//with open('./rules/regex_rules.txt', 'w') as f:
//    f.write(rules)

val text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
  |Dr. John Green, ID: 1231511863, IP 203.120.223.13
  |He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
  |Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
  |Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""".stripMargin

import spark.implicits._

val data = Seq(text).toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val regexMatcher = new RegexMatcher()
  .setInputCols("document")
  .setStrategy("MATCH_ALL")
  .setOutputCol("regex_matches")
  .setExternalRulesPath("./rules/regex_rules.txt")
  .setDelimiter("~")

val nlpPipeline = new Pipeline()
  .setStages(Array(documentAssembler, regexMatcher))

val result = nlpPipeline.fit(data).transform(data)

// result
+--------------+-----+---+---------+
|  regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
|    2093-01-13|   38| 47|     DATE|
|203.120.223.13|   97|110|   IPADDR|
|      01/13/93|  188|195|     DATE|
+--------------+-----+---+---------+

RelationExtraction

Extracts and classifies instances of relations between named entities.

Parameters:

  • predictionThreshold (Float): Sets minimal activation of the target unit to encode a new relation instance.

  • relationPairs (List[Str]): List of dash-separated pairs of named entities. For example, [“Biomarker-RelativeDay”] will process all relations between entities of type “Biomarker” and “RelativeDay”.

  • relationPairsCaseSensitive (Bool): Determines whether relation pairs are case sensitive.

  • relationTypePerPair (dict[str, list[str]]): Entity pairs allowed for each relation type, which limits the entities that can form a given relation. For example, {“CAUSE”: [“PROBLEM”, “SYMPTOM”]} only lets a “CAUSE” relation hold between a problem (“PROBLEM”) and a symptom (“SYMPTOM”).

  • maxSyntacticDistance (Int): Maximum syntactic distance, used as a threshold (Default: 0). It determines how far the “from entity” can be from the “to entity” in the text. Increasing this value increases recall, but also the number of false positives.

  • customLabels (dict[str, str]): Custom relation labels.

  • multiClass (Bool): If multiClass is set, the model returns all the labels with their corresponding scores (Default: False).

  • DoExceptionHandling: If set to True, the annotator tries to process the data as usual, and if exception-causing data (e.g. a corrupted record or document) is passed to the annotator, a warning containing the exception message is emitted.
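
To make relationPairs and relationPairsCaseSensitive concrete, here is a small plain-Python sketch of the filtering idea. It is an illustration only, not the library's code: `pair_allowed` is a hypothetical helper, and both treating each dash-separated pair as ordered (first entity, then second) and treating an empty list as "no restriction" are assumptions.

```python
def pair_allowed(entity1, entity2, relation_pairs, case_sensitive=False):
    """Return True if the (entity1, entity2) label pair is listed in relation_pairs."""
    # Assumption: an empty list means that no restriction is applied.
    if not relation_pairs:
        return True
    candidate = f"{entity1}-{entity2}"
    if case_sensitive:
        return candidate in relation_pairs
    return candidate.lower() in {pair.lower() for pair in relation_pairs}


relation_pairs = ["Biomarker-RelativeDay"]

print(pair_allowed("Biomarker", "RelativeDay", relation_pairs))   # True
print(pair_allowed("BIOMARKER", "RELATIVEDAY", relation_pairs))   # True, case is ignored
print(pair_allowed("BIOMARKER", "RELATIVEDAY", relation_pairs, case_sensitive=True))  # False
print(pair_allowed("Drug", "Dosage", relation_pairs))             # False
```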

For pretrained models please see the Models Hub for available models.

Input Annotator Types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY

Output Annotator Type: CATEGORY

Python API: RelationExtractionModel Scala API: RelationExtractionModel Notebook: RelationExtractionModelNotebook
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = medical.NerModel()\
    .pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

reModel = medical.RelationExtractionModel()\
    .pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    reModel
])

text = """
The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also
given 1 unit of Metformin daily.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night ,
12 units of insulin lispro with meals , and metformin 1000 mg two times a day.
"""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# Show results
from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(
                              result.relations.result,
                              result.relations.metadata)).alias("cols"))\
.select(
    F.expr("cols['1']['chunk1']").alias("chunk1"),
    F.expr("cols['1']['chunk2']").alias("chunk2"),
    F.expr("cols['1']['entity1']").alias("entity1"),
    F.expr("cols['1']['entity2']").alias("entity2"),
    F.expr("cols['0']").alias("relations"),
    F.expr("cols['1']['confidence']").alias("confidence")).show(5, truncate=False)

+---------+----------------+-------+---------+--------------+----------+
|chunk1   |chunk2          |entity1|entity2  |relations     |confidence|
+---------+----------------+-------+---------+--------------+----------+
|1 unit   |Advil           |DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
|Advil    |for 5 days      |DRUG   |DURATION |DRUG-DURATION |1.0       |
|1 unit   |Metformin       |DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
|Metformin|daily           |DRUG   |FREQUENCY|DRUG-FREQUENCY|1.0       |
|40 units |insulin glargine|DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
+---------+----------------+-------+---------+--------------+----------+

import spark.implicits._

val documenter = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentencer = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentences") 

val tokenizer = new Tokenizer()
    .setInputCols("sentences") 
    .setOutputCol("tokens") 

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") 
    .setInputCols(Array("sentences","tokens")) 
    .setOutputCol("embeddings") 

val pos_tagger = PerceptronModel.pretrained("pos_clinical","en","clinical/models") 
    .setInputCols(Array("sentences","tokens")) 
    .setOutputCol("pos_tags") 

val ner_tagger = MedicalNerModel.pretrained("ner_posology","en","clinical/models") 
    .setInputCols("sentences","tokens","embeddings") 
    .setOutputCol("ner_tags") 

val ner_chunker = new NerConverterInternal()
    .setInputCols(Array("sentences","tokens","ner_tags")) 
    .setOutputCol("ner_chunks") 

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu","en") 
    .setInputCols(Array("sentences","pos_tags","tokens")) 
    .setOutputCol("dependencies") 

val reModel = RelationExtractionModel.pretrained("posology_re") 
    .setInputCols(Array("embeddings","pos_tags","ner_chunks","dependencies")) 
    .setOutputCol("relations") 
    .setMaxSyntacticDistance(4) 

val pipeline = new Pipeline().setStages(Array(
                                             documenter, 
                                             sentencer, 
                                             tokenizer,
                                             words_embedder, 
                                             pos_tagger, 
                                             ner_tagger, 
                                             ner_chunker, 
                                             dependency_parser, 
                                             reModel )) 

val text = " The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also given 1 unit of Metformin daily. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals ,and metformin 1000 mg two times a day. " 

val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)

// Show results

+---------+----------------+-------+---------+--------------+----------+
|chunk1   |chunk2          |entity1|entity2  |relations     |confidence|
+---------+----------------+-------+---------+--------------+----------+
|1 unit   |Advil           |DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
|Advil    |for 5 days      |DRUG   |DURATION |DRUG-DURATION |1.0       |
|1 unit   |Metformin       |DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
|Metformin|daily           |DRUG   |FREQUENCY|DRUG-FREQUENCY|1.0       |
|40 units |insulin glargine|DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
+---------+----------------+-------+---------+--------------+----------+

Trains a TensorFlow model for relation extraction.

To train a custom relation extraction model, you first need to create a TensorFlow graph using either the TfGraphBuilder annotator or the tf_graph module. Then, set the path to the TensorFlow graph using the method .setModelFile("path/to/tensorflow_graph.pb").

If the parameter relationDirectionCol is set, the model will be trained using the direction information (see the parameter description for details). Otherwise, the model will not encode any direction between the related entities.

After training a model (using the .fit() method), the resulting object is of class RelationExtractionModel.
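
As a rough illustration of what the training input might look like, the sketch below builds two training rows in plain Python. The column names from_begin, from_end, from_label, to_begin, to_end, to_label and target_rel match the ones passed to setFromEntity, setToEntity and setLabelColumn in the example on this page. Everything else is an assumption: rel_direction is a hypothetical name for the column given to relationDirectionCol, the direction values assigned to each row are placeholders, and inclusive end offsets are assumed from the output tables shown on this page. In practice the rows would be loaded into a Spark DataFrame before calling .fit().

```python
# Allowed values for the relation direction column, as listed in the parameter description.
ALLOWED_DIRECTIONS = {"none", "left", "right"}


def span(text, chunk):
    """Return inclusive (begin, end) character offsets of the first occurrence of chunk."""
    begin = text.index(chunk)
    return begin, begin + len(chunk) - 1


def make_row(text, from_chunk, from_label, to_chunk, to_label, relation, direction):
    """Build one training row; 'rel_direction' is a hypothetical column name."""
    if direction not in ALLOWED_DIRECTIONS:
        raise ValueError(f"unsupported direction: {direction!r}")
    from_begin, from_end = span(text, from_chunk)
    to_begin, to_end = span(text, to_chunk)
    return {
        "text": text,
        "from_begin": from_begin, "from_end": from_end, "from_label": from_label,
        "to_begin": to_begin, "to_end": to_end, "to_label": to_label,
        "target_rel": relation,
        "rel_direction": direction,
    }


text = "The patient was prescribed 1 unit of Advil for 5 days after meals."

# Direction values here are placeholders chosen only to show the format.
rows = [
    make_row(text, "1 unit", "DOSAGE", "Advil", "DRUG", "DOSAGE-DRUG", "right"),
    make_row(text, "Advil", "DRUG", "for 5 days", "DURATION", "DRUG-DURATION", "none"),
]

for row in rows:
    print(row)
```

Computing the offsets from the text, rather than typing them by hand, keeps the begin and end columns consistent with the sentence they refer to.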

Parameters:

  • FromEntity: (begin_col: str, end_col: str, label_col: str) Sets the "from" entity of the relation

      • begin_col: Column that holds the position where the chunk begins

      • end_col: Column that holds the position where the chunk ends

      • label_col: Column that holds the type (label) of the chunk

  • ToEntity: (begin_col: str, end_col: str, label_col: str) Sets the "to" entity of the relation

      • begin_col: Column that holds the position where the chunk begins

      • end_col: Column that holds the position where the chunk ends

      • label_col: Column that holds the type (label) of the chunk

  • CustomLabels: (labels: dict[str, str]) Sets custom relation labels

      • labels: Dictionary that maps old labels to new labels

  • RelationDirectionCol: (col: str) Relation direction column (possible values are: “none”, “left” or “right”). If this parameter is not set, the model will not encode any direction between the related entities

      • col: Column that contains the relation direction values

  • PretrainedModelPath: (value: str) Path to an already trained model saved to disk, which is used as a starting point for training the new model

  • OverrideExistingLabels: (Boolean) Whether to override already learned labels when using a pretrained model to initialize the new model. A value of true overrides existing labels. Default is true

  • BatchSize: (Int) Size of each batch in the optimization process

  • EpochsNumber: (Int) Maximum number of epochs to train

  • Dropout: (Float) Dropout at the output of each layer

  • LearningRate: (Float) Learning rate for the optimization process

  • OutputLogsPath: (Str) Folder path to save training logs. If no path is specified, the logs won’t be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or cloud storage (S3).

  • ModelFile: (Str) Path to the TensorFlow graph file (the model file used for classification)

  • FixImbalance: (Boolean) Fix the imbalance in the training set by replicating examples of under-represented categories

  • ValidationSplit: (Float) The proportion of the training dataset to be used as the validation set

  • MultiClass: (Boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.

  • MaxSyntacticDistance: (Int) Maximal syntactic distance, as threshold (Default: 0)
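
To make the ValidationSplit and FixImbalance descriptions more concrete, here is a minimal plain-Python sketch of the two ideas: holding out a proportion of the rows for validation, and replicating examples of under-represented labels. It is not the annotator's implementation. The helper names, the fixed seed, the choice to split before balancing, and the choice to replicate up to the size of the largest class are all assumptions made for illustration.

```python
import random
from collections import Counter


def validation_split(rows, proportion, seed=0):
    """Shuffle rows and hold out `proportion` of them as a validation set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * proportion)
    return shuffled[n_val:], shuffled[:n_val]


def fix_imbalance(rows, label_key):
    """Replicate rows of smaller classes until every class matches the largest one."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Cycle through the group so each label ends up with exactly `target` rows.
        balanced.extend(group[i % len(group)] for i in range(target))
    return balanced


# Toy data: 8 rows of one relation label and 2 of another.
all_rows = (
    [{"target_rel": "DOSAGE-DRUG"} for _ in range(8)]
    + [{"target_rel": "DRUG-DURATION"} for _ in range(2)]
)

train_rows, val_rows = validation_split(all_rows, proportion=0.2)
balanced_rows = fix_imbalance(train_rows, "target_rel")
balanced_counts = Counter(row["target_rel"] for row in balanced_rows)

print(len(train_rows), len(val_rows))
print(balanced_counts)
```

Balancing only the training portion keeps replicated rows out of the validation set, so the validation metrics are not inflated by duplicates.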

Input Annotator Types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY

Output Annotator Type: NONE

Python API: RelationExtractionApproach Scala API: RelationExtractionApproach Notebook: RelationExtractionApproachNotebook

from johnsnowlabs import nlp, medical

# Defining pipeline stages to extract entities first
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("tokens")

embedder = nlp.WordEmbeddingsModel \
  .pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["document", "tokens"]) \
  .setOutputCol("embeddings")

posTagger = nlp.PerceptronModel \
  .pretrained("pos_clinical", "en", "clinical/models") \
  .setInputCols(["document", "tokens"]) \
  .setOutputCol("posTags")

nerTagger = medical.NerModel \
  .pretrained("ner_events_clinical", "en", "clinical/models") \
  .setInputCols(["document", "tokens", "embeddings"]) \
  .setOutputCol("ner_tags")

nerConverter = nlp.NerConverter() \
  .setInputCols(["document", "tokens", "ner_tags"]) \
  .setOutputCol("nerChunks")

depencyParser = nlp.DependencyParserModel \
  .pretrained("dependency_conllu", "en") \
  .setInputCols(["document", "posTags", "tokens"]) \
  .setOutputCol("dependencies")

# Then define `RelationExtractionApproach` and training parameters
re = medical.RelationExtractionApproach() \
  .setInputCols(["embeddings", "posTags", "train_ner_chunks", "dependencies"]) \
  .setOutputCol("relations_t") \
  .setLabelColumn("target_rel") \
  .setEpochsNumber(300) \
  .setBatchSize(200) \
  .setLearningRate(0.001) \
  .setModelFile("path/to/graph_file.pb") \
  .setFixImbalance(True) \
  .setValidationSplit(0.05) \
  .setFromEntity("from_begin", "from_end", "from_label") \
  .setToEntity("to_begin", "to_end", "to_label")

finisher = nlp.Finisher() \
  .setInputCols(["relations_t"]) \
  .setOutputCols(["relations"]) \
  .setCleanAnnotations(False) \
  .setValueSplitSymbol(",") \
  .setAnnotationSplitSymbol(",") \
  .setOutputAsArray(False)

# Define complete pipeline and start training
pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    tokenizer,
    embedder,
    posTagger,
    nerTagger,
    nerConverter,
    depencyParser,
    re,
    finisher])

model = pipeline.fit(trainData)

import spark.implicits._

// Defining pipeline stages to extract entities first
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("tokens")

val embedder = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "tokens"))
  .setOutputCol("embeddings")

val posTagger = PerceptronModel
  .pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "tokens"))
  .setOutputCol("posTags")

val nerTagger = MedicalNerModel
  .pretrained("ner_events_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "tokens", "embeddings"))
  .setOutputCol("ner_tags")

val nerConverter = new NerConverter()
  .setInputCols(Array("document", "tokens", "ner_tags"))
  .setOutputCol("nerChunks")

val depencyParser = DependencyParserModel
  .pretrained("dependency_conllu", "en")
  .setInputCols(Array("document", "posTags", "tokens"))
  .setOutputCol("dependencies")

// Then define `RelationExtractionApproach` and training parameters
val re = new RelationExtractionApproach()
  .setInputCols(Array("embeddings", "posTags", "train_ner_chunks", "dependencies"))
  .setOutputCol("relations_t")
  .setLabelColumn("target_rel")
  .setEpochsNumber(300)
  .setBatchSize(200)
  .setlearningRate(0.001f)
  .setModelFile("path/to/graph_file.pb")
  .setFixImbalance(true)
  .setValidationSplit(0.05f)
  .setFromEntity("from_begin", "from_end", "from_label")
  .setToEntity("to_begin", "to_end", "to_label")

val finisher = new Finisher()
  .setInputCols(Array("relations_t"))
  .setOutputCols(Array("relations"))
  .setCleanAnnotations(false)
  .setValueSplitSymbol(",")
  .setAnnotationSplitSymbol(",")
  .setOutputAsArray(false)

// Define complete pipeline and start training
val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embedder,
    posTagger,
    nerTagger,
    nerConverter,
    depencyParser,
    re,
    finisher))

val model = pipeline.fit(trainData)

RelationExtractionDL

Model

This Relation Extraction annotator extracts and classifies instances of relations between named entities. In contrast with RelationExtractionModel, RelationExtractionDLModel is based on BERT.

Parameters:

  • predictionThreshold (Float): Sets minimal activation of the target unit to encode a new relation instance.

  • customLabels (dict[str, str]): Custom relation labels.

  • DoExceptionHandling (Boolean): If set to True, the annotator tries to process the data as usual, and if exception-causing data (e.g. a corrupted record or document) is passed to it, an exception warning containing the exception message is emitted.
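
As a rough mental model of predictionThreshold and customLabels, the sketch below applies both ideas to a few rows copied from the Legal example output on this page. It is a plain-Python approximation, not the annotator's internal logic: the helper names, the threshold of 0.6, and the new label names are hypothetical, and treating the threshold as a simple cutoff on the reported confidence is an assumption.

```python
# Rows copied from the Legal example output on this page (subset of columns).
relations = [
    {"chunk1": "INTELLECTUAL PROPERTY AGREEMENT", "chunk2": "Armstrong Flooring, Inc",
     "relation": "signed_by", "confidence": 0.7816506},
    {"chunk1": "INTELLECTUAL PROPERTY AGREEMENT", "chunk2": "AFI Licensing LLC",
     "relation": "signed_by", "confidence": 0.53521496},
    {"chunk1": "Armstrong Flooring, Inc", "chunk2": "Seller",
     "relation": "has_alias", "confidence": 0.8962001},
    {"chunk1": "Party", "chunk2": "Parties",
     "relation": "has_collective_alias", "confidence": 0.5040909},
]


def apply_threshold(rows, threshold):
    """Keep only relations whose confidence reaches the threshold."""
    return [row for row in rows if row["confidence"] >= threshold]


def relabel(rows, custom_labels):
    """Map old relation labels to new ones; unmapped labels are left unchanged."""
    return [
        {**row, "relation": custom_labels.get(row["relation"], row["relation"])}
        for row in rows
    ]


# Hypothetical mapping from old labels to new labels.
custom_labels = {"signed_by": "SIGNED_BY", "has_alias": "HAS_ALIAS"}

kept = apply_threshold(relations, threshold=0.6)
renamed = relabel(kept, custom_labels)

for row in renamed:
    print(row["chunk1"], "->", row["chunk2"], row["relation"], row["confidence"])
```

A higher threshold trades recall for precision: with 0.6 only the two most confident rows survive, while the lower thresholds used in the examples on this page keep all four.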

Available models can be found at the Models Hub.

For more extended examples on document pre-processing, see the Spark NLP Workshop.

Input Annotator Types: CHUNK, DOCUMENT

Output Annotator Type: CATEGORY

Python API: RelationExtractionDLModel Scala API: RelationExtractionDLModel Notebook: RelationExtractionDLModelNotebook

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")

ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade", "ade-drug"])

ade_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentence"]) \
    .setPredictionThreshold(0.5)\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)

+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end|   chunk1|entity1|entity2_begin|entity2_end|                                                   chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|       0|           25|         32| naproxen|   DRUG|          137|        148|                                             tense bullae|    ADE|       1| 0.9989047|
|       0|           25|         32| naproxen|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1| 0.9989704|
|       0|           87|         95|oxaprozin|   DRUG|          137|        148|                                             tense bullae|    ADE|       1|0.99895453|
|       0|           87|         95|oxaprozin|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+

from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = finance.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_dates")

ner_converter_date = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_dates"])\
    .setOutputCol("ner_chunk_date")

ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_orgs")

ner_converter_org = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_orgs"])\
    .setOutputCol("ner_chunk_org")

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols('ner_chunk_org', "ner_chunk_date")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE"])\
    .setMaxSyntacticDistance(10)

reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relation")\
    .setPredictionThreshold(0.1)

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    text_splitter,
    tokenizer,
    embeddings,
    ner_model_date,
    ner_converter_date,
    ner_model_org,
    ner_converter_org,
    chunk_merger,
    pos,
    dependency_parser,
    re_filter,
    reDL])

text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relation.metadata, result.relation.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)

+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                 chunk1|entity1|entity2_begin|entity2_end|         chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|       0|           16|         22|                Cadence|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|0.99687237|
|       0|           66|         80|        AWR Corporation|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|  0.993112|
|       0|           94|        116|Integrand Software, Inc|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date| 0.9741451|
|       0|           66|         80|        AWR Corporation|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|  0.997124|
|       0|           94|        116|Integrand Software, Inc|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|0.99910504|
|       0|           94|        116|Integrand Software, Inc|    ORG|           66|         80|AWR Corporation|    ORG|     was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+

from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

text_splitter = legal.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = legal.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

re_model = legal.RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")\
    .setPredictionThreshold(0.1)\
    .setInputCols(["ner_chunk", "sentence"])\
    .setOutputCol("relation")

pipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        re_model
        ])

text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relation.metadata, result.relation.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)

+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                             chunk1|entity1|entity2_begin|entity2_end|                 chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|           69|         85|      December 31, 2018|EFFDATE|            dated_as| 0.9856822|
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|          141|        163|Armstrong Flooring, Inc|  PARTY|           signed_by| 0.7816506|
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|          205|        221|      AFI Licensing LLC|  PARTY|           signed_by|0.53521496|
|       0|          141|        163|            Armstrong Flooring, Inc|  PARTY|          192|        197|                 Seller|  ALIAS|           has_alias| 0.8962001|
|       0|          205|        221|                  AFI Licensing LLC|  PARTY|          263|        271|              Licensing|  ALIAS|           has_alias|0.95189077|
|       0|          292|        297|                             Seller|  ALIAS|          301|        307|                Arizona|  ALIAS|has_collective_alias| 0.8934925|
|       1|          411|        445|Armstrong Hardwood Flooring Company|  PARTY|          478|        484|                Company|  ALIAS|           has_alias|0.98353034|
|       1|          505|        509|                              Buyer|  ALIAS|          516|        529|         Buyer Entities|  ALIAS|has_collective_alias| 0.7217146|
|       1|          611|        615|                              Party|  ALIAS|          641|        647|                Parties|  ALIAS|has_collective_alias| 0.5040909|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+

import spark.implicits._

val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val wordsEmbedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val posTagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos_tags")

val nerTagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_tags")

val nerChunker = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_tags"))
  .setOutputCol("ner_chunks")

val dependencyParser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos_tags", "token"))
  .setOutputCol("dependencies")

val adeReNerChunkFilter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunks", "dependencies"))
  .setOutputCol("re_ner_chunks")
  .setMaxSyntacticDistance(10)
  .setRelationPairs(Array("drug-ade", "ade-drug"))

val adeReModel = RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models")
  .setInputCols(Array("re_ner_chunks", "sentence"))
  .setPredictionThreshold(0.5f)
  .setOutputCol("relations")

val pipeline = new Pipeline()
  .setStages(Array(
    documenter,
    sentencer,
    tokenizer,
    wordsEmbedder,
    posTagger,
    nerTagger,
    nerChunker,
    dependencyParser,
    adeReNerChunkFilter,
    adeReModel
  ))

val text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end|   chunk1|entity1|entity2_begin|entity2_end|                                                   chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|       0|           25|         32| naproxen|   DRUG|          137|        148|                                             tense bullae|    ADE|       1| 0.9989047|
|       0|           25|         32| naproxen|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1| 0.9989704|
|       0|           87|         95|oxaprozin|   DRUG|          137|        148|                                             tense bullae|    ADE|       1|0.99895453|
|       0|           87|         95|oxaprozin|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter() 
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en", "finance/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model_date = FinanceNerModel.pretrained("finner_sec_dates", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_dates")

val ner_converter_date = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_dates"))
  .setOutputCol("ner_chunk_date")

val ner_model_org = FinanceNerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_orgs")

val ner_converter_org = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_orgs"))
  .setOutputCol("ner_chunk_org")

val chunk_merger = new ChunkMergeApproach()
  .setInputCols(Array("ner_chunk_org", "ner_chunk_date"))
  .setOutputCol("ner_chunk")

val pos = PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos", "token"))
  .setOutputCol("dependencies")

val re_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunk")
  .setRelationPairs(Array("ORG-ORG", "ORG-DATE"))
  .setMaxSyntacticDistance(10)

val reDL = RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relation")
  .setPredictionThreshold(0.1f)

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    text_splitter,
    tokenizer,
    embeddings,
    ner_model_date,
    ner_converter_date,
    ner_model_org,
    ner_converter_org,
    chunk_merger,
    pos,
    dependency_parser,
    re_filter,
    reDL
  ))

val text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                 chunk1|entity1|entity2_begin|entity2_end|         chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|       0|           16|         22|                Cadence|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|0.99687237|
|       0|           66|         80|        AWR Corporation|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|  0.993112|
|       0|           94|        116|Integrand Software, Inc|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date| 0.9741451|
|       0|           66|         80|        AWR Corporation|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|  0.997124|
|       0|           94|        116|Integrand Software, Inc|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|0.99910504|
|       0|           94|        116|Integrand Software, Inc|    ORG|           66|         80|AWR Corporation|    ORG|     was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
  .setMaxSentenceLength(512)

val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val re_model = RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")
  .setPredictionThreshold(0.1f)
  .setInputCols(Array("ner_chunk", "sentence"))
  .setOutputCol("relation")

val pipeline = new Pipeline()
  .setStages(Array(
    document_assembler,
    text_splitter,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    re_model
  ))

val text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                             chunk1|entity1|entity2_begin|entity2_end|                 chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|           69|         85|      December 31, 2018|EFFDATE|            dated_as| 0.9856822|
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|          141|        163|Armstrong Flooring, Inc|  PARTY|           signed_by| 0.7816506|
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|          205|        221|      AFI Licensing LLC|  PARTY|           signed_by|0.53521496|
|       0|          141|        163|            Armstrong Flooring, Inc|  PARTY|          192|        197|                 Seller|  ALIAS|           has_alias| 0.8962001|
|       0|          205|        221|                  AFI Licensing LLC|  PARTY|          263|        271|              Licensing|  ALIAS|           has_alias|0.95189077|
|       0|          292|        297|                             Seller|  ALIAS|          301|        307|                Arizona|  ALIAS|has_collective_alias| 0.8934925|
|       1|          411|        445|Armstrong Hardwood Flooring Company|  PARTY|          478|        484|                Company|  ALIAS|           has_alias|0.98353034|
|       1|          505|        509|                              Buyer|  ALIAS|          516|        529|         Buyer Entities|  ALIAS|has_collective_alias| 0.7217146|
|       1|          611|        615|                              Party|  ALIAS|          641|        647|                Parties|  ALIAS|has_collective_alias| 0.5040909|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+

Replacer

Model

Replacer replaces entities in the original text with the substitutes produced by annotators such as NameChunkObfuscatorApproach or DateNormalizer.

Replacer is most often used together with the DateNormalizer annotator or in deidentification pipelines.

With dates, the DateNormalizer annotator normalizes dates and times to a standardized format, and the Replacer annotator then substitutes the normalized values for the original date expressions in the text.

Obfuscation in healthcare is the act of making healthcare data difficult to understand or use without authorization. This can be done by replacing or removing identifying information, such as names, dates of birth, and Social Security numbers. Obfuscation can also be used to hide the contents of healthcare records, such as diagnoses, medications, and treatment plans.

In the deidentification process, the Replacer annotator is used to replace certain tokens or patterns in the text with specified values. For example, it can be used to replace all instances of a person’s name with a placeholder like “PERSON”.
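The replacement step described above can be pictured with a small pure-Python sketch. It is not the Replacer implementation: the function name `replace_spans` and the `(begin, end, replacement)` tuple layout with inclusive end offsets are assumptions made only for this illustration.

```python
def replace_spans(text, spans):
    """Return text with each (begin, end, replacement) span substituted.

    Offsets are character indices with an inclusive end (assumed layout).
    Spans are assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for begin, end, replacement in sorted(spans):
        pieces.append(text[cursor:begin])  # untouched text before the span
        pieces.append(replacement)         # substitute for the entity
        cursor = end + 1                   # continue after the original entity
    pieces.append(text[cursor:])           # remainder after the last span
    return "".join(pieces)


text = "John Davies was seen by Dr. Lorand."
spans = [(0, 10, "PERSON"), (28, 33, "PERSON")]
obfuscated = replace_spans(text, spans)
print(obfuscated)
```

Building the output from slices of the original text keeps every offset valid, even when a replacement is longer or shorter than the entity it stands in for.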

The NameChunkObfuscatorApproach annotator is used to identify and obfuscate sensitive named entities in the text, such as people’s names, addresses, dates of birth, SSNs etc.
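To make the role of the reference file concrete, here is a minimal sketch of how `value#LABEL` lines could be grouped by label and sampled to obtain a fake name. The helper names `parse_ref_lines` and `pick_replacement` are hypothetical, and the random choice is only an assumption about how a substitute might be selected, not the annotator's actual logic.

```python
import random


def parse_ref_lines(lines, sep="#"):
    """Group replacement values by entity label from 'value<sep>LABEL' lines."""
    by_label = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        value, label = line.rsplit(sep, 1)
        by_label.setdefault(label, []).append(value)
    return by_label


def pick_replacement(by_label, label, rng):
    """Pick one replacement value for the given entity label."""
    return rng.choice(by_label[label])


ref_lines = ["Mitchell#NAME", "Clifford#NAME", "Jeremiah#NAME"]
by_label = parse_ref_lines(ref_lines)
rng = random.Random(0)  # seeded only to make the sketch repeatable
fake_name = pick_replacement(by_label, "NAME", rng)
```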

Parameter:

  • setUseReplacement: (Boolean) Whether to apply the replacement values from the incoming chunks when building the output document (set to True in the example below).

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: DOCUMENT

Python API: Replacer Scala API: Replacer Notebook: ReplacerNotebook

from johnsnowlabs import nlp, medical

names = """Mitchell#NAME
Clifford#NAME
Jeremiah#NAME
Lawrence#NAME
Brittany#NAME
Patricia#NAME
Samantha#NAME
Jennifer#NAME
Jackson#NAME
Leonard#NAME
Randall#NAME
Camacho#NAME
Ferrell#NAME
Mueller#NAME
Bowman#NAME
Hansen#NAME
Acosta#NAME
Gillespie#NAME
Zimmerman#NAME
Gillespie#NAME
Chandler#NAME
Bradshaw#NAME
Ferguson#NAME
Jacobson#NAME
Figueroa#NAME
Chandler#NAME
Schaefer#NAME
Matthews#NAME
Ferguson#NAME
Bradshaw#NAME
Figueroa#NAME
Delacruz#NAME
Gallegos#NAME
Villarreal#NAME
Williamson#NAME
Montgomery#NAME
Mclaughlin#NAME
Blankenship#NAME
Fitzpatrick#NAME
"""

with open('names_test.txt', 'w') as file:
    file.write(names)


# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter_name = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setRefFileFormat("csv")\
  .setObfuscateRefFile("names_test.txt")\
  .setRefSep("#")

replacer_name = medical.Replacer()\
  .setInputCols("replacement","sentence")\
  .setOutputCol("obfuscated_document_name")\
  .setUseReplacement(True)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter_name,
    nameChunkObfuscator,
    replacer_name
    ])

sample_text = "John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment."

data = spark.createDataFrame([[sample_text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

## Result

Original text   :  John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.

Obfuscated text :  Joseeduardo is a 62 y.o. patient admitted. Mr. Teigan was seen by attending physician Dr. Mayson and was scheduled for emergency assessment.

import spark.implicits._

/* Contents of names_test.txt (one value#LABEL entry per line):

Mitchell#NAME
Clifford#NAME
Jeremiah#NAME
Lawrence#NAME
Brittany#NAME
Patricia#NAME
Samantha#NAME
Jennifer#NAME
Jackson#NAME
Leonard#NAME
Randall#NAME
Camacho#NAME
Ferrell#NAME
Mueller#NAME
Bowman#NAME
Hansen#NAME
Acosta#NAME
Gillespie#NAME
Zimmerman#NAME
Gillespie#NAME
Chandler#NAME
Bradshaw#NAME
Ferguson#NAME
Jacobson#NAME
Figueroa#NAME
Chandler#NAME
Schaefer#NAME
Matthews#NAME
Ferguson#NAME
Bradshaw#NAME
Figueroa#NAME
Delacruz#NAME
Gallegos#NAME
Villarreal#NAME
Williamson#NAME
Montgomery#NAME
Mclaughlin#NAME
Blankenship#NAME
Fitzpatrick#NAME
*/

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented","en","clinical/models")
    .setInputCols(Array("sentence","token","embeddings"))
    .setOutputCol("ner")

val ner_converter_name = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val nameChunkObfuscator = new NameChunkObfuscatorApproach()
    .setInputCols("ner_chunk")
    .setOutputCol("replacement")
    .setRefFileFormat("csv")
    .setObfuscateRefFile("names_test.txt")
    .setRefSep("#")

val replacer_name = new Replacer()
    .setInputCols("replacement","sentence")
    .setOutputCol("obfuscated_document_name")
    .setUseReplacement(true)

val nlpPipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    tokenizer, 
    word_embeddings, 
    clinical_ner, 
    ner_converter_name, 
    nameChunkObfuscator, 
    replacer_name))


val test_data = Seq("""John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.""").toDF("text")

val res = nlpPipeline.fit(test_data).transform(test_data)

// Show results

Original text   :  John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.

Obfuscated text :  Joseeduardo is a 62 y.o. patient admitted. Mr. Teigan was seen by attending physician Dr. Mayson and was scheduled for emergency assessment.

Resolution2Chunk

Model

This annotator is responsible for converting the annotations generated by entity resolver models (typically labeled as ENTITY) into a format compatible with subsequent stages of the pipeline, such as the ChunkMapperModel. It transforms these annotations into CHUNK annotations, allowing for seamless integration and processing of clinical terminologies and entities in the pipeline.
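Conceptually, the conversion keeps the resolved value and its offsets and only changes the annotation type. The sketch below illustrates that idea with plain dictionaries; it is not the library's code, and the dictionary layout, the `resolution_to_chunk` name, and the `resolved_text` metadata key are assumptions made for the example.

```python
def resolution_to_chunk(annotation):
    """Relabel a resolver annotation as a chunk, keeping offsets and metadata."""
    chunk = dict(annotation)                          # shallow copy of the fields
    chunk["metadata"] = dict(annotation["metadata"])  # do not share metadata
    chunk["annotator_type"] = "chunk"
    return chunk


resolved = {
    "annotator_type": "entity",
    "begin": 0,
    "end": 12,
    "result": "103971",  # RxNorm code taken from the example in this section
    "metadata": {"resolved_text": "hypothetical description"},
}
chunk = resolution_to_chunk(resolved)
```

Because the chunk text is the resolved code, a downstream mapper that is keyed on codes can consume it directly.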

Input Annotator Types: ENTITY

Output Annotator Type: CHUNK

Python API: Resolution2Chunk Scala API: Resolution2Chunk Notebook: Resolution2ChunkNotebook

from johnsnowlabs import medical, nlp

document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

sbert_embedder = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models")\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setCaseSensitive(False)

rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")

resolver2chunk = medical.Resolution2Chunk()\
      .setInputCols(["rxnorm_code"]) \
      .setOutputCol("resolver2chunk")

chunkerMapper_action = medical.ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\
      .setInputCols(["resolver2chunk"])\
      .setOutputCol("action_mapping")\
      .setRels(["action"]) #for treatment

pipeline = nlp.Pipeline().setStages([document_assembler,
                                 sbert_embedder,
                                 rxnorm_resolver,
                                 resolver2chunk,
                                 chunkerMapper_action
                                 ])

data= spark.createDataFrame([['Zonalon 50 mg']]).toDF('text')

res= pipeline.fit(data).transform(data)

# Example results
from pyspark.sql import functions as F

res.select(F.explode(F.arrays_zip(res.ner_chunk.result,
                                  res.rxnorm_code.result,
                                  res.action_mapping.result)).alias("col"))\
    .select(F.expr("col['0']").alias("document"),
            F.expr("col['1']").alias("rxnorm_code"),
            F.expr("col['2']").alias("Action Mapping")).show(truncate=False)

+-------------+-----------+--------------+
|document     |rxnorm_code|Action Mapping|
+-------------+-----------+--------------+
|Zonalon 50 mg|103971     |Analgesic     |
+-------------+-----------+--------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
  .setInputCols("ner_chunk")
  .setOutputCol("sentence_embeddings")
  .setCaseSensitive(false)

val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en","clinical/models")
  .setInputCols("sentence_embeddings")
  .setOutputCol("rxnorm_code")
  .setDistanceFunction("EUCLIDEAN")

val resolver2chunk = new Resolution2Chunk()
  .setInputCols("rxnorm_code")
  .setOutputCol("resolver2chunk")

val chunkerMapper_action = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper","en","clinical/models")
  .setInputCols("resolver2chunk")
  .setOutputCol("action_mapping")
  .setRels(Array("action"))

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sbert_embedder, 
    rxnorm_resolver, 
    resolver2chunk, 
    chunkerMapper_action )) 

val data = Seq("Zonalon 50 mg").toDF("text") 

val res = pipeline.fit(data).transform(data)

// Example results

+-------------+-----------+--------------+
|document     |rxnorm_code|Action Mapping|
+-------------+-----------+--------------+
|Zonalon 50 mg|103971     |Analgesic     |
+-------------+-----------+--------------+

ResolverMerger

Model

ResolverMerger merges the output columns of sentence entity resolver and chunk mapper models.

To convert a sentence or document into a vector for tasks like semantic search or recommendation systems, a common approach is to utilize transformer models like BERT. These models provide embeddings for each token in the text. One option is to extract the embedding vector of the CLS token, which represents the overall meaning of the text. Another option is to average the embeddings of all tokens.
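The two pooling options mentioned above can be compared on toy data. This is a minimal sketch using plain Python lists; the vectors are made up, and the function names `cls_pooling` and `mean_pooling` are illustrative rather than part of any library.

```python
def cls_pooling(token_vectors):
    """Use the first token's vector (the CLS position) as the text vector."""
    return list(token_vectors[0])


def mean_pooling(token_vectors):
    """Average the token vectors dimension by dimension."""
    count = len(token_vectors)
    return [sum(dim) / count for dim in zip(*token_vectors)]


# Toy 3-dimensional embeddings for the tokens [CLS], "chest", "pain".
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 4.0],
]
cls_vector = cls_pooling(token_vectors)
mean_vector = mean_pooling(token_vectors)
```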

Alternatively, we can use fine-tuned Siamese network variants like SBERT, which are specifically designed to generate embeddings that bring similar sentences or documents closer together in the embedding space while separating dissimilar ones. These embeddings can be applied in “Sentence Entity Resolver Models” to perform entity mapping.

However, for a more straightforward approach, we can use a chunk mapper method to extract entities from the text. In addition, by combining resolver models and mapper models using the ResolverMerger annotator, we can further enhance the performance and accuracy of the resolver system.
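The merging idea can be sketched in a few lines: codes found by the mapper are kept, and each unmapped slot is filled from the resolver output. This is only an illustration, not the ResolverMerger implementation; the `merge_codes` name is hypothetical, and it assumes that the resolver codes arrive in the same order as the chunks the mapper failed on.

```python
def merge_codes(mapper_codes, resolver_codes, missing="NONE"):
    """Fill each unmapped slot with the next resolver code, in order."""
    fallback = iter(resolver_codes)
    merged = []
    for code in mapper_codes:
        if code == missing:
            # Mapper had no entry: fall back to the resolver, if one is left.
            merged.append(next(fallback, missing))
        else:
            merged.append(code)
    return merged


# Values taken from the result table of the example in this section.
row1 = merge_codes(["1000049", "NONE"], ["200883"])
row2 = merge_codes(["261242", "203029", "NONE"], ["220989"])
```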

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read an Array of strings.

  • outputCol: The name of the output column that is generated. Only one column can be specified.

All the parameters can be set using the corresponding camel-case set method. For example, .setInputCols().

Input Annotator Types: ENTITY, LABEL_DEPENDENCY

Output Annotator Type: ENTITY

Python API: ResolverMerger Scala API: ResolverMerger Notebook: ResolverMergerNotebook

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("chunk")

chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
    .setInputCols(["chunk"])\
    .setOutputCol("RxNorm_Mapper")\
    .setRel("rxnorm_code")

cfModel = medical.ChunkMapperFilterer() \
    .setInputCols(["chunk", "RxNorm_Mapper"]) \
    .setOutputCol("chunks_fail") \
    .setReturnCriteria("fail")

chunk2doc = nlp.Chunk2Doc() \
    .setInputCols("chunks_fail") \
    .setOutputCol("doc_chunk")

sbert_embedder = nlp.BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
    .setInputCols(["doc_chunk"])\
    .setOutputCol("sentence_embeddings")\
    .setCaseSensitive(False)

resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("resolver_code") \
    .setDistanceFunction("EUCLIDEAN")

resolverMerger = medical.ResolverMerger()\
    .setInputCols(["resolver_code","RxNorm_Mapper"])\
    .setOutputCol("RxNorm")

mapper_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_model,
        ner_converter,
        chunkerMapper,
        cfModel,
        chunk2doc,
        sbert_embedder,
        resolver,
        resolverMerger
    ])

sample_text = [
    ["The patient was given Adapin 10 MG, coumadn 5 mg"],
    ["The patient was given Avandia 4 mg, Tegretol, zitiga"],
]

data = spark.createDataFrame(sample_text).toDF("text")

result = mapper_pipeline.fit(data).transform(data)

result.selectExpr(
    "chunk.result as chunk",
    "RxNorm_Mapper.result as RxNorm_Mapper",
    "chunks_fail.result as chunks_fail",
    "resolver_code.result as resolver_code",
    "RxNorm.result as RxNorm",
).show(truncate=False)


## Result

+--------------------------------+----------------------+--------------+-------------+------------------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |resolver_code|RxNorm                  |
+--------------------------------+----------------------+--------------+-------------+------------------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|[200883]     |[1000049, 200883]       |
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |[220989]     |[261242, 203029, 220989]|
+--------------------------------+----------------------+--------------+-------------+------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_posology_greedy","en","clinical/models")
    .setInputCols(Array("sentence","token","embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("chunk")

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper","en","clinical/models")
    .setInputCols("chunk")
    .setOutputCol("RxNorm_Mapper")
    .setRel("rxnorm_code")

val cfModel = new ChunkMapperFilterer()
    .setInputCols(Array("chunk","RxNorm_Mapper"))
    .setOutputCol("chunks_fail")
    .setReturnCriteria("fail")

val chunk2doc = new Chunk2Doc()
    .setInputCols("chunks_fail")
    .setOutputCol("doc_chunk")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols("doc_chunk")
    .setOutputCol("sentence_embeddings")
    .setCaseSensitive(false)

val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en","clinical/models")
    .setInputCols("sentence_embeddings")
    .setOutputCol("resolver_code")
    .setDistanceFunction("EUCLIDEAN")

val resolverMerger = new ResolverMerger()
    .setInputCols(Array("resolver_code","RxNorm_Mapper"))
    .setOutputCol("RxNorm")

val mapper_pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector, 
    tokenizer, 
    word_embeddings, 
    ner_model, 
    ner_converter, 
    chunkerMapper, 
    cfModel, 
    chunk2doc, 
    sbert_embedder, 
    resolver, 
    resolverMerger))


val data = Seq(("""The patient was given Adapin 10 MG, coumadn 5 mg"""),("""The patient was given Avandia 4 mg, Tegretol, zitiga""")).toDF("text")

val res = mapper_pipeline.fit(data).transform(data)

// Show results

+--------------------------------+----------------------+--------------+-------------+------------------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |resolver_code|RxNorm                  |
+--------------------------------+----------------------+--------------+-------------+------------------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|[200883]     |[1000049, 200883]       |
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |[220989]     |[261242, 203029, 220989]|
+--------------------------------+----------------------+--------------+-------------+------------------------+

Router

Model

Router splits the output of an annotator according to a selected metadata field and the values allowed for that field.

Previously, when multiple sentence entity resolver models were used in the same pipeline, the BertSentenceEmbeddings annotator typically had to be run once per resolver model. This meant that the heavy process of generating sentence embeddings using BERT was repeated multiple times.

To address this issue, Spark NLP Healthcare Library has introduced a solution using the Router annotator. With this new approach, we can provide all the named entity recognition (NER) chunks to the BertSentenceEmbeddings annotator at once. The annotator generates the sentence embeddings for all the chunks together. Then, the output of the sentence embeddings is routed to the specific resolver models that are required for further processing.

This solution eliminates the need to run BertSentenceEmbeddings multiple times, reducing the computational overhead and improving the efficiency of the pipeline.
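The routing step can be pictured as a filter on annotation metadata. The sketch below is not the Router implementation; it uses plain dictionaries, made-up embedding values, and a hypothetical `route` function to show how one shared set of embeddings could be split by the `entity` metadata field.

```python
def route(annotations, metadata_field, allowed_values):
    """Keep annotations whose metadata[metadata_field] is an allowed value."""
    allowed = set(allowed_values)
    return [a for a in annotations if a["metadata"].get(metadata_field) in allowed]


# Embeddings computed once for all chunks (values are made up).
sbert_embeddings = [
    {"result": "a cough", "metadata": {"entity": "PROBLEM"}, "embeddings": [0.1, 0.2]},
    {"result": "Coumadin", "metadata": {"entity": "DRUG"}, "embeddings": [0.3, 0.4]},
    {"result": "fever", "metadata": {"entity": "PROBLEM"}, "embeddings": [0.5, 0.6]},
]

problem_embeddings = route(sbert_embeddings, "entity", ["PROBLEM"])
drug_embeddings = route(sbert_embeddings, "entity", ["DRUG"])
```

Each resolver then receives only the subset it needs, while the embeddings themselves are computed a single time.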

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read an Array of strings.

  • outputCol: The name of the output column that is generated. Only one column can be specified.

  • inputType: The type of annotation that you want to filter (by default sentence_embeddings). Possible values: document|token|wordpiece|word_embeddings|sentence_embeddings|category|date|sentiment|pos|chunk|named_entity|regex|dependency|labeled_dependency|language|keyword

  • metadataField: The key in the metadata dictionary that you want to filter (by default entity)

  • filterFieldsElements: The allowed values for the metadata field that is being used.

All the parameters can be set using the corresponding camel-case set method. For example, .setInputCols().

Input Annotator Types: ENTITY, LABEL_DEPENDENCY

Output Annotator Type: ENTITY

Python API: Router Scala API: Router Notebook: RouterNotebook

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("word_embeddings")

# to get PROBLEM entities
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("clinical_ner")

clinical_ner_chunk = medical.NerConverterInternal()\
    .setInputCols("sentence","token","clinical_ner")\
    .setOutputCol("clinical_ner_chunk")\
    .setWhiteList(["PROBLEM"])

# to get DRUG entities
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("posology_ner")

posology_ner_chunk = medical.NerConverterInternal()\
    .setInputCols("sentence","token","posology_ner")\
    .setOutputCol("posology_ner_chunk")\
    .setWhiteList(["DRUG"])

# merge the chunks into a single ner_chunk
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("clinical_ner_chunk","posology_ner_chunk")\
    .setOutputCol("final_ner_chunk")\
    .setMergeOverlapping(False)

# convert chunks to doc to get sentence embeddings of them
chunk2doc = nlp.Chunk2Doc().setInputCols("final_ner_chunk").setOutputCol("doc_final_chunk")

sbiobert_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["doc_final_chunk"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)

# filter PROBLEM entity embeddings
router_sentence_icd10 = medical.Router() \
    .setInputCols("sbert_embeddings") \
    .setFilterFieldsElements(["PROBLEM"]) \
    .setOutputCol("problem_embeddings")

# filter DRUG entity embeddings
router_sentence_rxnorm = medical.Router() \
    .setInputCols("sbert_embeddings") \
    .setFilterFieldsElements(["DRUG"]) \
    .setOutputCol("drug_embeddings")

# use problem_embeddings only
icd_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") \
    .setInputCols(["problem_embeddings"]) \
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")

# use drug_embeddings only
rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
    .setInputCols(["drug_embeddings"]) \
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")


pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    clinical_ner_chunk,
    posology_ner,
    posology_ner_chunk,
    chunk_merger,
    chunk2doc,
    sbiobert_embeddings,
    router_sentence_icd10,
    router_sentence_rxnorm,
    icd_resolver,
    rxnorm_resolver
])

clinical_note = """The patient is a 41-year-old Vietnamese female with a cough that started last week.
She has had right-sided chest pain radiating to her back with fever starting yesterday.
She has a history of pericarditis in May 2006 and developed cough with right-sided chest pain.
MEDICATIONS
1. Coumadin 1 mg daily. Last INR was on Tuesday, August 14, 2007, and her INR was 2.3.
2. Amiodarone 100 mg p.o. daily.
"""

data = spark.createDataFrame([[clinical_note]]).toDF("text")

result = pipeline.fit(data).transform(data)

## Result

result.selectExpr(
    "final_ner_chunk.result as chunk",
    "posology_ner_chunk.result as posology_chunk",
    "rxnorm_code.result as rxnorm_code",
    "clinical_ner_chunk.result as clinical_chunk",
    "icd10cm_code.result as icd10cm_code",
).show(truncate=False)

+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|chunk                                                                                                      |posology_chunk        |rxnorm_code  |clinical_chunk                                                                       |icd10cm_code                          |
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain, Coumadin, Amiodarone]|[Coumadin, Amiodarone]|[202421, 703]|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain]|[R05, R10.11, A68, I30.1, R05, R10.11]|
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("word_embeddings")
    
// to get PROBLEM entities
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence","token","word_embeddings"))
    .setOutputCol("clinical_ner")

val clinical_ner_chunk = new NerConverterInternal()
    .setInputCols("sentence","token","clinical_ner")
    .setOutputCol("clinical_ner_chunk")
    .setWhiteList("PROBLEM")

// to get DRUG entities 
val posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
    .setInputCols(Array("sentence","token","word_embeddings"))
    .setOutputCol("posology_ner")

val posology_ner_chunk = new NerConverterInternal()
    .setInputCols("sentence","token","posology_ner")
    .setOutputCol("posology_ner_chunk")
    .setWhiteList("DRUG")

// merge the chunks into a single ner_chunk 
val chunk_merger = new ChunkMergeApproach()
    .setInputCols(Array("clinical_ner_chunk","posology_ner_chunk"))
    .setOutputCol("final_ner_chunk")
    .setMergeOverlapping(false)

// convert chunks to doc to get sentence embeddings of them 
val chunk2doc = new Chunk2Doc()
    .setInputCols("final_ner_chunk")
    .setOutputCol("doc_final_chunk")

val sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols("doc_final_chunk")
    .setOutputCol("sbert_embeddings")
    .setCaseSensitive(false)

// filter PROBLEM entity embeddings 
val router_sentence_icd10 = new Router()
    .setInputCols("sbert_embeddings")
    .setFilterFieldsElements("PROBLEM")
    .setOutputCol("problem_embeddings")
    
// filter DRUG entity embeddings 
val router_sentence_rxnorm = new Router()
    .setInputCols("sbert_embeddings")
    .setFilterFieldsElements("DRUG")
    .setOutputCol("drug_embeddings")
    
// use problem_embeddings only 
val icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc", "en", "clinical/models")
    .setInputCols("problem_embeddings")
    .setOutputCol("icd10cm_code")
    .setDistanceFunction("EUCLIDEAN")
    
// use drug_embeddings only 
val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")
    .setInputCols("drug_embeddings")
    .setOutputCol("rxnorm_code")
    .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    clinical_ner, 
    clinical_ner_chunk, 
    posology_ner, 
    posology_ner_chunk, 
    chunk_merger, 
    chunk2doc, 
    sbiobert_embeddings, 
    router_sentence_icd10, 
    router_sentence_rxnorm, 
    icd_resolver, 
    rxnorm_resolver))


val data = Seq("""The patient is a 41-year-old Vietnamese female with a cough that started last week.
She has had right-sided chest pain radiating to her back with fever starting yesterday.
She has a history of pericarditis in May 2006 and developed cough with right-sided chest pain.
MEDICATIONS
1. Coumadin 1 mg daily. Last INR was on Tuesday, August 14, 2007, and her INR was 2.3.
2. Amiodarone 100 mg p.o. daily.""").toDF("text")

val res = pipeline.fit(data).transform(data)

// Show results

+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|chunk                                                                                                      |posology_chunk        |rxnorm_code  |clinical_chunk                                                                       |icd10cm_code                          |
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain, Coumadin, Amiodarone]|[Coumadin, Amiodarone]|[202421, 703]|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain]|[R05, R10.11, A68, I30.1, R05, R10.11]|
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
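
To make the routing step concrete, here is a minimal pure-Python sketch of the idea, not the Router implementation itself. It assumes each embedding annotation carries its entity label in a metadata dictionary; the dictionary layout and the `route` helper are hypothetical and only illustrate how one embeddings column can be split into PROBLEM and DRUG streams.

```python
def route(annotations, filter_elements, field="entity"):
    """Keep only annotations whose metadata[field] is in filter_elements."""
    wanted = set(filter_elements)
    return [a for a in annotations if a["metadata"].get(field) in wanted]


# Hypothetical annotations standing in for the sbert_embeddings column
sbert_embeddings = [
    {"result": "a cough", "metadata": {"entity": "PROBLEM"}},
    {"result": "Coumadin", "metadata": {"entity": "DRUG"}},
    {"result": "fever", "metadata": {"entity": "PROBLEM"}},
    {"result": "Amiodarone", "metadata": {"entity": "DRUG"}},
]

problem_embeddings = route(sbert_embeddings, ["PROBLEM"])
drug_embeddings = route(sbert_embeddings, ["DRUG"])

print([a["result"] for a in problem_embeddings])
print([a["result"] for a in drug_embeddings])
```

Each downstream resolver then sees only the stream it was trained for, which is why the ICD-10-CM resolver reads `problem_embeddings` and the RxNorm resolver reads `drug_embeddings`.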

SentenceEntityResolver

The model takes SENTENCE_EMBEDDINGS annotations, produced for example by BertSentenceEmbeddings, and returns the normalized entity for a particular trained ontology or curated dataset (e.g. ICD-10, RxNorm, SNOMED).

Parameters:

  • distanceFunction: Determines how the distance between different entities will be calculated. Either COSINE or EUCLIDEAN.

  • neighbours: The number of neighbours to consider when computing the distances.

  • caseSensitive: Whether to consider text casing or not.

  • threshold: Threshold of the distance between nodes to consider.

  • DoExceptionHandling: If set to True, the annotator tries to process as usual; if exception-causing data (e.g. a corrupted record or document) is passed to the annotator, it emits an exception warning that contains the exception message.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
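
The interplay of distanceFunction, neighbours and threshold can be illustrated with a small nearest-neighbour sketch in plain Python. This is not the library's implementation: the `resolve` helper, the two-dimensional vectors and the code names are made up for illustration, and treating threshold as a maximum allowed distance is an assumption.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


DISTANCES = {"EUCLIDEAN": euclidean, "COSINE": cosine_distance}


def resolve(query, knowledge_base, distance_function="EUCLIDEAN",
            neighbours=3, threshold=5.0):
    """Return up to `neighbours` (code, distance) pairs within `threshold`."""
    dist = DISTANCES[distance_function]
    scored = sorted((dist(query, vec), code) for code, vec in knowledge_base.items())
    return [(code, round(d, 4)) for d, code in scored[:neighbours] if d <= threshold]


# Hypothetical knowledge base: code -> embedding of its description
kb = {"CODE_A": [1.0, 0.0], "CODE_B": [0.0, 1.0], "CODE_C": [3.0, 4.0]}

print(resolve([1.0, 0.1], kb, "EUCLIDEAN", neighbours=2))
print(resolve([1.0, 0.1], kb, "COSINE", neighbours=2))
```

With these toy vectors the two distance functions rank the second candidate differently, because cosine distance ignores vector length while Euclidean distance does not.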

For a list of pretrained models, please see the Models Hub.

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: ENTITY

Python API: SentenceEntityResolverModel Scala API: SentenceEntityResolverModel Notebook: SentenceEntityResolverModelNotebook

from johnsnowlabs import nlp, medical 

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Test","Procedure"])

c2doc = nlp.Chunk2Doc()\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("ner_chunk_doc")

sbert_embedder = nlp.BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")

# Then the resolver is defined on the extracted entities and sentence embeddings
cpt_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_procedures_augmented","en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("cpt_code")\
    .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    c2doc,
    sbert_embedder,
    cpt_resolver])

text = """She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. CT scan of the chest also revealed a large mediastinal lymph node.
We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma.
At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy, which were performed, which revealed epithelioid malignant mesothelioma."""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# Show Results
+--------------------+---------+-----+----------+--------------------+--------------------+
|               chunk|   entity| code|confidence|       all_k_results|   all_k_resolutions|
+--------------------+---------+-----+----------+--------------------+--------------------+
|CT scan of the chest|     Test|62284|    0.2028|62284:::76497:::7...|Computed tomograp...|
|      pericardectomy|Procedure|33031|    0.3329|33031:::33025:::3...|Pericardectomy [P...|
|chest tube placement|Procedure|39503|    0.9343|39503:::32036:::3...|Insertion of ches...|
|drainage of the f...|Procedure|49405|    0.2476|49405:::49407:::4...|Drainage procedur...|
|        thoracoscopy|Procedure|32660|    0.1422|32660:::32667:::1...|Thoracoscopy [Tho...|
+--------------------+---------+-----+----------+--------------------+--------------------+
from johnsnowlabs import nlp, finance 

documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("ner_chunk") \
      .setOutputCol("sentence_embeddings")

resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models")\
      .setInputCols(["ner_chunk", "sentence_embeddings"]) \
      .setOutputCol("normalized")\
      .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

text = """CONTACT GOLD"""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# Show Results

|chunk       |result            |all_k_results                                                                                                                                                                                                                                  |all_k_resolutions                                                                                                                                                                                                                              |

|CONTACT GOLD|Contact Gold Corp.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|

from johnsnowlabs import nlp, legal 

documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en")\
      .setInputCols("ner_chunk")\
      .setOutputCol("sentence_embeddings")
    
resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models")\
      .setInputCols(["ner_chunk", "sentence_embeddings"])\
      .setOutputCol("irs_code")\
      .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

text = """CONTACT GOLD"""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# Show Results
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunk       |result            |code                                                                                         |all_k_results                                                                                                                                                                                                                                  |all_k_resolutions                                                                                                                                                                                                                              |

|CONTACT GOLD|Contact Gold Corp.|981369960:::0:::208273426:::204092640:::0:::0:::270531073:::261918920:::0:::271989147:::0:::0|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|
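
The candidate columns in the tables above are single strings joined with ":::". A minimal sketch of turning them back into ranked (code, resolution) pairs follows; `split_candidates` is a hypothetical helper, and the sample strings are the first four candidates from the table above, shortened for readability.

```python
def split_candidates(codes, resolutions, sep=":::"):
    """Pair each candidate code with its resolution text, in rank order."""
    return list(zip(codes.split(sep), resolutions.split(sep)))


codes = "981369960:::0:::208273426:::204092640"
resolutions = "Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US"

pairs = split_candidates(codes, resolutions)
for code, name in pairs:
    print(code, name)
```

The first pair is the top-ranked candidate, which matches the value shown in the result column.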


import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained()
  .setInputCols(Array("document")) 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence")) 
  .setOutputCol("token") 

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embeddings") 

val clinical_ner = MedicalNerModel.pretrained("jsl_ner_wip_clinical","en","clinical/models")
  .setInputCols(Array("sentence","token","embeddings")) 
  .setOutputCol("ner") 

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 
  .setWhiteList(Array("Test","Procedure")) 

val c2doc = new Chunk2Doc()
  .setInputCols(Array("ner_chunk")) 
  .setOutputCol("ner_chunk_doc") 

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") 
  .setInputCols(Array("ner_chunk_doc"))
  .setOutputCol("sbert_embeddings") 

// Then the resolver is defined on the extracted entities and sentence embeddings 

val cpt_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_procedures_augmented","en","clinical/models")
  .setInputCols(Array("sbert_embeddings")) 
  .setOutputCol("cpt_code") 
  .setDistanceFunction("EUCLIDEAN") 

val pipeline = new Pipeline().setStages(Array( 
                                              documentAssembler, 
                                              sentenceDetector, 
                                              tokenizer,
                                              word_embeddings, 
                                              clinical_ner, 
                                              ner_converter, 
                                              c2doc, 
                                              sbert_embedder, 
                                              cpt_resolver)) 


val text = "She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. CT scan of the chest also revealed a large mediastinal lymph node. We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma. At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy, which were performed, which revealed epithelioid malignant mesothelioma."

val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df) 

// Show Results
+--------------------+---------+-----+----------+--------------------+--------------------+
|               chunk|   entity| code|confidence|       all_k_results|   all_k_resolutions|
+--------------------+---------+-----+----------+--------------------+--------------------+
|CT scan of the chest|     Test|62284|    0.2028|62284:::76497:::7...|Computed tomograp...|
|      pericardectomy|Procedure|33031|    0.3329|33031:::33025:::3...|Pericardectomy [P...|
|chest tube placement|Procedure|39503|    0.9343|39503:::32036:::3...|Insertion of ches...|
|drainage of the f...|Procedure|49405|    0.2476|49405:::49407:::4...|Drainage procedur...|
|        thoracoscopy|Procedure|32660|    0.1422|32660:::32667:::1...|Thoracoscopy [Tho...|
+--------------------+---------+-----+----------+--------------------+--------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("ner_chunk") 

val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use","en")
  .setInputCols("ner_chunk") 
  .setOutputCol("sentence_embeddings") 

val resolver = SentenceEntityResolverModel.pretrained("finel_edgar_company_name","en","finance/models")
  .setInputCols(Array("ner_chunk","sentence_embeddings")) 
  .setOutputCol("normalized") 
  .setDistanceFunction("EUCLIDEAN") 

val pipeline = new Pipeline().setStages(Array(
                                            documentAssembler, 
                                            embeddings, 
                                            resolver)) 

val text = "CONTACT GOLD" 
val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df) 

// Show Results

|chunk       |result            |all_k_results                                                                                                                                                                                                                                  |all_k_resolutions                                                                                                                                                                                                                              |

|CONTACT GOLD|Contact Gold Corp.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("ner_chunk") 

val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use","en")
  .setInputCols("ner_chunk") 
  .setOutputCol("sentence_embeddings") 

val resolver = SentenceEntityResolverModel.pretrained("legel_edgar_company_name","en","legal/models")
  .setInputCols(Array("ner_chunk","sentence_embeddings")) 
  .setOutputCol("irs_code")
  .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(
                                            documentAssembler, 
                                            embeddings, 
                                            resolver)) 

val text = "CONTACT GOLD" 

val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)

// Show Results

|chunk       |result            |code                                                                                         |all_k_results                                                                                                                                                                                                                                  |all_k_resolutions                                                                                                                                                                                                                              |

|CONTACT GOLD|Contact Gold Corp.|981369960:::0:::208273426:::204092640:::0:::0:::270531073:::261918920:::0:::271989147:::0:::0|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|


Trains a SentenceEntityResolverModel that maps sentence embeddings to entities in a knowledge base.

Parameters:

  • labelCol: Column name for the value we are trying to resolve. Usually this contains the entity ID in the knowledge base (e.g., the ICD-10 code).

  • normalizedCol: Column name for the original, normalized description.

  • aux_label_col: Auxiliary label which maps resolved entities to additional labels

  • useAuxLabel: Whether to use the auxiliary column or not. Default value is False.

  • distanceFunction: Determines how the distance between different entities will be calculated.

  • confidenceFunction: Which function to use to calculate confidence: either INVERSE or SOFTMAX.

  • caseSensitive: Whether matching is case-sensitive; when False, case is ignored in tokens for embeddings matching (Default: False)

  • threshold: Threshold value for the last distance calculated (default: 5.0)

  • missAsEmpty: Whether or not to return an empty annotation on unmatched chunks (default: True)
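
To give a feel for how a confidenceFunction turns candidate distances into scores, here is a minimal sketch in plain Python. The exact formulas used by the library are not given on this page, so both functions below are assumptions: `inverse_confidence` weights candidates by 1/distance, and `softmax_confidence` applies a softmax over negated distances. In both cases a smaller distance yields a higher confidence and the scores sum to 1.

```python
import math


def inverse_confidence(distances):
    # Assumed form: weight proportional to 1/distance, normalised to sum to 1
    weights = [1.0 / (d + 1e-9) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_confidence(distances):
    # Assumed form: softmax over negated distances (shifted for stability)
    smallest = min(distances)
    exps = [math.exp(-(d - smallest)) for d in distances]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical distances of three candidates, nearest first
distances = [0.5, 1.0, 2.0]
inv = inverse_confidence(distances)
soft = softmax_confidence(distances)

print([round(c, 3) for c in inv])
print([round(c, 3) for c in soft])
```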

When finetuning an existing model, there are additional parameters:

  • pretrainedModelPath: Path to an already trained SentenceEntityResolverModel. This pretrained model will be used as a starting point for training the new one. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage path (S3).

  • overrideExistingCodes: Whether to override the existing codes with new data when continuing training from a pretrained model. Default value is False (keep all the codes).

  • dropCodesList: A list of codes in a pretrained model that will be omitted when the training process begins with a pretrained model.
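
One way to picture the finetuning options is as a merge between the codes already in the pretrained model and the codes in the new training data. The sketch below is an interpretation of the parameter descriptions above, not the library's code: `merge_codes` and the dictionary layout (code mapped to a list of normalized descriptions) are hypothetical, and the sample entries are illustrative.

```python
def merge_codes(pretrained, new_data, override_existing_codes=False, drop_codes_list=()):
    """Combine pretrained codes with new training data (assumed semantics)."""
    dropped = set(drop_codes_list)
    # Codes in the drop list are omitted before training continues
    merged = {c: list(d) for c, d in pretrained.items() if c not in dropped}
    for code, descriptions in new_data.items():
        if override_existing_codes or code not in merged:
            # New data replaces whatever the pretrained model had for this code
            merged[code] = list(descriptions)
        else:
            # Keep existing descriptions and append the unseen ones
            merged[code] += [d for d in descriptions if d not in merged[code]]
    return merged


pretrained = {"R05": ["cough"], "I30.1": ["infective pericarditis"], "A68": ["relapsing fevers"]}
new_data = {"R05": ["persistent cough"], "R07.9": ["chest pain"]}

kept = merge_codes(pretrained, new_data, drop_codes_list=["A68"])
overridden = merge_codes(pretrained, new_data, override_existing_codes=True)

print(kept)
print(overridden)
```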

You can find pretrained sentence embeddings (using BERT or another architecture) in the NLP Models Hub: https://nlp.johnsnowlabs.com/models?task=Embeddings

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: ENTITY

Python API: SentenceEntityResolverApproach Scala API: SentenceEntityResolverApproach Notebook: SentenceEntityResolverApproachNotebook

from johnsnowlabs import nlp, medical 

# Training a SNOMED resolution model using BERT sentence embeddings
# Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data and their labels.
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("normalized_text") \
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

bertEmbeddings = nlp.BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased") \
  .setInputCols(["sentence"]) \
  .setOutputCol("bert_embeddings")

snomedTrainingPipeline = nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  bertEmbeddings
])
snomedTrainingModel = snomedTrainingPipeline.fit(data)
snomedData = snomedTrainingModel.transform(data).cache()

# Then the Resolver can be trained with
bertExtractor = medical.SentenceEntityResolverApproach() \
  .setNeighbours(25) \
  .setThreshold(1000) \
  .setInputCols(["bert_embeddings"]) \
  .setNormalizedCol("normalized_text") \
  .setLabelCol("label") \
  .setOutputCol("snomed_code") \
  .setDistanceFunction("EUCLIDEAN") \
  .setCaseSensitive(False)

snomedModel = bertExtractor.fit(snomedData)
from johnsnowlabs import nlp, finance

# Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data and their labels.
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("normalized_text") \
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

bertEmbeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_large_cased") \
  .setInputCols(["sentence"]) \
  .setOutputCol("bert_embeddings")

preprocessing_pipeline = nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  bertEmbeddings
])
preprocessing_model = preprocessing_pipeline.fit(data)
processed_data = preprocessing_model.transform(data).cache()

# Then the Resolver can be trained with
bertExtractor = finance.SentenceEntityResolverApproach() \
  .setNeighbours(25) \
  .setThreshold(1000) \
  .setInputCols(["bert_embeddings"]) \
  .setNormalizedCol("normalized_text") \
  .setLabelCol("label") \
  .setOutputCol("snomed_code") \
  .setDistanceFunction("EUCLIDEAN") \
  .setCaseSensitive(False)

model = bertExtractor.fit(processed_data)
from johnsnowlabs import nlp, legal

# Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data and their labels.
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("normalized_text") \
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

bertEmbeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal") \
  .setInputCols(["sentence"]) \
  .setOutputCol("bert_embeddings")

preprocessing_pipeline = nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  bertEmbeddings
])
data_preprocessing_model = preprocessing_pipeline.fit(data)
processed_data = data_preprocessing_model.transform(data).cache()

# Then the Resolver can be trained with
bertExtractor = legal.SentenceEntityResolverApproach() \
  .setNeighbours(25) \
  .setThreshold(1000) \
  .setInputCols(["bert_embeddings"]) \
  .setNormalizedCol("normalized_text") \
  .setLabelCol("label") \
  .setOutputCol("snomed_code") \
  .setDistanceFunction("EUCLIDEAN") \
  .setCaseSensitive(False)

model = bertExtractor.fit(processed_data)

import spark.implicits._

// Training a SNOMED resolution model using BERT sentence embeddings
// Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data and their labels.
val documentAssembler = new DocumentAssembler()
   .setInputCol("normalized_text")
   .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

 val bertEmbeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")
   .setInputCols("sentence")
   .setOutputCol("bert_embeddings")

 val snomedTrainingPipeline = new Pipeline().setStages(Array(
   documentAssembler,
   sentenceDetector,
   bertEmbeddings
 ))

 val snomedTrainingModel = snomedTrainingPipeline.fit(data)
 val snomedData = snomedTrainingModel.transform(data).cache()

// Then the Resolver can be trained with
val bertExtractor = new SentenceEntityResolverApproach()
  .setNeighbours(25)
  .setThreshold(1000)
  .setInputCols("bert_embeddings")
  .setNormalizedCol("normalized_text")
  .setLabelCol("label")
  .setOutputCol("snomed_code")
  .setDistanceFunction("EUCLIDEAN")
  .setCaseSensitive(false)

val snomedModel = bertExtractor.fit(snomedData)
import spark.implicits._

// Training a SNOMED resolution model using BERT sentence embeddings
// Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data and their labels.
val documentAssembler = new DocumentAssembler()
   .setInputCol("normalized_text")
   .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

 val bertEmbeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")
   .setInputCols("sentence")
   .setOutputCol("bert_embeddings")
   
 val snomedTrainingPipeline = new Pipeline().setStages(Array(
   documentAssembler,
   sentenceDetector,
   bertEmbeddings
 ))
 val snomedTrainingModel = snomedTrainingPipeline.fit(data)
 val snomedData = snomedTrainingModel.transform(data).cache()

// Then the Resolver can be trained with
val bertExtractor = new SentenceEntityResolverApproach()
  .setNeighbours(25)
  .setThreshold(1000)
  .setInputCols("bert_embeddings")
  .setNormalizedCol("normalized_text")
  .setLabelCol("label")
  .setOutputCol("snomed_code")
  .setDistanceFunction("EUCLIDEAN")
  .setCaseSensitive(false)

val snomedModel = bertExtractor.fit(snomedData)
import spark.implicits._

// Training a SNOMED resolution model using BERT sentence embeddings
// Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data and their labels.
val documentAssembler = new DocumentAssembler()
   .setInputCol("normalized_text")
   .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

 val bertEmbeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")
   .setInputCols("sentence")
   .setOutputCol("bert_embeddings")

 val snomedTrainingPipeline = new Pipeline().setStages(Array(
   documentAssembler,
   sentenceDetector,
   bertEmbeddings
 ))
 val snomedTrainingModel = snomedTrainingPipeline.fit(data)
 val snomedData = snomedTrainingModel.transform(data).cache()

// Then the Resolver can be trained with
val bertExtractor = new SentenceEntityResolverApproach()
  .setNeighbours(25)
  .setThreshold(1000)
  .setInputCols("bert_embeddings")
  .setNormalizedCol("normalized_text")
  .setLabelCol("label")
  .setOutputCol("snomed_code")
  .setDistanceFunction("EUCLIDEAN")
  .setCaseSensitive(false)

val snomedModel = bertExtractor.fit(snomedData)

Summarizer

Model

Summarizer is an annotator that uses a generative deep learning model to create summaries of medical, finance, and legal texts. It helps to quickly condense complex information from related documents.

Parameters:

  • doSample: Whether or not to use sampling, use greedy decoding otherwise (Default: false)

  • ignoreTokenIds: A list of token ids which are ignored in the decoder’s output (Default: Array())

  • maxNewTokens: Maximum number of new tokens to be generated (Default: 30)

  • maxTextLength: Maximum length of context text.

  • noRepeatNgramSize: If set to int > 0, all ngrams of that size can only occur once (Default: 0)

  • randomSeed: Optional Random seed for the model.

  • refineChunkSize: How large refined chunks should be.

  • refineMaxAttempts: How many times chunks should be re-summarized while they are above refineSummaryTargetLength before stopping.

  • refineSummary: Set true to perform refined summarization at increased computation cost.

  • refineSummaryTargetLength: Target length for refined summary.

  • topK: The number of highest probability vocabulary tokens to keep for top-k-filtering (Default: 50)

  • useCache: Cache internal state of the model to improve performance
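
The decoding parameters topK and noRepeatNgramSize can be sketched in plain Python. This is a generic illustration of top-k filtering and n-gram blocking as commonly used in text generation, not the Summarizer's internal code; `top_k_filter`, `banned_next_tokens` and the toy vocabulary are hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise; the rest get 0."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: (probs[t] / total if t in keep else 0.0) for t in probs}


def banned_next_tokens(generated, ngram_size):
    """Tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size <= 0 or len(generated) < ngram_size - 1:
        return set()
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + ngram_size - 1]) == prefix:
            banned.add(generated[i + ngram_size - 1])
    return banned


# Toy next-token distribution
probs = {"ankle": 0.5, "foot": 0.3, "knee": 0.15, "arm": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)

# With ngram_size=2, "the patient" may not be generated a second time
print(banned_next_tokens(["the", "patient", "was", "the"], ngram_size=2))
```

When doSample is false, decoding simply takes the most probable remaining token at each step; sampling instead draws from the filtered distribution.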

Available models can be found at the Models Hub.
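
The refine parameters listed above (refineSummary, refineChunkSize, refineSummaryTargetLength, refineMaxAttempts) suggest an iterative chunk-and-resummarize loop. The sketch below is one reading of those descriptions, not the library's algorithm: `refine_summary` is hypothetical, lengths are counted in whitespace-separated words as an assumption, and `first_half` is a stand-in for a real summarization model.

```python
def refine_summary(text, summarize, chunk_size, target_length, max_attempts):
    """Re-summarize chunk by chunk until short enough or attempts run out."""
    words = text.split()
    for _ in range(max_attempts):
        if len(words) <= target_length:
            break
        chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
        words = [w for chunk in chunks for w in summarize(" ".join(chunk)).split()]
    return " ".join(words)


def first_half(chunk):
    # Stand-in summarizer: keeps the first half of the words in a chunk
    tokens = chunk.split()
    return " ".join(tokens[:max(1, len(tokens) // 2)])


text = " ".join(f"w{i}" for i in range(40))
summary = refine_summary(text, first_half, chunk_size=10, target_length=12, max_attempts=5)
print(len(summary.split()), "words")
```

Each extra pass costs another round of model calls, which is the increased computation cost that the refineSummary description mentions.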

For more extended examples on document pre-processing see the Spark NLP Workshop

Input Annotator Types: DOCUMENT

Output Annotator Type: CHUNK

Python API: MedicalSummarizer Scala API: MedicalSummarizer

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

summarizer = medical.Summarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxTextLength(512)\
    .setMaxNewTokens(512)

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        summarizer
])

text = """The patient is a pleasant 17-year-old gentleman who was playing basketball today in gym. Two hours prior to presentation, he started to fall and someone stepped on his ankle and kind of twisted his right ankle and he cannot bear weight on it now. It hurts to move or bear weight. No other injuries noted. He does not think he has had injuries to his ankle in the past.
SOCIAL HISTORY: He does not drink or smoke.
MEDICAL DECISION MAKING:
He had an x-ray of his ankle that showed a small ossicle versus avulsion fracture of the talonavicular joint on the lateral view. He has had no pain over the metatarsals themselves. This may be a fracture based upon his exam. He does want to have me to put him in a splint. He was given Motrin here. He will be discharged home to follow up with Dr. X from Orthopedics.
DISPOSITION: Crutches and splint were administered here. I gave him a prescription for Motrin and some Darvocet if he needs to length his sleep and if he has continued pain to follow up with Dr. X. Return if any worsening problems."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("summary.result").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[A 17-year-old man fell and twisted his right ankle, causing pain to move or bear weight. An x-ray showed a small ossicle or avulsion fracture of the talonavicular joint on the lateral view, which may be a fracture based upon his exam. He was given Motrin and discharged home with crutches and a prescription for Motrin and Darvocet. He was advised to follow up with his doctor if pain worsens and return if any worsening problems worsen.]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

summarizer = finance.Summarizer().pretrained('finsum_flant5_base','en','finance/models')\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxNewTokens(1000)

pipeline = nlp.Pipeline(stages=[document_assembler, summarizer])

data = spark.createDataFrame([["""Lost Time Incident Rate: 
The lost time incident rate per 200,000 hours worked in 2021 was 0.14, which decreased by 17.6% compared to 2020 (0.17) and decreased by 70.8% compared to 2019 (0.48). The decrease in the lost time incident rate can be attributed to the company's efforts to improve workplace safety and implement effective risk management strategies. 
The total Scope 2 GHG emissions in 2021 were 688,228 tonnes, which remained relatively stable compared to 2020. The company's efforts to transition to renewable energy sources have helped to minimize Scope 2 GHG emissions."""]]).toDF('text')

result = pipeline.fit(data).transform(data)

result.select("summary.result").show(truncate=False)
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

summarizer = legal.Summarizer().pretrained('legsum_flant5_legal_augmented','en','legal/models')\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxNewTokens(1000)

pipeline = nlp.Pipeline(stages=[document_assembler, summarizer])

data = spark.createDataFrame([
  ["""NOW, THEREFORE, in consideration of the Company’s disclosure of information to the Recipient
and the promises set forth below, the parties agree as follows:

     1. Confidential Information. “Confidential Information” as used in this
Agreement means all information relating to the Company disclosed to the Recipient by the Company,
including without limitation any business, technical, marketing, financial or other information,
whether in written, electronic or oral form. Any and all reproductions, copies, notes, summaries,
reports, analyses or other material derived by the Recipient or its Representatives (as defined
below) in whole or in part from the Confidential Information in whatever form maintained shall be
considered part of the Confidential Information itself and shall be treated as such. Confidential
Information does not include information that (a) is or becomes part of the public domain other
than as a result of disclosure by the Recipient or its Representatives; (b) becomes available to
the Recipient on a nonconfidential basis from a source other than the Company, provided that source
is not bound with respect to that information by a confidentiality agreement with the Company or is
otherwise prohibited from transmitting that information by a contractual, legal or other
obligation; (c) can be proven by the Recipient to have been in the Recipient’s possession prior to
disclosure of the same by the Company; or (d) is independently developed by the Recipient without
reference to or reliance on any of the Company’s Confidential Information."""]
]).toDF('text')

result = pipeline.fit(data).transform(data)

result.select("summary.result").show(truncate=False)


|result|

|[This legal agreement states that the company has disclosed all information relating to the company to the recipient, including any business, technical, marketing, financial or other information. It also states that any reproductions, copies, notes, summaries, reports, analyses or other material derived from the confidential information must be treated as part of the confidential information. The confidential information does not include information that is or becomes part of the public domain other than as a result of disclosure by the recipient or its representatives, becomes available to the recipient on a nonconfidential basis from a source other than the company, can be proven by the recipient to have been in the recipients possession prior to disclosure, or is independently developed by the recipient without reference to or reliance on any of the companys confidential information.]|


import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val summarizer = Summarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")
  .setInputCols(Array("document"))
  .setOutputCol("summary")
  .setMaxTextLength(512)
  .setMaxNewTokens(512)


val pipeline = new Pipeline().setStages(Array(documentAssembler, summarizer))

val text = """The patient is a pleasant 17-year-old gentleman who was playing basketball today in gym. Two hours prior to presentation, he started to fall and someone stepped on his ankle and kind of twisted his right ankle and he cannot bear weight on it now. It hurts to move or bear weight. No other injuries noted. He does not think he has had injuries to his ankle in the past.
SOCIAL HISTORY: He does not drink or smoke.
MEDICAL DECISION MAKING:
He had an x-ray of his ankle that showed a small ossicle versus avulsion fracture of the talonavicular joint on the lateral view. He has had no pain over the metatarsals themselves. This may be a fracture based upon his exam. He does want to have me to put him in a splint. He was given Motrin here. He will be discharged home to follow up with Dr. X from Orthopedics.
DISPOSITION: Crutches and splint were administered here. I gave him a prescription for Motrin and some Darvocet if he needs to length his sleep and if he has continued pain to follow up with Dr. X. Return if any worsening problems."""

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val summarizer = Summarizer.pretrained("finsum_flant5_base", "en", "finance/models")
  .setInputCols(Array("document"))
  .setOutputCol("summary")
  .setMaxNewTokens(1000)

val pipeline = new Pipeline().setStages(Array(documentAssembler, summarizer))

val text = """Lost Time Incident Rate: 
The lost time incident rate per 200,000 hours worked in 2021 was 0.14, which decreased by 17.6% compared to 2020 (0.17) and decreased by 70.8% compared to 2019 (0.48). The decrease in the lost time incident rate can be attributed to the company's efforts to improve workplace safety and implement effective risk management strategies. 
The total Scope 2 GHG emissions in 2021 were 688,228 tonnes, which remained relatively stable compared to 2020. The company's efforts to transition to renewable energy sources have helped to minimize Scope 2 GHG emissions."""

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val summarizer = Summarizer.pretrained("legsum_flant5_legal_augmented", "en", "legal/models")
  .setInputCols(Array("document"))
  .setOutputCol("summary")
  .setMaxNewTokens(1000)

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, summarizer))

val text = """NOW, THEREFORE, in consideration of the Company’s disclosure of information to the Recipient
and the promises set forth below, the parties agree as follows:

     1. Confidential Information. “Confidential Information” as used in this
Agreement means all information relating to the Company disclosed to the Recipient by the Company,
including without limitation any business, technical, marketing, financial or other information,
whether in written, electronic or oral form. Any and all reproductions, copies, notes, summaries,
reports, analyses or other material derived by the Recipient or its Representatives (as defined
below) in whole or in part from the Confidential Information in whatever form maintained shall be
considered part of the Confidential Information itself and shall be treated as such. Confidential
Information does not include information that (a) is or becomes part of the public domain other
than as a result of disclosure by the Recipient or its Representatives; (b) becomes available to
the Recipient on a nonconfidential basis from a source other than the Company, provided that source
is not bound with respect to that information by a confidentiality agreement with the Company or is
otherwise prohibited from transmitting that information by a contractual, legal or other
obligation; (c) can be proven by the Recipient to have been in the Recipient’s possession prior to
disclosure of the same by the Company; or (d) is independently developed by the Recipient without
reference to or reliance on any of the Company’s Confidential Information."""

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)


|result|

|[This legal agreement states that the company has disclosed all information relating to the company to the recipient, including any business, technical, marketing, financial or other information. It also states that any reproductions, copies, notes, summaries, reports, analyses or other material derived from the confidential information must be treated as part of the confidential information. The confidential information does not include information that is or becomes part of the public domain other than as a result of disclosure by the recipient or its representatives, becomes available to the recipient on a nonconfidential basis from a source other than the company, can be proven by the recipient to have been in the recipients possession prior to disclosure, or is independently developed by the recipient without reference to or reliance on any of the companys confidential information.]|


TFGraphBuilder

Model

The TFGraphBuilder annotator creates the TensorFlow graphs needed in a model training pipeline. It inspects the data and creates the proper graph if a suitable version of TensorFlow (>= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.

You can use this builder with MedicalNerApproach, FinanceNerApproach, LegalNerApproach, RelationExtractionApproach, AssertionDLApproach, and GenericClassifierApproach.

ATTENTION: Changing the parameters of TFGraphBuilder may affect the performance of the model you want to train.

Input Annotator Types: The columns passed to setInputCols depend on the value of setModelName.

Output Annotator Type: There is no output annotation. The graph is written to a file with a .pb extension, named by setGraphFile and saved in the folder set by setGraphFolder.

Python API: TFGraphBuilder
Show Example

from johnsnowlabs import nlp, medical

graph_folder = "./medical_graphs"
ner_graph_builder = medical.TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setGraphFolder(graph_folder)\
    .setIsLicensed(True)  # False -> for NerDLApproach
from johnsnowlabs import nlp, finance

graph_folder = "./finance_graphs"
ner_graph_builder = finance.TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setGraphFolder(graph_folder)\
    .setIsLicensed(True)  # False -> for NerDLApproach
from johnsnowlabs import nlp, legal

graph_folder = "./legal_graphs"
ner_graph_builder = legal.TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setGraphFolder(graph_folder)\
    .setIsLicensed(True)  # False -> for NerDLApproach

TextGenerator

Model

The Medical, Financial, and Legal Text Generators are specialized tools designed for text abstraction in their respective fields. The MedicalTextGenerator, based on the BioGPT model, excels in medical text abstraction, allowing users to provide prompts and contexts for tasks like disease explanation, paraphrasing medical context, or creating clinical notes for cancer patients. This model is adept at extracting relevant information due to its training on extensive medical data.

Similarly, the Financial and Legal Text Generators utilize the Flan-T5 model, an advanced version of the T5 model, for tasks in financial and legal text abstraction. Users can input prompts and contexts to receive high-quality summaries, document abstractions, and other text-based outputs. The Flan-T5 model’s training on a diverse range of texts ensures the generation of coherent and accurate content in these domains.

Parameters:

  • maxNewTokens: Maximum number of new tokens to generate, by default 30

  • maxContextLength: Maximum length of context text

  • configProtoBytes: ConfigProto from tensorflow, serialized into byte array.

  • doSample: Whether or not to use sampling; use greedy decoding otherwise, by default False

  • topK: The number of highest probability vocabulary tokens to consider, by default 1

  • noRepeatNgramSize: The number of tokens that can’t be repeated in the same order. Useful for preventing loops. The default is 0.

  • ignoreTokenIds: A list of token ids which are ignored in the decoder’s output, by default []

  • randomSeed: Set to positive integer to get reproducible results, by default None.

  • customPrompt: The only available variable is {DOCUMENT} and it is populated with the contents of the input document
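To make the decoding parameters above more concrete, here is a minimal plain-Python sketch. It is not the annotator's implementation: `fill_prompt` and `pick_next_token` are hypothetical helper names, and the token scores are made up. It only illustrates how a `{DOCUMENT}` placeholder could be filled, and how greedy decoding (doSample off) differs from sampling among the topK highest-scoring tokens with a fixed randomSeed.

```python
import random


def fill_prompt(custom_prompt, document):
    """Replace the {DOCUMENT} placeholder with the input document text."""
    return custom_prompt.replace("{DOCUMENT}", document)


def pick_next_token(scores, do_sample=False, top_k=1, seed=None):
    """Pick one token from a dict of token -> non-negative score.

    Hypothetical helper: greedy when do_sample is False, otherwise a
    weighted draw restricted to the top_k highest-scoring tokens.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not do_sample:
        return ranked[0][0]  # greedy decoding: always the best token
    candidates = ranked[:top_k]  # keep only the top_k tokens
    tokens = [token for token, _ in candidates]
    weights = [weight for _, weight in candidates]
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.choices(tokens, weights=weights, k=1)[0]


# Made-up scores for the token following "Covid 19 is"
scores = {"a": 0.5, "an": 0.3, "the": 0.2}

print(fill_prompt("Summarize the following text: {DOCUMENT}", "Covid 19 is"))
print(pick_next_token(scores))                                   # greedy
print(pick_next_token(scores, do_sample=True, top_k=2, seed=42))  # sampled
```

With sampling enabled and the same seed, repeated calls return the same token, which is the behaviour the randomSeed parameter is meant to provide.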

Available models can be found at the Models Hub.

For more extended examples on document pre-processing see the Spark NLP Workshop.

Input Annotator Types: DOCUMENT

Output Annotator Type: CHUNK

Python API: MedicalTextGenerator Scala API: MedicalTextGenerator
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("prompt")\
    .setOutputCol("document_prompt")

med_text_generator  = medical.TextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")\
    .setInputCols("document_prompt")\
    .setOutputCol("answer")\
    .setMaxNewTokens(256)\
    .setDoSample(True)\
    .setTopK(3)\
    .setRandomSeed(42)\
    .setStopAtEos(True)

pipeline = nlp.Pipeline(stages=[document_assembler, med_text_generator])

data = spark.createDataFrame([['Covid 19 is']]).toDF("prompt")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)

+--------------------------------------------------------------------------+
|result                                                                    |
+--------------------------------------------------------------------------+
|[Covid 19 is a pandemic that has affected the world's economy and health.]|
+--------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("prompt")

flant5 = finance.TextGenerator.pretrained("fingen_flant5_base","en","finance/models")\
    .setInputCols(["prompt"])\
    .setOutputCol("answer")\
    .setMaxNewTokens(150)\
    .setStopAtEos(True)

pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

data = spark.createDataFrame([["Explain what is Sec 10-k filing"]]).toDF('text')

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------+
|result                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------+
|[Sec 10k filing is a form of tax filing that requires a party to file jointly or several entities for tax purposes.]|
+--------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("prompt")

flant5 = legal.TextGenerator.pretrained("leggen_flant5_finetuned","en","legal/models")\
    .setInputCols(["prompt"])\
    .setOutputCol("answer")\
    .setMaxNewTokens(200)\
    .setTopK(3)\
    .setRandomSeed(42)\
    .setNoRepeatNgramSize(3)\
    .setStopAtEos(True)
 
pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

data = spark.createDataFrame([["This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission."]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result                                                                                                                                                                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission. The redacted material is confidential |
| and will not be disclosed to any third party without the prior written consent of the parties.                                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("prompt")
  .setOutputCol("document_prompt")

val medTextGenerator = TextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")
  .setInputCols(Array("document_prompt"))
  .setOutputCol("answer")
  .setMaxNewTokens(256)
  .setDoSample(true)
  .setTopK(3)
  .setRandomSeed(42)
  .setStopAtEos(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, medTextGenerator))

val data = Seq("Covid 19 is").toDS.toDF("prompt")

val result = pipeline.fit(data).transform(data)

+--------------------------------------------------------------------------+
|result                                                                    |
+--------------------------------------------------------------------------+
|[Covid 19 is a pandemic that has affected the world's economy and health.]|
+--------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("prompt")

val flanT5 = TextGenerator.pretrained("fingen_flant5_base", "en", "finance/models")
  .setInputCols(Array("prompt"))
  .setOutputCol("answer")
  .setMaxNewTokens(150)
  .setStopAtEos(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, flanT5))

val data = Seq("Explain what is Sec 10-k filing").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

+--------------------------------------------------------------------------------------------------------------------+
|result                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------+
|[Sec 10k filing is a form of tax filing that requires a party to file jointly or several entities for tax purposes.]|
+--------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("prompt")

val flanT5 = TextGenerator.pretrained("leggen_flant5_finetuned", "en", "legal/models")
  .setInputCols(Array("prompt"))
  .setOutputCol("answer")
  .setMaxNewTokens(200)
  .setTopK(3)
  .setRandomSeed(42)
  .setNoRepeatNgramSize(3)
  .setStopAtEos(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, flanT5))

val data = Seq("This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)


+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result                                                                                                                                                                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission. The redacted material is confidential |
| and will not be disclosed to any third party without the prior written consent of the parties.                                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

TextMatcherInternal

Model

This annotator matches exact phrases provided in a file against a document.

Parameters:

  • setEntities (str): Sets the external resource for the entities. It takes path (str), the path to the external resource; read_as (str, optional), how to read the resource, by default ReadAs.TEXT; and options (dict, optional), options for reading the resource, by default {“format”: “text”}.

  • setCaseSensitive (Boolean): Sets whether matching is case sensitive. (Default: True)

  • setMergeOverlapping (Boolean): Sets whether to merge overlapping matched chunks. (Default: False)

  • setEntityValue (str): Sets the value for the entity metadata field. This value is used for any phrase that has no entity value set in the file.

  • setBuildFromTokens (Boolean): Sets whether the TextMatcherInternal should take the CHUNK from TOKEN.

  • setDelimiter (str): Sets the delimiter that separates the phrase from the entity in the file.
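The sketch below shows, in plain Python, the kind of matching these parameters describe. It is an illustration under stated assumptions, not the annotator's implementation: `parse_entities` and `match_phrases` are hypothetical helpers, matching is done with regular expressions on the raw text rather than on tokens, and the end index is inclusive, as in the example results further down.

```python
import re


def parse_entities(raw, delimiter="#", default_label=""):
    """Parse 'phrase<delimiter>entity' lines into (phrase, label) pairs.

    default_label plays the role of the entity value for lines that
    carry no entity after the delimiter (hypothetical simplification).
    """
    pairs = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        phrase, _, label = line.partition(delimiter)
        pairs.append((phrase.strip(), label.strip() or default_label))
    return pairs


def match_phrases(text, pairs, case_sensitive=True):
    """Return (chunk, begin, end, label) tuples with an inclusive end index."""
    flags = 0 if case_sensitive else re.IGNORECASE
    seen = set()
    matches = []
    for phrase, label in pairs:
        # Only match whole phrases, not fragments inside longer words
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags):
            key = (m.start(), m.end() - 1, label)
            if key in seen:
                continue  # same span already found through another entry
            seen.add(key)
            matches.append((m.group(0), m.start(), m.end() - 1, label))
    return sorted(matches, key=lambda item: (item[1], item[2]))


pairs = parse_entities("Aspirin 100mg#Drug\naspirin#Drug\nparacetamol#Drug\n")
text = "He took aspirin 100mg and paracetamol."
for chunk in match_phrases(text, pairs, case_sensitive=False):
    print(chunk)
```

Overlapping matches such as "aspirin" and "aspirin 100mg" are both kept here, which corresponds to leaving overlapping chunks unmerged.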

See Spark NLP Workshop for more examples of usage.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Show Example

from johnsnowlabs import nlp, medical

matcher_drug = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
"""

with open ('matcher_drug.csv', 'w') as f:
  f.write(matcher_drug)

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

entityExtractor = medical.TextMatcherInternal()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv")\
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")\
    .setMergeOverlapping(False)

matcher_pipeline = nlp.Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

data = spark.createDataFrame([["John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."]]).toDF("text")

matcher_model = matcher_pipeline.fit(data)
result = matcher_model.transform(data)

# result
+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|      aspirin|   25| 31| Drug|
|aspirin 100mg|   25| 37| Drug|
|  paracetamol|   75| 85| Drug|
|  amoxicillin|  102|112| Drug|
|    ibuprofen|  134|142| Drug|
| lansoprazole|  170|181| Drug|
+-------------+-----+---+-----+

import spark.implicits._

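// Write the phrase#entity pairs to matcher_drug.csv, one pair per line
import java.nio.file.{Files, Paths}

val matcherDrug =
  """Aspirin 100mg#Drug
    |aspirin#Drug
    |paracetamol#Drug
    |amoxicillin#Drug
    |ibuprofen#Drug
    |lansoprazole#Drug
    |""".stripMargin

Files.write(Paths.get("matcher_drug.csv"), matcherDrug.getBytes("UTF-8"))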

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val entityExtractor = new TextMatcherInternal()
  .setInputCols(Array("document", "token"))
  .setOutputCol("matched_text")
  .setEntities("matcher_drug.csv")
  .setCaseSensitive(false)
  .setDelimiter("#")
  .setMergeOverlapping(false)

val matcherPipeline = new Pipeline()
  .setStages(Array(documentAssembler, 
                   tokenizer, 
                   entityExtractor))

val data = Seq("John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD.")
  .toDF("text")

val matcherModel = matcherPipeline.fit(data)
val result = matcherModel.transform(data)


# result
+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|      aspirin|   25| 31| Drug|
|aspirin 100mg|   25| 37| Drug|
|  paracetamol|   75| 85| Drug|
|  amoxicillin|  102|112| Drug|
|    ibuprofen|  134|142| Drug|
| lansoprazole|  170|181| Drug|
+-------------+-----+---+-----+

WindowedSentenceModel

Model

This annotator merges each sentence with its previous and following sentences, so that every sentence carries its surrounding context. This is especially useful for context-rich analyses that require a deeper understanding of the language being used.

Inferring the class of sentence X alone can be much harder, due to the lack of context, than inferring the class of sentence X-1 + sentence X + sentence X+1. In this example the window is 1, so each sentence is augmented with one neighbour before it and one after it. The window size can be configured so that each sentence gets a number of previous and following sentences as context equal to the window size.

Parameters:

  • setWindowSize: Sets size of the sliding window.

  • setGlueString: Sets string to use to join the neighboring elements together.
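The windowing behaviour can be sketched in a few lines of plain Python. This is a simplified illustration, not the annotator's code: `windowed_sentences` is a hypothetical helper, and the assumption is that sentences at the start and end of the text simply get fewer neighbours, which is what the example results below show.

```python
def windowed_sentences(sentences, window_size=1, glue=" "):
    """Join each sentence with up to window_size neighbours on each side."""
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - window_size)                  # clip at the first sentence
        end = min(len(sentences), i + window_size + 1)   # clip at the last sentence
        windows.append(glue.join(sentences[start:end]))
    return windows


sentences = [
    "The patient was admitted on Monday.",
    "She has a right-sided pleural effusion for thoracentesis.",
    "Her Coumadin was placed on hold.",
]

# One output string per input sentence, each with its surrounding context
for window in windowed_sentences(sentences, window_size=1):
    print(window)
```

The `glue` argument corresponds to the string used to join neighbouring sentences, and `window_size` to the number of sentences taken from each side.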

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: WindowedSentenceModel Scala API: WindowedSentenceModel
Show Example

from johnsnowlabs import medical, nlp
from pyspark.sql import functions as F

documentAssembler =  nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector =  nlp.SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

windowedSentence1 =  medical.WindowedSentenceModel()\
    .setWindowSize(1)\
    .setInputCols("sentence")\
    .setOutputCol("window_1")

windowedSentence2 =  medical.WindowedSentenceModel()\
    .setWindowSize(2)\
    .setInputCols("sentence")\
    .setOutputCol("window_2")

pipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    sentenceDetector, 
    windowedSentence1, 
    windowedSentence2
    ])


sample_text = """The patient was admitted on Monday. 
She has a right-sided pleural effusion for thoracentesis. 
Her Coumadin was placed on hold.
A repeat echocardiogram was checked. 
She was started on prophylaxis for DVT. 
Her CT scan from March 2006 prior to her pericardectomy. 
It already shows bilateral plural effusions."""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Example results

result.select(F.explode('window_1')).select('col.result').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis.                                                |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold.               |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked.              |
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT.                                |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy.        |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.|
|Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------+

result.select(F.explode('window_2')).select('col.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold.                                                                                                  |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked.                                                             |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT.                     |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy.|
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.             |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                              |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
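The windowing visible in these tables can be approximated in plain Python. The sketch below is an illustration only, not the annotator's implementation: `windowed_sentences` is a hypothetical helper, and it assumes that each sentence is joined with up to `window_size` neighbours on each side, separated by a single space.

```python
def windowed_sentences(sentences, window_size):
    """Join each sentence with up to `window_size` neighbours on each side."""
    windows = []
    for i in range(len(sentences)):
        # Clamp the window at the start and end of the document.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


sentences = [
    "The patient was admitted on Monday.",
    "She has a right-sided pleural effusion for thoracentesis.",
    "Her Coumadin was placed on hold.",
    "A repeat echocardiogram was checked.",
    "She was started on prophylaxis for DVT.",
    "Her CT scan from March 2006 prior to her pericardectomy.",
    "It already shows bilateral plural effusions.",
]

window_1 = windowed_sentences(sentences, 1)
window_2 = windowed_sentences(sentences, 2)
```

Under these assumptions, `window_1` and `window_2` each contain one entry per input sentence, and windows near the document boundaries are shorter because there are fewer neighbours to include.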
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F

doc_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("isolated_sentence")

context_window = legal.WindowedSentenceModel()\
    .setInputCols(["isolated_sentence"])\
    .setOutputCol("window")\
    .setWindowSize(1)

window_splitting_pipeline = nlp.Pipeline(stages=[doc_assembler, sentence_detector, context_window])

window_splitting_model = window_splitting_pipeline.fit(df)

window_splitting_lp = nlp.LightPipeline(window_splitting_model)

## Result

['1  \nMUTUAL NONDISCLOSURE AGREEMENT  \nThis Mutual Nondisclosure Agreement (the “Agreement”) is made on _________ (“Effective  \nDate”) by and between:  \n(1) John Snow Labs, a Delaware corporation, registered at 16192 Coastal Highway,  \nLewes, Delaware 19958 (“John Snow Labs”), and   \n(2) Achiles, S.L, a Spanish corporation, registered at Gran Via, 2º floor, Offices 9\nand 10.(“Company”),  \n(each a “party” and together the “parties”). Recitals:  \nJohn Snow Labs and Company intend to explore the possibility of a business relationship  \nbetween each other, whereby each party (“Discloser”) may disclose sensitive information to the  \nother party (“Recipient”).',
 '1  \nMUTUAL NONDISCLOSURE AGREEMENT  \nThis Mutual Nondisclosure Agreement (the “Agreement”) is made on _________ (“Effective  \nDate”) by and between:  \n(1) John Snow Labs, a Delaware corporation, registered at 16192 Coastal Highway,  \nLewes, Delaware 19958 (“John Snow Labs”), and   \n(2) Achiles, S.L, a Spanish corporation, registered at Gran Via, 2º floor, Offices 9\nand 10.(“Company”),  \n(each a “party” and together the “parties”). Recitals:  \nJohn Snow Labs and Company intend to explore the possibility of a business relationship  \nbetween each other, whereby each party (“Discloser”) may disclose sensitive information to the  \nother party (“Recipient”). The parties agree as follows:',
 'Recitals:  \nJohn Snow Labs and Company intend to explore the possibility of a business relationship  \nbetween each other, whereby each party (“Discloser”) may disclose sensitive information to the  \nother party (“Recipient”). The parties agree as follows: 1. Definition.',]

import spark.implicits._

val documentAssembler =  new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector =  new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val windowedSentence1 =  new WindowedSentenceModel()
    .setWindowSize(1)
    .setInputCols("sentence")
    .setOutputCol("window_1")

val windowedSentence2 =  new WindowedSentenceModel()
    .setWindowSize(2)
    .setInputCols("sentence")
    .setOutputCol("window_2")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    sentenceDetector, 
    windowedSentence1, 
    windowedSentence2
))


val testDataset = Seq("The patient was admitted on Monday. " +
  "She has a right-sided pleural effusion for thoracentesis. " +
  "Her Coumadin was placed on hold. " +
  "A repeat echocardiogram was checked. " +
  "She was started on prophylaxis for DVT. " +
  "Her CT scan from March 2006 prior to her pericardectomy. " +
  "It already shows bilateral plural effusions.").toDF("text")

val result = pipeline.fit(testDataset).transform(testDataset)

// Result

// window 1

+---------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis.                                                |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold.               |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked.              |
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT.                                |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy.        |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.|
|Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------+

// window 2

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold.                                                                                                  |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked.                                                             |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT.                     |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy.|
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.             |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                              |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val doc_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("isolated_sentence")

val context_window = new WindowedSentenceModel()
    .setInputCols("isolated_sentence")
    .setOutputCol("window")
    .setWindowSize(1)

val window_splitting_pipeline = new Pipeline().setStages(Array(
    doc_assembler, 
    sentence_detector, 
    context_window))

val window_splitting_model = window_splitting_pipeline.fit(df)

val window_splitting_lp = new LightPipeline(window_splitting_model)

ZeroShotNerModel

Model

ZeroShotNerModel is a zero-shot named entity recognition annotator based on RoBertaForQuestionAnswering. Zero-shot models excel at generalization: they can accurately predict entities in very different datasets without fine-tuning or training from scratch for each domain.

Even though a model trained to solve a specific problem can achieve better accuracy than a zero-shot model on that specific task, it probably won't be useful for a different task. That is where zero-shot models show their usefulness, achieving good results in many different scenarios.

Parameters:

  • entityDefinitions: A dictionary with definitions of the named entities. The keys of the dictionary are the entity types and the values are lists of hypothesis templates.

  • predictionThreshold: Minimal confidence score to consider the entity (Default: 0.01).

  • ignoreEntities: A list of entities to be discarded from the output.

All the parameters can be set using the corresponding set method in camel case. For example, .setPredictionThreshold().
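To make the effect of the threshold and the ignore list concrete, here is a minimal plain-Python sketch. It is not the library's internal logic: `filter_entities` is a hypothetical helper, the "at or above the threshold" comparison is an assumption, and the candidate scores are copied from the first result table in the example below.

```python
# Candidate predictions as (chunk, label, confidence) tuples.
# Values are taken from the example output further down in this section.
candidates = [
    ("Majezik", "DRUG", 0.64671576),
    ("severe headache", "PROBLEM", 0.5526346),
    ("colon cancer", "PROBLEM", 0.8898498),
    ("27 years old", "PATIENT_AGE", 0.6943085),
    ("Sep 1st", "ADMISSION_DATE", 0.95646095),
    ("a right-sided pleural effusion for thoracentesis", "PROBLEM", 0.50026613),
]


def filter_entities(candidates, prediction_threshold=0.01, ignore_entities=()):
    """Keep candidates at or above the threshold whose label is not ignored.

    Hypothetical helper: illustrates the intent of the parameters only.
    """
    ignored = set(ignore_entities)
    return [
        (chunk, label, score)
        for chunk, label, score in candidates
        if score >= prediction_threshold and label not in ignored
    ]


kept = filter_entities(
    candidates, prediction_threshold=0.6, ignore_entities=["PATIENT_AGE"]
)
kept_chunks = [chunk for chunk, _, _ in kept]
```

With a threshold of 0.6 and PATIENT_AGE ignored, only the Majezik, colon cancer, and Sep 1st candidates remain in this sketch. Raising the threshold trades recall for precision; the ignore list drops whole entity types regardless of score.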

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: ZeroShotNerModel Scala API: ZeroShotNerModel Notebook: ZeroShotNerModelNotebook

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

zero_shot_ner = medical.ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")\
    .setEntityDefinitions(
        {
            "PROBLEM": ["What is the disease?", "What is his symptom?", "What is her disease?", "What is his disease?",
                        "What is the problem?" ,"What does a patient suffer", 'What was the reason that the patient is admitted to the clinic?'],
            "DRUG": ["Which drug?", "Which is the drug?", "What is the drug?", "Which drug does he use?", "Which drug does she use?", "Which drug do I use?", "Which drug is prescribed for a symptom?"],
            "ADMISSION_DATE": ["When did patient admitted to a clinic?"],
            "PATIENT_AGE": ["How old is the patient?","What is the gae of the patient?"]
        })\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setPredictionThreshold(0.1) # default 0.01

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    zero_shot_ner,
    ner_converter])

text_list = ["The doctor pescribed Majezik for my severe headache.",
             "The patient was admitted to the hospital for his colon cancer.",
             "27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis."
            ]

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

data = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show(50, truncate=100)

+------------------------------------------------+--------------+----------+
|                                           chunk|     ner_label|confidence|
+------------------------------------------------+--------------+----------+
|                                         Majezik|          DRUG|0.64671576|
|                                 severe headache|       PROBLEM| 0.5526346|
|                                    colon cancer|       PROBLEM| 0.8898498|
|                                    27 years old|   PATIENT_AGE| 0.6943085|
|                                         Sep 1st|ADMISSION_DATE|0.95646095|
|a right-sided pleural effusion for thoracentesis|       PROBLEM|0.50026613|
+------------------------------------------------+--------------+----------+
from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

textsplitter = finance.TextSplitter()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

zero_shot_ner = finance.ZeroShotNerModel.pretrained("finner_roberta_zeroshot", "en", "finance/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setEntityDefinitions(
        {
            "DATE": ['When was the company acquisition?', 'When was the company purchase agreement?'],
            "ORG": ["Which company was acquired?"],
            "PRODUCT": ["Which product?"],
            "PROFIT_INCREASE": ["How much has the gross profit increased?"],
            "REVENUES_DECLINED": ["How much has the revenues declined?"],
            "OPERATING_LOSS_2020": ["Which was the operating loss in 2020"],
            "OPERATING_LOSS_2019": ["Which was the operating loss in 2019"]
        })

ner_converter = finance.NerConverterInternal()\
  .setInputCols(["sentence", "token", "zero_shot_ner"])\
  .setOutputCol("ner_chunk")

pipeline =  nlp.Pipeline(stages=[
  documentAssembler,
  textsplitter,
  tokenizer,
  zero_shot_ner,
  ner_converter
  ]
)

from pyspark.sql import functions as F
from pyspark.sql.types import StringType
text_list = ["In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
              "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
              "While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020 as compared to 2019.",
              "We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of $7,738,193 in 2019."]

data = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(50, truncate=100)


+------------------+-------------------+
|chunk             |ner_label          |
+------------------+-------------------+
|March 2012        |DATE               |
|Vertro            |ORG                |
|ALOT              |PRODUCT            |
|February 2017     |DATE               |
|NetSeer           |ORG                |
|81.4%             |PROFIT_INCREASE    |
|27%               |REVENUES_DECLINED  |
|$8,048,581 million|OPERATING_LOSS_2020|
|$7,738,193        |OPERATING_LOSS_2019|
|2019              |DATE               |
+------------------+-------------------+
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

textSplitter = legal.TextSplitter()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

zero_shot_ner = legal.ZeroShotNerModel.pretrained("legner_roberta_zeroshot", "en", "legal/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("zero_shot_ner")\
  .setEntityDefinitions(
        {
            "DATE": ['When was the company acquisition?', 'When was the company purchase agreement?', "When was the agreement?"],
            "ORG": ["Which company?"],
            "STATE": ["Which state?"],
            "AGREEMENT": ["What kind of agreement?"],
            "LICENSE": ["What kind of license?"],
            "LICENSE_RECIPIENT": ["To whom the license is granted?"]
        })
    
ner_converter = legal.NerConverterInternal()\
  .setInputCols(["sentence", "token", "zero_shot_ner"])\
  .setOutputCol("ner_chunk")

pipeline =  nlp.Pipeline(stages=[
  documentAssembler,
  textSplitter,
  tokenizer,
  zero_shot_ner,
  ner_converter
  ]
)

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

text_list = [
    "In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
    "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
    "This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company (the 'Licensee')",
    "The Company hereby grants to Seller a perpetual, non- exclusive, royalty-free license",
]

data = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(50, truncate=100)

+-------------------------------------+-----------------+
|chunk                                |ner_label        |
+-------------------------------------+-----------------+
|March 2012                           |DATE             |
|Vertro, Inc                          |ORG              |
|February 2017                        |DATE             |
|asset purchase agreement             |AGREEMENT        |
|NetSeer                              |ORG              |
|INTELLECTUAL PROPERTY                |AGREEMENT        |
|December 31, 2018                    |DATE             |
|Armstrong Flooring                   |LICENSE_RECIPIENT|
|Delaware                             |STATE            |
|AFI Licensing LLC, a Delaware company|LICENSE_RECIPIENT|
|Seller                               |LICENSE_RECIPIENT|
|perpetual                            |LICENSE          |
|non- exclusive                       |LICENSE          |
|royalty-free                         |LICENSE          |
+-------------------------------------+-----------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val zeroShotNer = ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")
  .setEntityDefinitions(Map(
    "PROBLEM" -> Seq("What is the disease?", "What is his symptom?", "What is her disease?", "What is his disease?",
                     "What is the problem?" ,"What does a patient suffer", "What was the reason that the patient is admitted to the clinic?"),
    "DRUG" -> Seq("Which drug?", "Which is the drug?", "What is the drug?", "Which drug does he use?", "Which drug does she use?", "Which drug do I use?", "Which drug is prescribed for a symptom?"),
    "ADMISSION_DATE" -> Seq("When did patient admitted to a clinic?"),
    "PATIENT_AGE" -> Seq("How old is the patient?", "What is the gae of the patient?")
  ))
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("zero_shot_ner")
  .setPredictionThreshold(0.1)

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "zero_shot_ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    zeroShotNer, 
    nerConverter))

val textList = Seq(
  "The doctor pescribed Majezik for my severe headache.",
  "The patient was admitted to the hospital for his colon cancer.",
  "27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis."
).toDS.toDF("text")

val result = pipeline.fit(textList).transform(textList)

  
+------------------------------------------------+--------------+----------+
|                                           chunk|     ner_label|confidence|
+------------------------------------------------+--------------+----------+
|                                         Majezik|          DRUG|0.64671576|
|                                 severe headache|       PROBLEM| 0.5526346|
|                                    colon cancer|       PROBLEM| 0.8898498|
|                                    27 years old|   PATIENT_AGE| 0.6943085|
|                                         Sep 1st|ADMISSION_DATE|0.95646095|
|a right-sided pleural effusion for thoracentesis|       PROBLEM|0.50026613|
+------------------------------------------------+--------------+----------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val textsplitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val zero_shot_ner = ZeroShotNerModel.pretrained("finner_roberta_zeroshot", "en", "finance/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("zero_shot_ner")
  .setEntityDefinitions(
    Map(
      "DATE" -> Seq("When was the company acquisition?", "When was the company purchase agreement?"),
      "ORG" -> Seq("Which company was acquired?"),
      "PRODUCT" -> Seq("Which product?"),
      "PROFIT_INCREASE" -> Seq("How much has the gross profit increased?"),
      "REVENUES_DECLINED" -> Seq("How much has the revenues declined?"),
      "OPERATING_LOSS_2020" -> Seq("Which was the operating loss in 2020"),
      "OPERATING_LOSS_2019" -> Seq("Which was the operating loss in 2019")
    )
  )

val ner_converter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "zero_shot_ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  textsplitter,
  tokenizer,
  zero_shot_ner,
  ner_converter
))

val text_list = Seq(
  "In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
  "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
  "While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020 as compared to 2019.",
  "We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of $7,738,193 in 2019."
).toDS.toDF("text")

val result = pipeline.fit(text_list).transform(text_list)

+------------------+-------------------+
|chunk             |ner_label          |
+------------------+-------------------+
|March 2012        |DATE               |
|Vertro            |ORG                |
|ALOT              |PRODUCT            |
|February 2017     |DATE               |
|NetSeer           |ORG                |
|81.4%             |PROFIT_INCREASE    |
|27%               |REVENUES_DECLINED  |
|$8,048,581 million|OPERATING_LOSS_2020|
|$7,738,193        |OPERATING_LOSS_2019|
|2019              |DATE               |
+------------------+-------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val textSplitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val zeroShotNer = ZeroShotNerModel.pretrained("legner_roberta_zeroshot", "en", "legal/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("zero_shot_ner")
  .setEntityDefinitions(Map(
    "DATE" -> Seq("When was the company acquisition?", "When was the company purchase agreement?", "When was the agreement?"),
    "ORG" -> Seq("Which company?"),
    "STATE" -> Seq("Which state?"),
    "AGREEMENT" -> Seq("What kind of agreement?"),
    "LICENSE" -> Seq("What kind of license?"),
    "LICENSE_RECIPIENT" -> Seq("To whom the license is granted?")
  ))

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "zero_shot_ner"))
  .setOutputCol("ner_chunk")


val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    textSplitter,
    tokenizer,
    zeroShotNer,
    nerConverter
  ))

val textList = Seq(
  "In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
  "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
  "This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company (the 'Licensee')",
  "The Company hereby grants to Seller a perpetual, non-exclusive, royalty-free license"
).toDS.toDF("text")

val result = pipeline.fit(textList).transform(textList)

+-------------------------------------+-----------------+
|chunk                                |ner_label        |
+-------------------------------------+-----------------+
|March 2012                           |DATE             |
|Vertro, Inc                          |ORG              |
|February 2017                        |DATE             |
|asset purchase agreement             |AGREEMENT        |
|NetSeer                              |ORG              |
|INTELLECTUAL PROPERTY                |AGREEMENT        |
|December 31, 2018                    |DATE             |
|Armstrong Flooring                   |LICENSE_RECIPIENT|
|Delaware                             |STATE            |
|AFI Licensing LLC, a Delaware company|LICENSE_RECIPIENT|
|Seller                               |LICENSE_RECIPIENT|
|perpetual                            |LICENSE          |
|non-exclusive                        |LICENSE          |
|royalty-free                         |LICENSE          |
+-------------------------------------+-----------------+

ZeroShotRelationExtractionModel

Model

ZeroShotRelationExtractionModel implements zero-shot binary relation extraction by utilizing BERT transformer models trained on the NLI (Natural Language Inference) task.

The model inputs consist of documents/sentences and paired NER chunks, usually obtained by RENerChunksFilter. The relations to extract are defined by a dictionary structure that specifies a set of statements about the relationship between named entities.

These statements are automatically appended to each document in the dataset, and the NLI model is used to determine whether a particular relationship holds between the entities.
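As a rough illustration of how such statements can be derived from a pair of chunks, the sketch below fills hypothesis templates with the chunk texts. It is plain Python, not the annotator's code: `build_hypotheses` is a hypothetical helper, and skipping templates that mention an entity type missing from the pair is an assumption.

```python
# Templates keyed by relation label; placeholders are entity types.
relational_categories = {
    "ADE": ["{DRUG} causes {PROBLEM}."],
    "IMPROVE": ["{DRUG} improves {PROBLEM}.", "{DRUG} cures {PROBLEM}."],
    "REVEAL": ["{TEST} reveals {PROBLEM}."],
}


def build_hypotheses(chunk1, chunk2, categories):
    """Fill every applicable template for one chunk pair.

    chunk1 and chunk2 are (text, entity_label) tuples.
    Returns a list of (relation, hypothesis) tuples.
    """
    fillers = {chunk1[1]: chunk1[0], chunk2[1]: chunk2[0]}
    hypotheses = []
    for relation, templates in categories.items():
        for template in templates:
            try:
                hypotheses.append((relation, template.format(**fillers)))
            except KeyError:
                # The template needs an entity type this pair does not contain.
                continue
    return hypotheses


pairs = build_hypotheses(
    ("Paracetamol", "DRUG"), ("sickness", "PROBLEM"), relational_categories
)
```

For the DRUG/PROBLEM pair above, the sketch yields the ADE and IMPROVE hypotheses and skips REVEAL, which requires a TEST entity. Each resulting hypothesis would then be scored by the NLI model against the source sentence.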

Parameters:

  • relationalCategories: A dictionary with definitions of relational categories. The keys of dictionary are the relation labels and the values are lists of hypothesis templates.

  • predictionThreshold: Minimal confidence score to encode a relation (Default: 0.5)

  • multiLabel: Whether or not a pair of entities can be categorized by multiple relations (Default: False).

All the parameters can be set using the corresponding set method in camel case. For example, .setMultiLabel().
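The interaction between predictionThreshold and multiLabel can be pictured with a small plain-Python sketch. Everything here is an assumption made for illustration: `select_relations` is a hypothetical helper, the scores are invented, and the idea that only the best-scoring relation is kept when multiLabel is disabled is a guess at the intended behaviour rather than documented fact.

```python
def select_relations(scored, prediction_threshold=0.5, multi_label=False):
    """Pick relation labels for one entity pair.

    scored is a list of (relation, confidence) tuples (hypothetical scores).
    """
    passing = [(rel, conf) for rel, conf in scored if conf >= prediction_threshold]
    if multi_label or not passing:
        # Keep every relation that clears the threshold (possibly none).
        return passing
    # Assumed single-label behaviour: keep only the highest-scoring relation.
    return [max(passing, key=lambda item: item[1])]


# Invented scores for a single DRUG-PROBLEM pair.
scored = [("ADE", 0.12), ("IMPROVE", 0.98), ("CURE", 0.71)]

single = select_relations(scored, prediction_threshold=0.5, multi_label=False)
multi = select_relations(scored, prediction_threshold=0.5, multi_label=True)
```

In this sketch the single-label call keeps only IMPROVE, while the multi-label call keeps both IMPROVE and CURE; ADE falls below the threshold in either case.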

For available pretrained models please see the Models Hub.

Input Annotator Types: CHUNK, DOCUMENT

Output Annotator Type: CATEGORY

Python API: ZeroShotRelationExtractionModel Scala API: ZeroShotRelationExtractionModel Notebook: ZeroShotRelationExtractionModelNotebook

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_clinical = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_clinical")

ner_clinical_converter = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_clinical"])\
    .setOutputCol("ner_clinical_chunks")\
    .setWhiteList(["PROBLEM", "TEST"])  # PROBLEM-TEST-TREATMENT

ner_posology = medical.NerModel.pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_posology")

ner_posology_converter = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_posology"])\
    .setOutputCol("ner_posology_chunks")\
    .setWhiteList(["DRUG"]) # DRUG-FREQUENCY-DOSAGE-DURATION-FORM-ROUTE-STRENGTH

chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("ner_clinical_chunks", "ner_posology_chunks")\
    .setOutputCol("merged_ner_chunks")

## ZERO-SHOT RE Starting...

pos_tagger = nlp.PerceptronModel().pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["document", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

re_ner_chunk_filter = medical.RENerChunksFilter().setRelationPairs(["problem-test", "problem-drug"])\
    .setMaxSyntacticDistance(4)\
    .setDocLevelRelations(False)\
    .setInputCols(["merged_ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")

re_model = medical.ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models")\
    .setInputCols(["re_ner_chunks", "sentences"])\
    .setOutputCol("relations")\
    .setMultiLabel(True)\
    .setRelationalCategories(
        {
            "ADE": ["{DRUG} causes {PROBLEM}."],
            "IMPROVE": ["{DRUG} improves {PROBLEM}.", "{DRUG} cures {PROBLEM}."],
            "REVEAL": ["{TEST} reveals {PROBLEM}."],
        }
    )

pipeline = nlp.Pipeline(
    stages = [
        documenter,
        sentencer,
        tokenizer,
        words_embedder,
        ner_clinical,
        ner_clinical_converter,
        ner_posology,
        ner_posology_converter,
        chunk_merger,
        pos_tagger,
        dependency_parser,
        re_ner_chunk_filter,
        re_model
    ]
)

text = "Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer."

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['0']['hypothesis']").alias("hypothesis"),
    F.expr("cols['0']['nli_prediction']").alias("nli_prediction"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)

+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+------------------------------+--------------+--------+----------+
|sentence|entity1_begin|entity1_end|     chunk1|entity1|entity2_begin|entity2_end|  chunk2|entity2|                    hypothesis|nli_prediction|relation|confidence|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+------------------------------+--------------+--------+----------+
|       0|            0|         10|Paracetamol|   DRUG|           38|         45|sickness|PROBLEM|Paracetamol improves sickness.|        entail| IMPROVE|0.98819494|
|       0|            0|         10|Paracetamol|   DRUG|           26|         33|headache|PROBLEM|Paracetamol improves headache.|        entail| IMPROVE| 0.9929625|
|       1|           48|         58|An MRI test|   TEST|           80|         85|  cancer|PROBLEM|   An MRI test reveals cancer.|        entail|  REVEAL| 0.9760039|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+------------------------------+--------------+--------+----------+
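
The `hypothesis` column above reads as a relation template with its placeholders replaced by the matched chunk texts. The following is a minimal sketch of that substitution in plain Python. It only illustrates how the templates passed to `setRelationalCategories` relate to the output and is not the library's internal implementation; `fill_template` is a hypothetical helper.

```python
# Hypothetical helper: substitute chunk texts into a relation template.
# This mirrors how the `hypothesis` column reads; it is not Spark NLP code.
def fill_template(template: str, chunks: dict) -> str:
    # Placeholders such as {DRUG} are named after NER labels.
    return template.format(**chunks)

relation_templates = {
    "ADE": ["{DRUG} causes {PROBLEM}."],
    "IMPROVE": ["{DRUG} improves {PROBLEM}.", "{DRUG} cures {PROBLEM}."],
    "REVEAL": ["{TEST} reveals {PROBLEM}."],
}

# One candidate entity pair taken from the example sentence.
pair = {"DRUG": "Paracetamol", "PROBLEM": "sickness"}

hypotheses = []
for relation, templates in relation_templates.items():
    for template in templates:
        try:
            hypotheses.append((relation, fill_template(template, pair)))
        except KeyError:
            # The template needs a label this pair does not provide (e.g. TEST).
            continue

for relation, hypothesis in hypotheses:
    print(relation, "->", hypothesis)
```

Of the candidate hypotheses built this way for the drug and problem pairs, the output above reports only the "improves" one with an `entail` prediction.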
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_financial_small", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

re_model = finance.ZeroShotRelationExtractionModel.pretrained("finre_zero_shot", "en", "finance/models")\
    .setInputCols(["ner_chunk", "sentence"])\
    .setOutputCol("relations")\
    .setMultiLabel(False)\
    .setRelationalCategories(
        {
            "profit_decline_by": [
                "{PROFIT_DECLINE} decreased by {AMOUNT} from",
                "{PROFIT_DECLINE} decreased by {AMOUNT} to",
            ],
            "profit_decline_by_per": [
                "{PROFIT_DECLINE} decreased by a {PERCENTAGE} from",
                "{PROFIT_DECLINE} decreased by a {PERCENTAGE} to",
            ],
            "profit_decline_from": [
                "{PROFIT_DECLINE} decreased from {AMOUNT}",
                "{PROFIT_DECLINE} decreased from {AMOUNT} for the year",
            ],
            "profit_decline_from_per": [
                "{PROFIT_DECLINE} decreased from {PERCENTAGE} to",
                "{PROFIT_DECLINE} decreased from {PERCENTAGE} to a total of",
            ],
            "profit_decline_to": ["{PROFIT_DECLINE} to {AMOUNT}"],
            "profit_increase_from": ["{PROFIT_INCREASE} from {AMOUNT}"],
            "profit_increase_to": ["{PROFIT_INCREASE} to {AMOUNT}"],
            "expense_decrease_by": ["{EXPENSE_DECREASE} decreased by {AMOUNT}"],
            "expense_decrease_by_per": ["{EXPENSE_DECREASE} decreased by a {PERCENTAGE}"],
            "expense_decrease_from": ["{EXPENSE_DECREASE} decreased from {AMOUNT}"],
            "expense_decrease_to": [
                "{EXPENSE_DECREASE} for a total of {AMOUNT} for the fiscal year"
            ],
            "has_date": [
                "{AMOUNT} for the fiscal year ended {FISCAL_YEAR}",
                "{PERCENTAGE} for the fiscal year ended {FISCAL_YEAR}",
            ],
        }
    )

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        re_model,
    ]
)

text = """License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019.  Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['0']['hypothesis']").alias("hypothesis"),
    F.expr("cols['0']['nli_prediction']").alias("nli_prediction"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)

+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
|sentence|entity1_begin|entity1_end|                      chunk1|         entity1|entity2_begin|entity2_end|           chunk2|    entity2|                                              hypothesis|nli_prediction|             relation|confidence|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
|       1|          227|        238|                25.6 million|          AMOUNT|          316|        332|December 31, 2019|FISCAL_YEAR|25.6 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.8744757|
|       0|           31|         32|                          40|      PERCENTAGE|          153|        169|December 31, 2019|FISCAL_YEAR|          40 for the fiscal year ended December 31, 2019|        entail|             has_date| 0.7889032|
|       5|          799|        826|Sales and marketing expenses|EXPENSE_DECREASE|          923|        933|      7.5 million|     AMOUNT| Sales and marketing expenses decreased from 7.5 million|        entail|expense_decrease_from| 0.9770538|
|       0|           59|         69|                 0.7 million|          AMOUNT|           90|        106|December 31, 2020|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2020|        entail|             has_date|0.67187774|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          227|        238|     25.6 million|     AMOUNT|                        Services revenue to 25.6 million|        entail|   profit_increase_to| 0.9674029|
|       0|           31|         32|                          40|      PERCENTAGE|           90|        106|December 31, 2020|FISCAL_YEAR|          40 for the fiscal year ended December 31, 2020|        entail|             has_date|0.77800345|
|       5|          838|        839|                          20|      PERCENTAGE|          898|        914|December 31, 2020|FISCAL_YEAR|          20 for the fiscal year ended December 31, 2020|        entail|             has_date|0.85455483|
|       3|          561|        572|                travel costs|EXPENSE_DECREASE|          579|        589|      0.4 million|     AMOUNT|                   travel costs decreased by 0.4 million|        entail|  expense_decrease_by| 0.9946776|
|       0|           42|         52|                 0.5 million|          AMOUNT|          153|        169|December 31, 2019|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.7756689|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          209|        219|      1.1 million|     AMOUNT|                       Services revenue from 1.1 million|        entail| profit_increase_from|0.96610945|
|       2|          408|        418|                 0.1 million|          AMOUNT|          521|        537|December 31, 2019|FISCAL_YEAR| 0.1 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.9083247|
|       5|          849|        859|                 1.5 million|          AMOUNT|          898|        914|December 31, 2020|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2020|        entail|             has_date| 0.7528142|
|       5|          849|        859|                 1.5 million|          AMOUNT|          954|        970|December 31, 2019|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2019|        entail|             has_date|0.80734617|
|       0|           42|         52|                 0.5 million|          AMOUNT|           90|        106|December 31, 2020|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2020|        entail|             has_date| 0.7157578|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          284|        295|     24.5 million|     AMOUNT|                        Services revenue to 24.5 million|        entail|   profit_increase_to| 0.8597209|
|       0|           59|         69|                 0.7 million|          AMOUNT|          153|        169|December 31, 2019|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2019|        entail|             has_date|0.74845695|
|       1|          199|        199|                           4|      PERCENTAGE|          259|        275|December 31, 2020|FISCAL_YEAR|           4 for the fiscal year ended December 31, 2020|        entail|             has_date|0.84127575|
|       2|          424|        424|                           2|      PERCENTAGE|          465|        481|December 31, 2020|FISCAL_YEAR|           2 for the fiscal year ended December 31, 2020|        entail|             has_date| 0.8046481|
|       2|          424|        424|                           2|      PERCENTAGE|          521|        537|December 31, 2019|FISCAL_YEAR|           2 for the fiscal year ended December 31, 2019|        entail|             has_date| 0.8485104|
|       0|            0|         19|        License fees revenue|  PROFIT_DECLINE|           31|         32|               40| PERCENTAGE|               License fees revenue decreased by a 40 to|        entail|profit_decline_by_per| 0.9948003|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
only showing top 20 rows
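
The confidence values above vary considerably (roughly 0.67 to 0.99), so a downstream step may want to keep only the stronger relations. The sketch below shows one way to do that in plain Python on a few rows copied from the output. The threshold `MIN_CONFIDENCE` and the helper `keep_confident` are assumptions made for illustration, not part of the library.

```python
# A few rows copied from the output above. Annotation metadata values are
# strings in Spark NLP, so the confidence has to be converted before comparing.
rows = [
    {"chunk1": "travel costs", "chunk2": "0.4 million",
     "relation": "expense_decrease_by", "confidence": "0.9946776"},
    {"chunk1": "0.7 million", "chunk2": "December 31, 2020",
     "relation": "has_date", "confidence": "0.67187774"},
    {"chunk1": "Services revenue", "chunk2": "25.6 million",
     "relation": "profit_increase_to", "confidence": "0.9674029"},
    {"chunk1": "Services revenue", "chunk2": "24.5 million",
     "relation": "profit_increase_to", "confidence": "0.8597209"},
]

MIN_CONFIDENCE = 0.9  # assumed cut-off, chosen only for illustration


def keep_confident(rows, threshold):
    """Return the rows whose confidence is at least `threshold`."""
    return [row for row in rows if float(row["confidence"]) >= threshold]


confident = keep_confident(rows, MIN_CONFIDENCE)
for row in confident:
    print(row["relation"], "|", row["chunk1"], "|", row["chunk2"], "|", row["confidence"])
```

With this cut-off, the lower-scoring pairing of "Services revenue" with the prior-year amount (24.5 million) is dropped while the pairing with 25.6 million is kept.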
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

tokenClassifier = legal.BertForTokenClassification.pretrained('legner_obligations','en', 'legal/models')\
    .setInputCols("token", "document")\
    .setOutputCol("ner")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

ner_converter = legal.NerConverterInternal()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

re_model = legal.ZeroShotRelationExtractionModel.pretrained("legre_zero_shot", "en", "legal/models")\
    .setInputCols(["ner_chunk", "document"]) \
    .setOutputCol("relations")

re_model.setRelationalCategories({
    "GRANTS_TO": ["{OBLIGATION_SUBJECT} grants {OBLIGATION_INDIRECT_OBJECT}"],
    "GRANTS": ["{OBLIGATION_SUBJECT} grants {OBLIGATION_ACTION}"]
})

pipeline = nlp.Pipeline(stages = [
    documentAssembler,
    tokenizer,
    tokenClassifier,
    ner_converter,
    re_model
])

text = """Arizona Copyright Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Copyrights for use in the Company Field throughout the world."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['0']['hypothesis']").alias("hypothesis"),
    F.expr("cols['0']['nli_prediction']").alias("nli_prediction"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)

+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
|sentence|entity1_begin|entity1_end| chunk1|           entity1|entity2_begin|entity2_end|       chunk2|                   entity2|                  hypothesis|nli_prediction| relation|confidence|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
|       0|           80|         86|Arizona|OBLIGATION_SUBJECT|          109|        115|      Company|OBLIGATION_INDIRECT_OBJECT|      Arizona grants Company|        entail|GRANTS_TO| 0.9535338|
|       0|           80|         86|Arizona|OBLIGATION_SUBJECT|           88|        100|hereby grants|         OBLIGATION_ACTION|Arizona grants hereby grants|        entail|   GRANTS| 0.9873099|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
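
In the output above, the `entity1_begin`/`entity1_end` and `entity2_begin`/`entity2_end` columns behave as inclusive character offsets into the input text. A short check in plain Python, using the same `text` as the example, makes this concrete; `chunk_at` is a hypothetical helper introduced only for this sketch.

```python
text = """Arizona Copyright Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Copyrights for use in the Company Field throughout the world."""


def chunk_at(text: str, begin: int, end: int) -> str:
    # `end` is inclusive in the output table, so add 1 for Python slicing.
    return text[begin:end + 1]


# (begin, end, expected chunk) triples copied from the output table.
spans = [
    (80, 86, "Arizona"),
    (109, 115, "Company"),
    (88, 100, "hereby grants"),
]

for begin, end, expected in spans:
    recovered = chunk_at(text, begin, end)
    print(begin, end, repr(recovered), recovered == expected)
```

The same slicing can be used to highlight the related chunks in the original document.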

import spark.implicits._

val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentencer = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentences")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentences"))
    .setOutputCol("tokens")

val wordsEmbedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("embeddings")

val nerClinical = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_clinical")

val nerClinicalConverter = new NerConverterInternal()
    .setInputCols(Array("sentences", "tokens", "ner_clinical"))
    .setOutputCol("ner_clinical_chunks")
    .setWhiteList(Array("PROBLEM", "TEST"))

val nerPosology = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_posology")

val nerPosologyConverter = new NerConverterInternal()
    .setInputCols(Array("sentences", "tokens", "ner_posology"))
    .setOutputCol("ner_posology_chunks")
    .setWhiteList(Array("DRUG"))

val chunkMerger = new ChunkMergeApproach()
    .setInputCols(Array("ner_clinical_chunks", "ner_posology_chunks"))
    .setOutputCol("merged_ner_chunks")

val posTagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("pos_tags")

val dependencyParser = DependencyParserModel.pretrained("dependency_conllu", "en")
    .setInputCols(Array("document", "pos_tags", "tokens"))
    .setOutputCol("dependencies")

val reNerChunkFilter = new RENerChunksFilter()
    .setRelationPairs(Array("problem-test", "problem-drug"))
    .setMaxSyntacticDistance(4)
    .setDocLevelRelations(false)
    .setInputCols(Array("merged_ner_chunks", "dependencies"))
    .setOutputCol("re_ner_chunks")

val reModel = ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models")
    .setInputCols(Array("re_ner_chunks", "sentences"))
    .setOutputCol("relations")
    .setMultiLabel(true)
    .setRelationalCategories(Map(
        "ADE" -> Array("{DRUG} causes {PROBLEM}."),
        "IMPROVE" -> Array("{DRUG} improves {PROBLEM}.", "{DRUG} cures {PROBLEM}."),
        "REVEAL" -> Array("{TEST} reveals {PROBLEM}.")
    ))

val pipeline = new Pipeline().setStages(Array(
    documenter,
    sentencer,
    tokenizer,
    wordsEmbedder,
    nerClinical,
    nerClinicalConverter,
    nerPosology,
    nerPosologyConverter,
    chunkMerger,
    posTagger,
    dependencyParser,
    reNerChunkFilter,
    reModel
))

val text = "Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer."

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)
 
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+--------------------+--------------+--------+----------+
|sentence|entity1_begin|entity1_end|     chunk1|entity1|entity2_begin|entity2_end|  chunk2|entity2|          hypothesis|nli_prediction|relation|confidence|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+--------------------+--------------+--------+----------+
|       0|            0|         10|Paracetamol|   DRUG|           38|         45|sickness|PROBLEM|Paracetamol impro...|        entail| IMPROVE|0.98819494|
|       0|            0|         10|Paracetamol|   DRUG|           26|         33|headache|PROBLEM|Paracetamol impro...|        entail| IMPROVE| 0.9929625|
|       1|           48|         58|An MRI test|   TEST|           80|         85|  cancer|PROBLEM|An MRI test revea...|        entail|  REVEAL| 0.9760039|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+--------------------+--------------+--------+----------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val nerModel = FinanceNerModel.pretrained("finner_financial_small", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val reModel = ZeroShotRelationExtractionModel.pretrained("finre_zero_shot", "en", "finance/models")
  .setInputCols(Array("ner_chunk", "sentence"))
  .setOutputCol("relations")
  .setMultiLabel(false)
  .setRelationalCategories(Map(
  "profit_decline_by" -> Array(
    "{PROFIT_DECLINE} decreased by {AMOUNT} from",
    "{PROFIT_DECLINE} decreased by {AMOUNT} to"
  ),
  "profit_decline_by_per" -> Array(
    "{PROFIT_DECLINE} decreased by a {PERCENTAGE} from",
    "{PROFIT_DECLINE} decreased by a {PERCENTAGE} to"
  ),
  "profit_decline_from" -> Array(
    "{PROFIT_DECLINE} decreased from {AMOUNT}",
    "{PROFIT_DECLINE} decreased from {AMOUNT} for the year"
  ),
  "profit_decline_from_per" -> Array(
    "{PROFIT_DECLINE} decreased from {PERCENTAGE} to",
    "{PROFIT_DECLINE} decreased from {PERCENTAGE} to a total of"
  ),
  "profit_decline_to" -> Array("{PROFIT_DECLINE} to {AMOUNT}"),
  "profit_increase_from" -> Array("{PROFIT_INCREASE} from {AMOUNT}"),
  "profit_increase_to" -> Array("{PROFIT_INCREASE} to {AMOUNT}"),
  "expense_decrease_by" -> Array("{EXPENSE_DECREASE} decreased by {AMOUNT}"),
  "expense_decrease_by_per" -> Array("{EXPENSE_DECREASE} decreased by a {PERCENTAGE}"),
  "expense_decrease_from" -> Array("{EXPENSE_DECREASE} decreased from {AMOUNT}"),
  "expense_decrease_to" -> Array("{EXPENSE_DECREASE} for a total of {AMOUNT} for the fiscal year"),
  "has_date" -> Array(
    "{AMOUNT} for the fiscal year ended {FISCAL_YEAR}",
    "{PERCENTAGE} for the fiscal year ended {FISCAL_YEAR}"
  )
))

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    nerModel,
    nerConverter,
    reModel
  ))

val text = """License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019.  Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
|sentence|entity1_begin|entity1_end|                      chunk1|         entity1|entity2_begin|entity2_end|           chunk2|    entity2|                                              hypothesis|nli_prediction|             relation|confidence|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
|       1|          227|        238|                25.6 million|          AMOUNT|          316|        332|December 31, 2019|FISCAL_YEAR|25.6 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.8744757|
|       0|           31|         32|                          40|      PERCENTAGE|          153|        169|December 31, 2019|FISCAL_YEAR|          40 for the fiscal year ended December 31, 2019|        entail|             has_date| 0.7889032|
|       5|          799|        826|Sales and marketing expenses|EXPENSE_DECREASE|          923|        933|      7.5 million|     AMOUNT| Sales and marketing expenses decreased from 7.5 million|        entail|expense_decrease_from| 0.9770538|
|       0|           59|         69|                 0.7 million|          AMOUNT|           90|        106|December 31, 2020|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2020|        entail|             has_date|0.67187774|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          227|        238|     25.6 million|     AMOUNT|                        Services revenue to 25.6 million|        entail|   profit_increase_to| 0.9674029|
|       0|           31|         32|                          40|      PERCENTAGE|           90|        106|December 31, 2020|FISCAL_YEAR|          40 for the fiscal year ended December 31, 2020|        entail|             has_date|0.77800345|
|       5|          838|        839|                          20|      PERCENTAGE|          898|        914|December 31, 2020|FISCAL_YEAR|          20 for the fiscal year ended December 31, 2020|        entail|             has_date|0.85455483|
|       3|          561|        572|                travel costs|EXPENSE_DECREASE|          579|        589|      0.4 million|     AMOUNT|                   travel costs decreased by 0.4 million|        entail|  expense_decrease_by| 0.9946776|
|       0|           42|         52|                 0.5 million|          AMOUNT|          153|        169|December 31, 2019|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.7756689|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          209|        219|      1.1 million|     AMOUNT|                       Services revenue from 1.1 million|        entail| profit_increase_from|0.96610945|
|       2|          408|        418|                 0.1 million|          AMOUNT|          521|        537|December 31, 2019|FISCAL_YEAR| 0.1 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.9083247|
|       5|          849|        859|                 1.5 million|          AMOUNT|          898|        914|December 31, 2020|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2020|        entail|             has_date| 0.7528142|
|       5|          849|        859|                 1.5 million|          AMOUNT|          954|        970|December 31, 2019|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2019|        entail|             has_date|0.80734617|
|       0|           42|         52|                 0.5 million|          AMOUNT|           90|        106|December 31, 2020|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2020|        entail|             has_date| 0.7157578|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          284|        295|     24.5 million|     AMOUNT|                        Services revenue to 24.5 million|        entail|   profit_increase_to| 0.8597209|
|       0|           59|         69|                 0.7 million|          AMOUNT|          153|        169|December 31, 2019|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2019|        entail|             has_date|0.74845695|
|       1|          199|        199|                           4|      PERCENTAGE|          259|        275|December 31, 2020|FISCAL_YEAR|           4 for the fiscal year ended December 31, 2020|        entail|             has_date|0.84127575|
|       2|          424|        424|                           2|      PERCENTAGE|          465|        481|December 31, 2020|FISCAL_YEAR|           2 for the fiscal year ended December 31, 2020|        entail|             has_date| 0.8046481|
|       2|          424|        424|                           2|      PERCENTAGE|          521|        537|December 31, 2019|FISCAL_YEAR|           2 for the fiscal year ended December 31, 2019|        entail|             has_date| 0.8485104|
|       0|            0|         19|        License fees revenue|  PROFIT_DECLINE|           31|         32|               40| PERCENTAGE|               License fees revenue decreased by a 40 to|        entail|profit_decline_by_per| 0.9948003|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
only showing top 20 rows
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val tokenClassifier = LegalBertForTokenClassification.pretrained("legner_obligations", "en", "legal/models")
    .setInputCols(Array("token", "document"))
    .setOutputCol("ner")
    .setMaxSentenceLength(512)
    .setCaseSensitive(true)

val nerConverter = new NerConverterInternal()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val reModel = ZeroShotRelationExtractionModel.pretrained("legre_zero_shot", "en", "legal/models")
    .setInputCols(Array("ner_chunk", "document"))
    .setOutputCol("relations")

reModel.setRelationalCategories(Map(
    "GRANTS_TO" -> Array("{OBLIGATION_SUBJECT} grants {OBLIGATION_INDIRECT_OBJECT}"),
    "GRANTS" -> Array("{OBLIGATION_SUBJECT} grants {OBLIGATION_ACTION}")
))

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    tokenizer,
    tokenClassifier,
    nerConverter,
    reModel
))

val text = """Arizona Copyright Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Copyrights for use in the Company Field throughout the world."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)


+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
|sentence|entity1_begin|entity1_end| chunk1|           entity1|entity2_begin|entity2_end|       chunk2|                   entity2|                  hypothesis|nli_prediction| relation|confidence|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
|       0|           80|         86|Arizona|OBLIGATION_SUBJECT|          109|        115|      Company|OBLIGATION_INDIRECT_OBJECT|      Arizona grants Company|        entail|GRANTS_TO| 0.9535338|
|       0|           80|         86|Arizona|OBLIGATION_SUBJECT|           88|        100|hereby grants|         OBLIGATION_ACTION|Arizona grants hereby grants|        entail|   GRANTS| 0.9873099|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+