Enterprise NLP Annotators

 

A Spark NLP Enterprise license includes access to unique annotators. At the Spark NLP Workshop you can see different types of annotators in action.

By clicking on any annotator, you will see different sections:

  • The Approach, the class used to train models.
  • The Model, used to run inference with pretrained models.

Also, for most of the annotators, you will find examples for the different enterprise libraries:

  • Healthcare NLP
  • Finance NLP
  • Legal NLP

Check out the Spark NLP Annotators page for more information on how to read this page.

Available Annotators

Annotators Description
AnnotationMerger Merge annotations from different pipeline steps that have the same annotation type into a unified annotation.
AssertionChunkConverter Uses both the begin and end indices of the tokens as input to add more robust metadata to the chunk column, in a way that improves the reliability of the indices and avoids loss of data.
AssertionDL A deep learning-based approach used to extract Assertion Status from extracted entities and text.
AssertionFilterer Filters entities coming from ASSERTION type annotations and returns the CHUNKS.
AssertionLogReg Logistic Regression is used to extract Assertion Status from extracted entities and text.
AverageEmbeddings Computes the mean of vector embeddings for two sentences of equal size, producing a unified representation.
BertForSequenceClassification Can load Bert Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.
BertForTokenClassifier Can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) for Named-Entity-Recognition (NER) tasks.
BertSentenceChunkEmbeddings This annotator combines sentence and NER chunk embeddings to enhance resolution codes, leveraging contextual information in the embeddings for more precise results. It takes sentence context and NER chunks as input and produces embeddings for each chunk, facilitating input for the resolution model.
Chunk2Token A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).
ChunkConverter This annotator merges NER-detected entities with RegexMatcher-based rules for unified processing in the pipeline.
ChunkEntityResolver Returns a normalized entity for a particular trained ontology / curated dataset (e.g. clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc).
ChunkFilterer Filters entities coming from CHUNK annotations.
ChunkKeyPhraseExtraction Uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text.
ChunkMapper Maps entities to their associated code/reference based on pre-defined dictionaries.
ChunkMapperFilterer Annotator to be used after ChunkMapper that allows filtering chunks based on the results of the mapping, whether it was successful or failed.
ChunkMerge Merges entities coming from different CHUNK annotations.
ChunkSentenceSplitter Splits documents into chunks according to separators given as CHUNK columns. It is useful when you need to apply different models or analyses to different sections of your document.
ContextualParser Extracts entity from a document based on user defined rules.
DateNormalizer Transforms date mentions to a common standard format: YYYY/MM/DD. It is useful when using data from different sources, sometimes from different countries, that have different formats to represent dates.
DeIdentification Deidentifies Input Annotations of types DOCUMENT, TOKEN and CHUNK, by either masking or obfuscating the given CHUNKS.
DistilBertForSequenceClassification Can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.
Doc2ChunkInternal Converts DOCUMENT, TOKEN typed annotations into CHUNK type with the contents of a chunkCol.
DocMapper Uses the text representation of document annotations to map clinical codes to other codes or relevant information.
DocumentFiltererByClassifier Filters documents based on classifier results, using white and black lists to allow or block specific outcomes. Matching can be case-sensitive or case-insensitive for broader matching.
DocumentHashCoder This annotator swaps dates in a document column with hash codes from another column, creating a new column with shifted day information. The subsequent DeIdentification annotator anonymizes the document, incorporating the altered dates.
DocumentLogRegClassifier Classifies documents with a Logistic Regression algorithm.
DocumentMLClassifier Classifies documents with a Logistic Regression algorithm.
DrugNormalizer Annotator which normalizes raw text from documents, e.g. scraped web pages or xml documents.
EntityChunkEmbeddings Uses BERT Sentence embeddings to compute a weighted average vector representation of related entity chunks.
EntityRulerInternal Matches exact strings or regex patterns provided in a file against a Document and assigns them a named entity.
FeaturesAssembler Collects features from different columns.
FewShotClassifier Specifically targets few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data.
Flattener Flattener annotator in Spark NLP converts annotation results into a simplified DataFrame format for easier analysis and interpretation.
GenericClassifier Creates a generic single-label classifier which uses pre-generated Tensorflow graphs.
GenericLogRegClassifier A derivative of GenericClassifier that implements multinomial logistic regression.
GenericSVMClassifier Creates a generic single-label classifier which uses pre-generated Tensorflow graphs.
InternalDocumentSplitter Splits large documents into smaller documents.
IOBTagger Merges token tags and NER labels from chunks in the specified format.
NameChunkObfuscator Allows transforming a dataset with an input annotation of type CHUNK into its obfuscated version by obfuscating the given CHUNKS.
NerChunker Extracts phrases that fit into a known pattern using the NER tags.
NerConverterInternal Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label.
NerDisambiguator Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB).
NerModel This Named Entity recognition annotator is a generic NER model based on Neural Networks.
NerQuestionGenerator Takes an NER chunk (obtained by, e.g., NerConverterInternal) and generates a question based on two entity types, a pronoun, and a strategy.
QuestionAnswering GPT-based model for answering questions given a context.
RegexMatcherInternal RegexMatcherInternal matches predefined regex patterns with entities in text, allowing for flexible entity recognition based on user-defined rules.
ReIdentification Reidentifies obfuscated entities by DeIdentification.
RelationExtraction Extracts and classifies instances of relations between named entities.
RelationExtractionDL Extracts and classifies instances of relations between named entities.
RENerChunksFilter Filters and outputs combinations of relations between extracted entities, for further processing.
Replacer Allows replacing entities in the original text with the ones extracted by the NameChunkObfuscatorApproach or DateNormalizer annotators.
Resolution2Chunk This annotator is responsible for converting the annotations generated by entity resolver models (typically labeled as ENTITY) into a format compatible with subsequent stages of the pipeline, such as the ChunkMapperModel.
ResolverMerger Provides the ability to merge sentence entity resolver and chunk mapper model output columns.
Router Provides the ability to split the output of an annotator based on a selected metadata field and the value of that field.
SentenceEntityResolver Returns the normalized entity for a particular trained ontology / curated dataset (e.g. clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc) based on sentence embeddings.
Summarizer Helps to quickly summarize complex medical information.
TextGenerator Uses the basic BioGPT model to perform various tasks related to medical text abstraction.
TextMatcherInternal Matches exact phrases provided in a file against a Document.
TFGraphBuilder Creates Tensorflow graphs.
WindowedSentenceModel Helps you merge the previous and following sentences of a given piece of text, adding the surrounding context.
ZeroShotNerModel This is a zero-shot named entity recognition using RoBertaForQuestionAnswering. It identifies entities across diverse data without domain-specific fine-tuning.
ZeroShotRelationExtractionModel This annotator implements zero-shot binary relations extraction by utilizing BERT transformer models trained on the NLI (Natural Language Inference) task.

AnnotationMerger

Model

Merge annotations from different pipeline steps that have the same annotation type into a unified annotation. Possible annotations that can be merged include:

  • document (e.g., output of DocumentAssembler annotator)
  • token (e.g., output of Tokenizer annotator)
  • word_embeddings (e.g., output of WordEmbeddingsModel annotator)
  • sentence_embeddings (e.g., output of BertSentenceEmbeddings annotator)
  • category (e.g., output of RelationExtractionModel annotator)
  • date (e.g., output of DateMatcher annotator)
  • sentiment (e.g., output of SentimentDLModel annotator)
  • pos (e.g., output of PerceptronModel annotator)
  • chunk (e.g., output of NerConverter annotator)
  • named_entity (e.g., output of NerDLModel annotator)
  • regex (e.g., output of RegexTokenizer annotator)
  • dependency (e.g., output of DependencyParserModel annotator)
  • language (e.g., output of LanguageDetectorDL annotator)
  • keyword (e.g., output of YakeModel annotator)

Parameters:

  • inputType: The type of the annotations that you want to merge. It must match one of the annotation types listed above (e.g., document, chunk, category). A minimal configuration sketch follows.
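
For example, a minimal sketch that merges two upstream CHUNK columns into one. The column names here (ner_chunks, regex_chunks) are hypothetical placeholders for whatever your earlier pipeline stages actually produce:

from johnsnowlabs import medical

# Assumed upstream columns of type CHUNK (e.g. from NerConverterInternal and RegexMatcherInternal)
chunk_annotation_merger = medical.AnnotationMerger()\
    .setInputCols(["ner_chunks", "regex_chunks"])\
    .setInputType("chunk")\
    .setOutputCol("merged_chunks")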

Input Annotator Types: ANY

Output Annotator Type: ANY

Python API: AnnotationMerger Scala API: AnnotationMerger Notebook: AnnotationMerger
Show Example

from johnsnowlabs import nlp, medical

# Create the pipeline with two RE models
documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

pos_ner_tagger = medical.NerModel()\
    .pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_pos")

pos_ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_pos"])\
    .setOutputCol("pos_ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

pos_reModel = medical.RelationExtractionModel()\
    .pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "pos_ner_chunks", "dependencies"])\
    .setOutputCol("pos_relations")\
    .setMaxSyntacticDistance(4)

ade_ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ade_ner_tags")  

ade_ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ade_ner_tags"])\
    .setOutputCol("ade_ner_chunks")

ade_reModel = medical.RelationExtractionModel()\
    .pretrained("re_ade_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ade_ner_chunks", "dependencies"])\
    .setOutputCol("ade_relations")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade, ade-drug"])

annotation_merger = medical.AnnotationMerger()\
    .setInputCols("ade_relations", "pos_relations")\
    .setInputType("category")\
    .setOutputCol("all_relations")

merger_pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    pos_ner_tagger,
    pos_ner_chunker,
    dependency_parser,
    pos_reModel,
    ade_ner_tagger,
    ade_ner_chunker,
    ade_reModel,
    annotation_merger
])

# Show example result
text = """
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands.. 
"""
data = spark.createDataFrame([[text]]).toDF("text")

result = merger_pipeline.fit(data).transform(data)
result.selectExpr("pos_relations.result as PosologyRelation", 
                  "ade_relations.result as AdeRelation", 
                  "all_relations.result as MergedRelation").show(truncate=False)

+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|PosologyRelation                                         |AdeRelation|MergedRelation                                                 |
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|[DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|[1, 1]     |[1, 1, DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
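
To get one row per merged relation from the example above, a small post-processing sketch like the following can be used. The metadata keys (entity1, chunk1, entity2, chunk2) are the ones typically emitted by the relation extraction models, so adjust them if your output differs:

from pyspark.sql import functions as F

# Flatten the merged relation annotations into one row per relation
result.select(F.explode("all_relations").alias("rel")).select(
    F.col("rel.result").alias("relation"),
    F.col("rel.metadata")["entity1"].alias("entity1"),
    F.col("rel.metadata")["chunk1"].alias("chunk1"),
    F.col("rel.metadata")["entity2"].alias("entity2"),
    F.col("rel.metadata")["chunk2"].alias("chunk2"),
).show(truncate=False)
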
from johnsnowlabs import nlp, finance

# Create the pipeline with two RE models
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = finance.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_dates")

ner_converter_date = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_dates"])\
    .setOutputCol("ner_chunk_date")

ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_orgs")

ner_converter_org = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_orgs"])\
    .setOutputCol("ner_chunk_org")\

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols('ner_chunk_org', "ner_chunk_date")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE"])\
    .setMaxSyntacticDistance(10)

reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_acq")\
    .setPredictionThreshold(0.1)

reDL_alias = finance.RelationExtractionDLModel().pretrained('finre_org_prod_alias', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_alias")\
    .setPredictionThreshold(0.1)

annotation_merger = finance.AnnotationMerger()\
    .setInputCols("relations_acq", "relations_alias")\
    .setOutputCol("relations")\
    .setInputType("category")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_date,
        ner_converter_date,
        ner_model_org,
        ner_converter_org,
        chunk_merger,
        pos,
        dependency_parser,
        re_filter,
        reDL,
        reDL_alias,
        annotation_merger])


# Show example result
text ="""Definite-lived intangible assets acquired with Cadence’s fiscal 2021 acquisitions were as follows:
 
Acquisition Date Fair Value
Weighted Average Amortization Period
 
(In thousands)
 (in years)
Existing technology
$
59,100 
13.7 years
Agreements and relationships
28,900 
13.7 years
Tradenames, trademarks and patents
4,600 
14.3 years
Total acquired intangibles with definite lives
$
92,600 
13.7 years
2020 Acquisitions
In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation ("AWR") and Integrand Software, Inc. ("Integrand"). These acquisitions enhanced Cadence’s technology portfolio to address growing radio frequency design activity, driven by expanding use of 5G communications.
The aggregate cash consideration for these acquisitions was $195.6 million, after taking into account cash acquired of $1.5 million. The total purchase consideration was allocated to the assets acquired and liabilities assumed based on their respective estimated fair values on the acquisition dates. Cadence will also make payments to certain employees, subject to continued employment and other performance-based conditions, through the first quarter of fiscal 2023.
 With its acquisitions of AWR and Integrand, Cadence recorded $101.3 million of definite-lived intangible assets with a weighted average amortization period of approximately nine years. The definite-lived intangible assets related primarily to existing technology and customer agreements and relationships. Cadence also recorded $119.4 million of goodwill and $25.1 million of net liabilities, consisting primarily of deferred tax liabilities, assumed deferred revenue and trade accounts receivable. The recorded goodwill was primarily related to the acquired assembled workforce and expected synergies from combining operations of the acquired companies with Cadence. None of the goodwill related to the acquisitions of AWR and Integrand is deductible for tax purposes.
Cadence completed one additional acquisition during fiscal 2020 that was not material to the consolidated financial statements. 
Pro Forma Financial Information
Cadence has not presented pro forma financial information for any of the businesses it acquired during fiscal 2021 and fiscal 2020 because the results of operations for these businesses are not material to Cadence’s consolidated financial statements.
Acquisition-Related Transaction Costs
Transaction costs associated with acquisitions, which consist of professional fees and administrative costs, were not material during fiscal 2021, 2020 or 2019 and were expensed as incurred in Cadence’s consolidated income statements.
NOTE 7. GOODWILL AND ACQUIRED INTANGIBLES
Goodwill
The changes in the carrying amount of goodwill during fiscal 2021 and 2020 were as follows:
 
Gross CarryingAmount
 
(In thousands)
Balance as of December 28, 2019
$
661,856 
Goodwill resulting from acquisitions
120,564 
Effect of foreign currency translation
(333)
Balance as of January 2, 2021
782,087 
Goodwill resulting from acquisitions
154,362 
Effect of foreign currency translation
(8,091)
Balance as of January 1, 2022
$
928,358 
Cadence completed its annual goodwill impairment test during the third quarter of fiscal 2021 and determined that the fair value of Cadence’s single reporting unit exceeded the carrying amount of its net assets and that no impairment existed.
65"""

data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)


# Show the results 
result.selectExpr("relations_acq.result as AcqRelation", 
                  "relations_alias.result as AliasRelation", 
                  "relations.result as MergedRelation").show(truncate=False)

+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|AcqRelation                                                                                     |AliasRelation                                                                           |MergedRelation                                                                                                                                                                          |
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other]|[has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

# Create the pipeline with two RE models
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = legal.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model_date = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_parties")

ner_converter_date = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_parties"])\
    .setOutputCol("ner_chunk_parties")

ner_model_org= legal.NerModel.pretrained("legner_whereas_md", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_whereas")

ner_converter_org = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner_whereas"])\
    .setOutputCol("ner_chunk_whereas")\

chunk_merger = legal.ChunkMergeApproach()\
    .setInputCols('ner_chunk_whereas', "ner_chunk_parties")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = legal.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setMaxSyntacticDistance(10)

reDL = legal.RelationExtractionDLModel().pretrained("legre_contract_doc_parties_md", "en", "legal/models")\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_parties")\
    .setPredictionThreshold(0.1)

reDL_alias = legal.RelationExtractionDLModel().pretrained("legre_whereas", "en", "legal/models")\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relations_whereas")\
    .setPredictionThreshold(0.1)

annotation_merger = legal.AnnotationMerger()\
    .setInputCols("relations_parties", "relations_whereas")\
    .setOutputCol("relations")\
    .setInputType("category")

nlpPipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_date,
        ner_converter_date,
        ner_model_org,
        ner_converter_org,
        chunk_merger,
        pos,
        dependency_parser,
        re_filter,
        reDL,
        reDL_alias,
        annotation_merger])


# Show example result
text = """
WHEREAS, the Company Entities own certain Copyrights and Know-How which may be used in the Arizona Field, and in connection with the transactions contemplated by the Stock Purchase Agreement, Arizona desires to obtain a license from the Company Entities to use such Intellectual Property on the terms and subject to the conditions set forth herein.
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

# Show the results 
result.selectExpr("relations_parties.result as PartiesRelation", 
                  "relations_whereas.result as WhereasRelation", 
                  "relations.result as MergedRelation").show(truncate=False)

+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|PartiesRelation              |WhereasRelation                       |MergedRelation                                                     |
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|[signed_by, other, signed_by]|[has_subject, has_subject, has_object]|[signed_by, other, signed_by, has_subject, has_subject, has_object]|
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+

import spark.implicits._

// Create the pipeline with two RE models
val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentences"))
  .setOutputCol("tokens")

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("embeddings")

val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens"))
  .setOutputCol("pos_tags")

val pos_ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ner_pos")

val pos_ner_chunker = new NerConverterInternal()
  .setInputCols(Array("sentences", "tokens", "ner_pos"))
  .setOutputCol("pos_ner_chunks")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentences", "pos_tags", "tokens"))
  .setOutputCol("dependencies")

val pos_reModel = RelationExtractionModel.pretrained("posology_re")
  .setInputCols(Array("embeddings", "pos_tags", "pos_ner_chunks", "dependencies"))
  .setOutputCol("pos_relations")
  .setMaxSyntacticDistance(4)

val ade_ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
  .setInputCols(Array("sentences", "tokens", "embeddings"))
  .setOutputCol("ade_ner_tags")

val ade_ner_chunker = new NerConverterInternal()
  .setInputCols(Array("sentences", "tokens", "ade_ner_tags"))
  .setOutputCol("ade_ner_chunks")

val ade_reModel = RelationExtractionModel.pretrained("re_ade_clinical", "en", "clinical/models")
  .setInputCols(Array("embeddings", "pos_tags", "ade_ner_chunks", "dependencies"))
  .setOutputCol("ade_relations")
  .setMaxSyntacticDistance(10)
  .setRelationPairs(Array("drug-ade", "ade-drug"))

val annotation_merger = new AnnotationMerger()
  .setInputCols(Array("ade_relations", "pos_relations"))
  .setInputType("category")
  .setOutputCol("all_relations")

val merger_pipeline = new Pipeline().setStages(Array(
  documenter,
  sentencer,
  tokenizer,
  words_embedder,
  pos_tagger,
  pos_ner_tagger,
  pos_ner_chunker,
  dependency_parser,
  pos_reModel,
  ade_ner_tagger,
  ade_ner_chunker,
  ade_reModel,
  annotation_merger
))


// Show example result

val text =
  """
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands..
"""

val data = Seq(text).toDF("text")
val result = merger_pipeline.fit(data).transform(data)

+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|PosologyRelation                                         |AdeRelation|MergedRelation                                                 |
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|[DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|[1, 1]     |[1, 1, DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
import spark.implicits._

// Create the pipeline with two RE models
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model_date = FinanceNerModel.pretrained("finner_sec_dates", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_dates")

val ner_converter_date = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_dates"))
  .setOutputCol("ner_chunk_date")

val ner_model_org = FinanceNerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_orgs")

val ner_converter_org = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_orgs"))
  .setOutputCol("ner_chunk_org")

val chunk_merger = new ChunkMergeApproach()
  .setInputCols(Array("ner_chunk_org", "ner_chunk_date"))
  .setOutputCol("ner_chunk")

val pos = PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos", "token"))
  .setOutputCol("dependencies")

val re_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunk")
  .setRelationPairs(Array("ORG-ORG", "ORG-DATE"))
  .setMaxSyntacticDistance(10)

val reDL = RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_acq")
  .setPredictionThreshold(0.1)

val reDL_alias = RelationExtractionDLModel.pretrained("finre_org_prod_alias", "en", "finance/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_alias")
  .setPredictionThreshold(0.1)

val annotation_merger = new AnnotationMerger()
  .setInputCols("relations_acq", "relations_alias")
  .setOutputCol("relations")
  .setInputType("category")

val nlpPipeline = new Pipeline().setStages(Array(
  document_assembler,
  text_splitter,
  tokenizer,
  embeddings,
  ner_model_date,
  ner_converter_date,
  ner_model_org,
  ner_converter_org,
  chunk_merger,
  pos,
  dependency_parser,
  re_filter,
  reDL,
  reDL_alias,
  annotation_merger
))

// Show example result
val text = """
Definite-lived intangible assets acquired with Cadence’s fiscal 2021 acquisitions were as follows:
 
Acquisition Date Fair Value
Weighted Average Amortization Period
 
(In thousands)
 (in years)
Existing technology
$
59,100 
13.7 years
Agreements and relationships
28,900 
13.7 years
Tradenames, trademarks and patents
4,600 
14.3 years
Total acquired intangibles with definite lives
$
92,600 
13.7 years
2020 Acquisitions
In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation ("AWR") and Integrand Software, Inc. ("Integrand"). These acquisitions enhanced Cadence’s technology portfolio to address growing radio frequency design activity, driven by expanding use of 5G communications.
The aggregate cash consideration for these acquisitions was $195.6 million, after taking into account cash acquired of $1.5 million. The total purchase consideration was allocated to the assets acquired and liabilities assumed based on their respective estimated fair values on the acquisition dates. Cadence will also make payments to certain employees, subject to continued employment and other performance-based conditions, through the first quarter of fiscal 2023.
 With its acquisitions of AWR and Integrand, Cadence recorded $101.3 million of definite-lived intangible assets with a weighted average amortization period of approximately nine years. The definite-lived intangible assets related primarily to existing technology and customer agreements and relationships. Cadence also recorded $119.4 million of goodwill and $25.1 million of net liabilities, consisting primarily of deferred tax liabilities, assumed deferred revenue and trade accounts receivable. The recorded goodwill was primarily related to the acquired assembled workforce and expected synergies from combining operations of the acquired companies with Cadence. None of the goodwill related to the acquisitions of AWR and Integrand is deductible for tax purposes.
Cadence completed one additional acquisition during fiscal 2020 that was not material to the consolidated financial statements. 
Pro Forma Financial Information
Cadence has not presented pro forma financial information for any of the businesses it acquired during fiscal 2021 and fiscal 2020 because the results of operations for these businesses are not material to Cadence’s consolidated financial statements.
Acquisition-Related Transaction Costs
Transaction costs associated with acquisitions, which consist of professional fees and administrative costs, were not material during fiscal 2021, 2020 or 2019 and were expensed as incurred in Cadence’s consolidated income statements.
NOTE 7. GOODWILL AND ACQUIRED INTANGIBLES
Goodwill
The changes in the carrying amount of goodwill during fiscal 2021 and 2020 were as follows:
 
Gross CarryingAmount
 
(In thousands)
Balance as of December 28, 2019
$
661,856 
Goodwill resulting from acquisitions
120,564 
Effect of foreign currency translation
(333)
Balance as of January 2, 2021
782,087 
Goodwill resulting from acquisitions
154,362 
Effect of foreign currency translation
(8,091)
Balance as of January 1, 2022
$
928,358 
Cadence completed its annual goodwill impairment test during the third quarter of fiscal 2021 and determined that the fair value of Cadence’s single reporting unit exceeded the carrying amount of its net assets and that no impairment existed.
65
"""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|AcqRelation                                                                                     |AliasRelation                                                                           |MergedRelation                                                                                                                                                                          |
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other]|[has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

// Create the pipeline with two RE models
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model_date = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_parties")

val ner_converter_date = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_parties"))
  .setOutputCol("ner_chunk_parties")

val ner_model_org = LegalNerModel.pretrained("legner_whereas_md", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_whereas")

val ner_converter_org = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_whereas"))
  .setOutputCol("ner_chunk_whereas")

val chunk_merger = new ChunkMergeApproach()
  .setInputCols(Array("ner_chunk_whereas", "ner_chunk_parties"))
  .setOutputCol("ner_chunk")

val pos = PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos", "token"))
  .setOutputCol("dependencies")

val re_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunk")
  .setMaxSyntacticDistance(10)

val reDL = RelationExtractionDLModel.pretrained("legre_contract_doc_parties_md", "en", "legal/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_parties")
  .setPredictionThreshold(0.1)

val reDL_alias = RelationExtractionDLModel.pretrained("legre_whereas", "en", "legal/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relations_whereas")
  .setPredictionThreshold(0.1)

val annotation_merger = new AnnotationMerger()
  .setInputCols("relations_parties", "relations_whereas")
  .setOutputCol("relations")
  .setInputType("category")

val nlpPipeline = new Pipeline().setStages(Array(
  document_assembler,
  text_splitter,
  tokenizer,
  embeddings,
  ner_model_date,
  ner_converter_date,
  ner_model_org,
  ner_converter_org,
  chunk_merger,
  pos,
  dependency_parser,
  re_filter,
  reDL,
  reDL_alias,
  annotation_merger
))

// Show example result

val text = """WHEREAS, the Company Entities own certain Copyrights and Know-How which may be used in the Arizona Field, and in connection with the transactions contemplated by the Stock Purchase Agreement, Arizona desires to obtain a license from the Company Entities to use such Intellectual Property on the terms and subject to the conditions set forth herein.
"""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|PartiesRelation              |WhereasRelation                       |MergedRelation                                                     |
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|[signed_by, other, signed_by]|[has_subject, has_subject, has_object]|[signed_by, other, signed_by, has_subject, has_subject, has_object]|
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+

AssertionChunkConverter

Model

This annotator creates a CHUNK column with metadata useful for training an Assertion Status Detection model (see AssertionDL).

In some cases, creating the chunk column from token indices can run into issues that lead to a loss of data for training assertion status models.

The AssertionChunkConverter annotator uses both the begin and end indices of the tokens as input to add more robust metadata to the chunk column in a way that improves the reliability of the indices and avoids loss of data.

Parameters:

  • chunkBeginCol: (Str) The column containing the start index of the chunk.

  • chunkEndCol: (Str) The column containing the end index of the chunk.

  • chunkTextCol: (Str) The column containing the text chunk.

  • outputTokenBeginCol: (Str) The column containing the selected token start index.

  • outputTokenEndCol: (Str) The column containing the selected token end index.

NOTE: Chunk begin and end indices in the assertion status model training dataframe can be populated using the new version of the ALAB module.
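
If you are not using the ALAB module, a minimal sketch like the one below can populate these columns directly from the raw text. It is only an illustration: it assumes each target string occurs in its text, and the end index is treated as inclusive here, so match whatever convention your training data expects:

# Hypothetical helper: derive character offsets of the target chunk from the raw text
rows = [
    ("Patient shows no stomach pain", "stomach pain"),
]
rows_with_offsets = [
    (text, target, text.find(target), text.find(target) + len(target) - 1)
    for text, target in rows
]
data = spark.createDataFrame(rows_with_offsets, ["text", "target", "char_begin", "char_end"])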

Input Annotator Types: TOKEN

Output Annotator Type: CHUNK

Python API: AssertionChunkConverter Scala API: AssertionChunkConverter Notebook: AssertionChunkConverterNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("tokens")

converter = medical.AssertionChunkConverter()\
  .setInputCols("tokens")\
  .setChunkTextCol("target")\
  .setChunkBeginCol("char_begin")\
  .setChunkEndCol("char_end")\
  .setOutputTokenBeginCol("token_begin")\
  .setOutputTokenEndCol("token_end")\
  .setOutputCol("chunk")


pipeline = nlp.Pipeline().setStages(
    [document_assembler, 
     sentenceDetector, 
     tokenizer, 
     converter]
)

data = spark.createDataFrame([
    ["An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.","Minnie", 57, 64,],
    ["After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation","PCP",31,34,],
]).toDF("text", "target", "char_begin", "char_end")

results = pipeline.fit(data).transform(data)

results.selectExpr(
    "target",
    "char_begin",
    "char_end",
    "token_begin",
    "token_end",
    "tokens[token_begin].result",
    "tokens[token_end].result",
    "target",
    "chunk",
).show(truncate=False)

+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk                                         |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57        |64      |10         |10       |Minnie                    |Minnie                  |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP   |31        |34      |5          |5        |PCP                       |PCP                     |PCP   |[{chunk, 31, 33, PCP, {sentence -> 0}, []}]   |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("tokens")

converter = finance.AssertionChunkConverter()\
  .setInputCols("tokens")\
  .setChunkTextCol("target")\
  .setChunkBeginCol("char_begin")\
  .setChunkEndCol("char_end")\
  .setOutputTokenBeginCol("token_begin")\
  .setOutputTokenEndCol("token_end")\
  .setOutputCol("chunk")


pipeline = nlp.Pipeline().setStages(
    [document_assembler, 
     sentenceDetector, 
     tokenizer, 
     converter]
)

data = spark.createDataFrame([
    ["Tom Martin worked as Cadence's CTO until 2010","Cadence's CTO",21,33],
    ["Mrs. Charles was before Managing Director at a big consultancy company","Managing Director",24,40],
]).toDF("text", "target", "char_begin", "char_end")

results = pipeline.fit(data).transform(data)

results.selectExpr(
    "target",
    "char_begin",
    "char_end",
    "token_begin",
    "token_end",
    "tokens[token_begin].result",
    "tokens[token_end].result",
    "target",
    "chunk",
).show(truncate=False)

+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
|target           |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target           |chunk                                                    |
+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
|Cadence's CTO    |21        |33      |4          |4        |Cadence's                 |Cadence's               |Cadence's CTO    |[{chunk, 21, 29, Cadence's CTO, {sentence -> 0}, []}]    |
|Managing Director|24        |40      |5          |5        |Managing                  |Managing                |Managing Director|[{chunk, 24, 31, Managing Director, {sentence -> 0}, []}]|
+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("tokens")

converter = legal.AssertionChunkConverter()\
  .setInputCols("tokens")\
  .setChunkTextCol("target")\
  .setChunkBeginCol("char_begin")\
  .setChunkEndCol("char_end")\
  .setOutputTokenBeginCol("token_begin")\
  .setOutputTokenEndCol("token_end")\
  .setOutputCol("chunk")


pipeline = nlp.Pipeline().setStages(
    [document_assembler,
     sentenceDetector,
     tokenizer, 
     converter]
)

data = spark.createDataFrame([
    ["This Agreement may be executed by different parties hereto","parties",44,50,],
    ["The Administrative Agent will determine the Dollar Equivalent amount","Agent",19,23,],
]).toDF("text", "target", "char_begin", "char_end")

results = pipeline.fit(data).transform(data)

results.selectExpr(
    "target",
    "char_begin",
    "char_end",
    "token_begin",
    "token_end",
    "tokens[token_begin].result",
    "tokens[token_end].result",
    "target",
    "chunk",
).show(truncate=False)

+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|target |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target |chunk                                          |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|parties|44        |50      |7          |6        |parties                   |different               |parties|[{chunk, 44, 42, parties, {sentence -> 0}, []}]|
|Agent  |19        |23      |2          |1        |Agent                     |Administrative          |Agent  |[{chunk, 19, 17, Agent, {sentence -> 0}, []}]  |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("tokens")

val converter = new AssertionChunkConverter()
  .setInputCols("tokens")
  .setOutputCol("chunk")
  .setChunkTextCol("target")
  .setChunkBeginCol("char_begin")
  .setChunkEndCol("char_end")
  .setOutputTokenBeginCol("token_begin")
  .setOutputTokenEndCol("token_end")

val pipeline = new Pipeline().setStages(Array(
                                              document_assembler, 
                                              sentenceDetector, 
                                              tokenizer, 
                                              converter))

val data = Seq(
    ("An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.", "Minnie", 57, 64),
    ("After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation", "PCP", 31, 34)
).toDF("text", "target", "char_begin", "char_end")

val results = pipeline.fit(data).transform(data)

+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk                                         |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57        |64      |10         |10       |Minnie                    |Minnie                  |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP   |31        |34      |5          |5        |PCP                       |PCP                     |PCP   |[{chunk, 31, 33, PCP, {sentence -> 0}, []}]   |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("tokens")

val converter = new AssertionChunkConverter()
  .setInputCols("tokens")
  .setOutputCol("chunk")
  .setChunkTextCol("target")
  .setChunkBeginCol("char_begin")
  .setChunkEndCol("char_end")
  .setOutputTokenBeginCol("token_begin")
  .setOutputTokenEndCol("token_end")

val pipeline = new Pipeline().setStages(Array(
                                              document_assembler, 
                                              sentenceDetector, 
                                              tokenizer, 
                                              converter))

val data = Seq(
    ("Tom Martin worked as Cadence's CTO until 2010", "Cadence's CTO", 21, 33),
    ("Mrs. Charles was before Managing Director at a big consultancy company", "Managing Director", 24, 40)
).toDF("text", "target", "char_begin", "char_end")

val results = pipeline.fit(data).transform(data)

+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
|target           |char_begin|char_end|token_begin|token_end|begin|end |begin_result|end_result|result             |
+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
|Cadence's CTO    |21        |33      |4          |4        |[21] |[29]|Cadence's   |Cadence's |[Cadence's CTO]    |
|Managing Director|24        |40      |5          |5        |[24] |[31]|Managing    |Managing  |[Managing Director]|
+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("tokens")

val converter = new AssertionChunkConverter()
  .setInputCols("tokens")
  .setOutputCol("chunk")
  .setChunkTextCol("target")
  .setChunkBeginCol("char_begin")
  .setChunkEndCol("char_end")
  .setOutputTokenBeginCol("token_begin")
  .setOutputTokenEndCol("token_end")

val pipeline = new Pipeline().setStages(Array(
                                              document_assembler, 
                                              sentenceDetector, 
                                              tokenizer, 
                                              converter))

val data = Seq(
    ("This Agreement may be executed by different parties hereto", "parties", 44, 50),
    ("The Administrative Agent will determine the Dollar Equivalent amount", "Agent", 19, 23)
).toDF("text", "target", "char_begin", "char_end")
 
val results = pipeline.fit(data).transform(data)

+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|target |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target |chunk                                          |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|parties|44        |50      |7          |6        |parties                   |different               |parties|[{chunk, 44, 42, parties, {sentence -> 0}, []}]|
|Agent  |19        |23      |2          |1        |Agent                     |Administrative          |Agent  |[{chunk, 19, 17, Agent, {sentence -> 0}, []}]  |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+

AssertionDL

Model

AssertionDL is a deep learning-based approach used to extract Assertion Status from extracted entities and text. AssertionDLModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained from e.g. a DocumentAssembler, NerConverter and WordEmbeddingsModel. The result is an assertion status annotation for each recognized entity. Possible values include “present”, “absent”, “hypothetical”, “conditional”, “associated_with_other_person”, etc.

Parameters:

  • inputCols: Gets current column names of input annotations.

  • outputCol: Gets output column name of annotations.

  • ScopeWindow: Sets the scope of the window of the assertion expression.

  • EntityAssertionCaseSensitive: Sets the case sensitivity of entities and assertion labels.

  • DoExceptionHandling: If set to True, the annotator tries to process as usual and, if exception-causing data (e.g. a corrupted record/document) is passed to the annotator, an exception warning containing the exception message is emitted (see the sketch below).
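
A minimal sketch of setting these options on the pretrained model used in the example below; the setter names (setEntityAssertionCaseSensitive, setDoExceptionHandling) are assumed from the camel-case convention for the parameters above and are not shown in the original example:

from johnsnowlabs import medical

# Hedged sketch: configure the optional parameters listed above on a pretrained
# AssertionDLModel. Setter names are assumed camel-case forms of those parameters;
# model and column names match the clinical example below.
clinical_assertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setEntityAssertionCaseSensitive(False) \
    .setDoExceptionHandling(True)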

For pretrained models please see the Models Hub for available models.

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionDLModel Scala API: AssertionDLModel Notebook: AssertionDLModelNotebook
Show Example

from johnsnowlabs import nlp, medical
# Define pipeline stages to extract NER chunks first
documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

nerModel = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

nerConverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")

# Then a pretrained AssertionDLModel is used to extract the assertion status
clinicalAssertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
  .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
  .setOutputCol("assertion")

assertionPipeline = nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion
])

data = spark.createDataFrame([
  ["Patient with severe fever and sore throat"],
  ["Patient shows no stomach pain"],
  ["She was maintained on an epidural and PCA for pain control."]]).toDF("text")


# Show results
result = assertionPipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as chunk_result", "assertion.result as assertion_result").show(3, truncate=False)

+--------------------------------+--------------------------------+
|chunk_result                    |assertion_result                |
+--------------------------------+--------------------------------+
|[severe fever, sore throat]     |[present, present]              |
|[stomach pain]                  |[absent]                        |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector =  nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer =  nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =  nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\

assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
    
pipeline = nlp.Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion
    ])

data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")


# Show results
result = pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata, result.assertion.result)).alias("cols"))\
      .select(F.expr("cols['1']['sentence']").alias("sent_id"),
              F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['2']").alias("assertion")).show(truncate=False)

+-------+------------+---------+----------+
|sent_id|chunk       |ner_label|assertion |
+-------+------------+---------+----------+
|0      |McAfee LLC  |ORG      |COMPETITOR|
|0      |Broadcom Inc|ORG      |COMPETITOR|
+-------+------------+---------+----------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings_ner")\

ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings_ner"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DOC", "EFFDATE", "PARTY"])

embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings_ass")

assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings_ass"]) \
    .setOutputCol("assertion")

nlpPipeline = nlp.Pipeline(stages=[
            document_assembler, 
            sentence_detector,
            tokenizer,
            embeddings_ner,
            ner_model,
            ner_converter,
            embeddings_ass,
            assertion
            ])

data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")


# Show results
result = nlpPipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,  
                                     result.ner_chunk.begin, 
                                     result.ner_chunk.end, 
                                     result.ner_chunk.metadata, 
                                     result.assertion.result)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['4']").alias("assertion")).show(truncate=False)

+-------------------------------+-----+---+---------+---------+
|chunk                          |begin|end|ner_label|assertion|
+-------------------------------+-----+---+---------+---------+
|Intellectual Property Agreement|11   |41 |DOC      |PRESENT  |
|Amazon Inc                     |51   |60 |PARTY    |PRESENT  |
|Atlantic Inc                   |67   |78 |PARTY    |PRESENT  |
+-------------------------------+-----+---+---------+---------+

import spark.implicits._
// Define pipeline stages to extract NER chunks first

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val nerModel = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val nerConverter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

// Then a pretrained AssertionDLModel is used to extract the assertion status
val clinicalAssertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings"))
    .setOutputCol("assertion")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion
))

val data = Seq(
  "Patient with severe fever and sore throat",
  "Patient shows no stomach pain",
  "She was maintained on an epidural and PCA for pain control.").toDF("text")
  

// Show results
val result = assertionPipeline.fit(data).transform(data)

+--------------------------------+--------------------------------+
|chunk_result                    |assertion_result                |
+--------------------------------+--------------------------------+
|[severe fever, sore throat]     |[present, present]              |
|[stomach pain]                  |[absent]                        |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector =  new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer =  new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings =  BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val assertion = AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings"))
    .setOutputCol("assertion")
    
val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion
    ))

val data = Seq("Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.").toDF("text")


// Show results
val result = pipeline.fit(data).transform(data)

+-------+------------+---------+----------+
|sent_id|chunk       |ner_label|assertion |
+-------+------------+---------+----------+
|0      |McAfee LLC  |ORG      |COMPETITOR|
|0      |Broadcom Inc|ORG      |COMPETITOR|
+-------+------------+---------+----------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings_ner = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ner")

val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings_ner"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("DOC", "EFFDATE", "PARTY"))

val embeddings_ass = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ass")

val assertion = AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings_ass"))
    .setOutputCol("assertion")
    
val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings_ner,
    ner_model,
    ner_converter,
    embeddings_ass,
    assertion
    ))

val data = Seq("This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc.").toDF("text")


// Show results
val result = pipeline.fit(data).transform(data)

+-------------------------------+-----+---+---------+---------+
|chunk                          |begin|end|ner_label|assertion|
+-------------------------------+-----+---+---------+---------+
|Intellectual Property Agreement|11   |41 |DOC      |PRESENT  |
|Amazon Inc                     |51   |60 |PARTY    |PRESENT  |
|Atlantic Inc                   |67   |78 |PARTY    |PRESENT  |
+-------------------------------+-----+---+---------+---------+

Approach

Trains AssertionDL, a deep learning-based approach used to extract Assertion Status from extracted entities and text. Contains all the methods for training an AssertionDLModel. For pretrained models please use AssertionDLModel and see the Models Hub for available models.

Parameters:

  • inputCols: Gets current column names of input annotations.

  • outputCol: Gets output column name of annotations.

  • ScopeWindow: Sets the scope of the window of the assertion expression.

  • StartCol: Set a column that contains the token number for the start of the target.

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionDLApproach Scala API: AssertionDLApproach Notebook: AssertionDLApproachNotebook
Show Example

from johnsnowlabs import nlp, medical

# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk") \
    .setChunkCol("target")\
    .setStartCol("start")\
    .setStartColByTokenIndex(True)\
    .setFailOnMissing(False)\
    .setLowerCase(True)

token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Define AssertionDLApproach with parameters and start training
assertionStatus = medical.AssertionDLApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setBatchSize(128) \
    .setDropout(0.012) \
    .setLearningRate(0.015) \
    .setEpochs(1) \
    .setStartCol("start") \
    .setEndCol("end") \
    .setMaxSentLen(250)

trainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    assertionStatus
])

assertionResults = trainingPipeline.fit(data).transform(data).cache()
from johnsnowlabs import nlp, finance

# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk")

token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Define AssertionDLApproach with parameters and start training
assertionStatus = finance.AssertionDLApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setBatchSize(128) \
    .setDropout(0.012) \
    .setLearningRate(0.015) \
    .setEpochs(1) \
    .setStartCol("start") \
    .setEndCol("end") \
    .setMaxSentLen(250)

trainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    assertionStatus
])

assertionResults = trainingPipeline.fit(data).transform(data).cache()
from johnsnowlabs import nlp, legal

# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("document")

chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("doc_chunk")

token = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')

roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

# Define AssertionDLApproach with parameters and start training
assertionStatus = legal.AssertionDLApproach()\
    .setLabelCol("assertion_label")\
    .setInputCols(["document", "doc_chunk", "embeddings"])\
    .setOutputCol("assertion")\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setEpochs(2)\
    .setStartCol("tkn_start")\
    .setEndCol("tkn_end")\
    .setMaxSentLen(1200)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
    .setScopeWindow(scope_window)
    #.setValidationSplit(0.2)\    
    #.setDropout(0.1)\    

trainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    roberta_embeddings,
    assertionStatus
])

assertionResults = trainingPipeline.fit(data).transform(data).cache()

import spark.implicits._

// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val chunk = new Doc2Chunk()
  .setInputCols(Array("document"))
  .setOutputCol("chunk")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "embeddings"))
  .setOutputCol("assertion")
  .setBatchSize(128)
  .setDropout(0.012)
  .setLearningRate(0.015)
  .setEpochs(1)
  .setStartCol("start")
  .setEndCol("end")
  .setMaxSentLen(250)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  embeddings,
  assertionStatus
))

val assertionResults = trainingPipeline.fit(data).transform(data).cache()
import spark.implicits._
// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val chunk = new Doc2Chunk()
  .setInputCols(Array("document"))
  .setOutputCol("chunk")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "embeddings"))
  .setOutputCol("assertion")
  .setBatchSize(128)
  .setDropout(0.012)
  .setLearningRate(0.015)
  .setEpochs(1)
  .setStartCol("start")
  .setEndCol("end")
  .setMaxSentLen(250)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  embeddings,
  assertionStatus
))

val assertionResults = trainingPipeline.fit(data).transform(data).cache()
import spark.implicits._

val document = new DocumentAssembler()
    .setInputCol("sentence")
    .setOutputCol("document")

val chunk = new Doc2Chunk()
    .setInputCols(Array("document"))
    .setOutputCol("doc_chunk")
    .setChunkCol("chunk")
    .setStartCol("tkn_start")
    .setStartColByTokenIndex(true)
    .setFailOnMissing(false)
    .setLowerCase(false)

val token = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") 
    .setInputCols(Array("document", "token")) 
    .setOutputCol("embeddings") 
    .setMaxSentenceLength(512)

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
    .setLabelCol("assertion_label")
    .setInputCols(Array("document", "doc_chunk", "embeddings"))
    .setOutputCol("assertion")
    .setBatchSize(128)
    .setLearningRate(0.001)
    .setEpochs(2)
    .setStartCol("tkn_start")
    .setEndCol("tkn_end")
    .setMaxSentLen(1200)
    .setEnableOutputLogs(true)
    .setOutputLogsPath("training_logs/")
    .setGraphFolder(graph_folder)
    .setGraphFile(s"${graph_folder}/assertion_graph.pb")
    .setTestDataset("test_data.parquet")
    .setScopeWindow(scope_window)
    //.setValidationSplit(0.2f)
    //.setDropout(0.1f)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  roberta_embeddings,
  assertionStatus
))

val assertionResults = trainingPipeline.fit(data).transform(data).cache()

AssertionFilterer

Model

Filters entities coming from ASSERTION type annotations and returns the CHUNKS. Filters can be set via a white list on the extracted chunk, the assertion or a regular expression. White list for assertion is enabled by default. To use chunk white list, criteria has to be set to "isin". For regex, criteria has to be set to "regex".

Parameters:

  • whiteList: (list) If defined, list of entities to process. The rest will be ignored.

  • CaseSensitive: (bool) Determines whether the definitions of the white listed entities are case sensitive.

  • regex: (list) List of dash-separated pairs of named entities.

  • criteria: (str) Tag representing the criteria used to filter the chunks. Possible values: assertion (filter by the assertion), isin (filter by the chunk), regex (filter using a regex). See the sketch after this list.

  • entitiesConfidence: (Str) Entity pairs to remove based on the confidence level.
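
A minimal sketch of the non-default filtering modes, assuming the camel-case setters setCriteria and setRegex for the criteria/regex parameters above; the white-list values and the regular expression are illustrative only and do not appear in the original examples:

from johnsnowlabs import medical

# Hedged sketch: filter by the chunk text rather than by the assertion label.
# setCriteria/setRegex are assumed camel-case setters for the criteria/regex
# parameters above; column names match the examples below.
chunk_filterer = medical.AssertionFilterer() \
    .setInputCols(["sentence", "ner_chunk", "assertion"]) \
    .setOutputCol("assertion_filtered") \
    .setCriteria("isin") \
    .setWhiteList(["headache", "alopecia"])

# Or keep only chunks that match a regular expression:
regex_filterer = medical.AssertionFilterer() \
    .setInputCols(["sentence", "ner_chunk", "assertion"]) \
    .setOutputCol("assertion_filtered") \
    .setCriteria("regex") \
    .setRegex(["\\bhead\\w+"])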

Input Annotator Types: DOCUMENT, CHUNK, ASSERTION

Output Annotator Type: CHUNK

Python API: AssertionFilterer Scala API: AssertionFilterer Notebook: AssertionFiltererNotebook
Show Example

from johnsnowlabs import nlp, medical 

# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\
    #.setIncludeAllConfidenceScores(False)

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["PROBLEM", "TEST","TREATMENT"])

clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = medical.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Present"])
# or .setBlackList(["absent"])

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
    ])

data = spark.createDataFrame([["Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain."]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)

# Show results:

result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+------------------------------------------------+--------------------------------------------------+
|ner_chunk                                       |assertion                                         |
+------------------------------------------------+--------------------------------------------------+
|[a headache, a head CT, anxious, Alopecia, pain]|[Present, Hypothetical, Possible, Present, Absent]|
+------------------------------------------------+--------------------------------------------------+

result.select("filtered.result").show(3, truncate=False)
+----------------------+
|result                |
+----------------------+
|[a headache, Alopecia]|
+----------------------+
from johnsnowlabs import nlp, finance 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector =  nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer =  nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =  nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")\

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\

assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

assertion_filterer = finance.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Competitor"])

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion,
    assertion_filterer
    ])

data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")

# Show results

result = pipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+--------------------------+------------------------+
|ner_chunk                 |assertion               |
+--------------------------+------------------------+
|[McAfee LLC, Broadcom Inc]|[COMPETITOR, COMPETITOR]|
+--------------------------+------------------------+

result.select("assertion_filtered.result").show(3, truncate=False)
+--------------------------+
|result                    |
+--------------------------+
|[McAfee LLC, Broadcom Inc]|
+--------------------------+
from johnsnowlabs import nlp, legal 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings_ner")\

ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings_ner"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DOC", "EFFDATE", "PARTY"])

embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings_ass")

assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings_ass"]) \
    .setOutputCol("assertion")

assertion_filterer = legal.AssertionFilterer()\
    .setInputCols("sentence","ner_chunk","assertion")\
    .setOutputCol("assertion_filtered")\
    .setCaseSensitive(False)\
    .setWhiteList(["Present"])


nlpPipeline = nlp.Pipeline(stages=[
            document_assembler,
            sentence_detector,
            tokenizer,
            embeddings_ner,
            ner_model,
            ner_converter,
            embeddings_ass,
            assertion,
            assertion_filterer
            ])

data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")

# Show results

result = nlpPipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+-----------------------------------------------------------+---------------------------+
|ner_chunk                                                  |assertion                  |
+-----------------------------------------------------------+---------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|[PRESENT, PRESENT, PRESENT]|
+-----------------------------------------------------------+---------------------------+

result.select("assertion_filtered.result").show(3, truncate=False)
+-----------------------------------------------------------+
|result                                                     |
+-----------------------------------------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|
+-----------------------------------------------------------+

import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentenceDetector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// Clinical word embeddings trained on PubMED dataset
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") 
    .setInputCols(Array("sentence", "token", "embeddings")) 
    .setOutputCol("ner")
    //.setIncludeAllConfidenceScores(false)

val ner_converter = new NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("PROBLEM", "TEST","TREATMENT"))

val clinical_assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") 
    .setInputCols(Array("sentence", "ner_chunk", "embeddings")) 
    .setOutputCol("assertion")

val assertion_filterer = new AssertionFilterer()
    .setInputCols("sentence","ner_chunk","assertion")
    .setOutputCol("assertion_filtered")
    .setCaseSensitive(false)
    .setWhiteList(Array("Present"))
//or .setBlackList(Array("absent"))

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      clinical_assertion,
      assertion_filterer
))


val text ="""Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain."""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

// Show results:

+------------------------------------------------+--------------------------------------------------+
|ner_chunk                                       |assertion                                         |
+------------------------------------------------+--------------------------------------------------+
|[a headache, a head CT, anxious, Alopecia, pain]|[Present, Hypothetical, Possible, Present, Absent]|
+------------------------------------------------+--------------------------------------------------+

+----------------------+
|result                |
+----------------------+
|[a headache, Alopecia]|
+----------------------+
import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentence_detector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// Financial BERT embeddings (SEC BERT base)
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models") 
    .setInputCols(Array("sentence", "token", "embeddings")) 
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")

val assertion = AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models") 
    .setInputCols(Array("sentence", "ner_chunk", "embeddings")) 
    .setOutputCol("assertion")

val assertion_filterer = new AssertionFilterer()
    .setInputCols("sentence","ner_chunk","assertion")
    .setOutputCol("assertion_filtered")
    .setCaseSensitive(false)
    .setWhiteList(Array("Competitor"))


val nlpPipeline = new Pipeline().setStages(Array(
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        assertion,
        assertion_filterer
))

val text ="""Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

// Show results:
+--------------------------+------------------------+
|ner_chunk                 |assertion               |
+--------------------------+------------------------+
|[McAfee LLC, Broadcom Inc]|[COMPETITOR, COMPETITOR]|
+--------------------------+------------------------+

+--------------------------+
|result                    |
+--------------------------+
|[McAfee LLC, Broadcom Inc]|
+--------------------------+
import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// Legal RoBERTa embeddings
val embeddings_ner = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ner")

val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models") 
    .setInputCols(Array("sentence", "token", "embeddings_ner")) 
    .setOutputCol("ner")
    //.setIncludeAllConfidenceScores(false)

val ner_converter = new NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("DOC", "EFFDATE", "PARTY"))

val embeddings_ass = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") 
    .setInputCols(Array("sentence", "token")) 
    .setOutputCol("embeddings_ass")

val assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") 
    .setInputCols(Array("sentence", "ner_chunk", "embeddings_ass")) 
    .setOutputCol("assertion")

val assertion_filterer = new AssertionFilterer()
    .setInputCols("sentence","ner_chunk","assertion")
    .setOutputCol("assertion_filtered")
    .setCaseSensitive(false)
    .setWhiteList(Array("Present"))


val nlpPipeline = new Pipeline().setStages(Array(
      document_assembler,
      sentence_detector,
      tokenizer,
      embeddings_ner,
      ner_model,
      ner_converter,
      embeddings_ass,
      assertion,
      assertion_filterer
))


val text ="""This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."""

val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)

// Show results:
+-----------------------------------------------------------+---------------------------+
|ner_chunk                                                  |assertion                  |
+-----------------------------------------------------------+---------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|[PRESENT, PRESENT, PRESENT]|
+-----------------------------------------------------------+---------------------------+

+-----------------------------------------------------------+
|result                                                     |
+-----------------------------------------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|
+-----------------------------------------------------------+

AssertionLogReg

Model

This is the main class in the AssertionLogReg family. Logistic Regression is used to extract Assertion Status from extracted entities and text. AssertionLogRegModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained from e.g. a DocumentAssembler, NerConverter and WordEmbeddingsModel. The result is an assertion status annotation for each recognized entity. Possible values are "Negated", "Affirmed" and "Historical".

Unlike the DL model, this class does not extend AnnotatorModel. Instead it extends RawAnnotator, which is why the main point of interest is the transform() method.

At the moment there are no pretrained models available for this class. Please refer to AssertionLogRegApproach to train your own model.

Parameters:

  • setAfter(Int): Length of the context after the target (Default: 13)

  • setBefore(Int): Length of the context before the target (Default: 11)

  • setEndCol(String): Column that contains the token number for the end of the target

  • setStartCol(String): Column that contains the token number for the start of the target

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionLogRegModel Scala API: AssertionLogRegModel Notebook: AssertionLogRegModelNotebook

Approach

Trains a classification method that uses the Logistic Regression algorithm. It is used to extract Assertion Status from extracted entities and text. Contains all the methods for training an AssertionLogRegModel, together with trainWithChunk and trainWithStartEnd. The training parameters below can be set through their camel-case setters; see the sketch after the list.

Parameters:

  • label : Column with label per each token

  • maxIter: This specifies the maximum number of iterations to be performed in the model’s training, default: 26

  • regParam : This specifies the regularization parameter. Regularization helps to control the complexity of the model, aiding in preventing the issue of overfitting.

  • eNetParam : Elastic net parameter

  • beforeParam : Length of the context before the target

  • afterParam : Length of the context after the target

  • startCol : Column that contains the token number for the start of the target

  • endCol : Column that contains the token number for the end of the target
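
A minimal sketch of setting these training parameters, assuming the camel-case setters setMaxIter, setReg, setEnet, setBefore and setAfter for the parameters above; setEnet is the assumed setter for eNetParam, and the values shown are illustrative:

from johnsnowlabs import medical

# Hedged sketch: configure AssertionLogRegApproach with the parameters listed above.
# Setter names are assumed camel-case forms of those parameters; values are illustrative.
assertion_approach = medical.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setMaxIter(26) \
    .setReg(0.01) \
    .setEnet(0.9) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")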

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionLogRegApproach Scala API: AssertionLogRegApproach Notebook: AssertionLogRegApproachNotebook
Show Example

from johnsnowlabs import nlp, medical

# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = medical.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

assertionPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    chunkAssembler,
    tokenizer,
    glove,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import nlp, finance

# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = finance.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

assertionPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    chunkAssembler,
    tokenizer,
    glove,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import nlp, legal

# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = legal.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

assertionPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    chunkAssembler,
    tokenizer,
    glove,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)

import spark.implicits._

// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val glove = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")
  .setCaseSensitive(false)

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "word_embeddings"))
  .setOutputCol("assertion")
  .setReg(0.01)
  .setBefore(11)
  .setAfter(13)
  .setStartCol("start")
  .setEndCol("end")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  chunkAssembler,
  tokenizer,
  glove,
  assertion
))

val assertionModel = assertionPipeline.fit(dataset)
import spark.implicits._

// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val glove = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")
  .setCaseSensitive(false)

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "word_embeddings"))
  .setOutputCol("assertion")
  .setReg(0.01)
  .setBefore(11)
  .setAfter(13)
  .setStartCol("start")
  .setEndCol("end")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  chunkAssembler,
  tokenizer,
  glove,
  assertion
))

val assertionModel = assertionPipeline.fit(dataset)
import spark.implicits._

// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val glove = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")
  .setCaseSensitive(false)

val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "word_embeddings"))
  .setOutputCol("assertion")
  .setReg(0.01)
  .setBefore(11)
  .setAfter(13)
  .setStartCol("start")
  .setEndCol("end")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  chunkAssembler,
  tokenizer,
  glove,
  assertion
))

val assertionModel = assertionPipeline.fit(dataset)

AverageEmbeddings

Model

AverageEmbeddings computes the mean of vector embeddings for two sentences of equal size, producing a unified representation.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols() (see the sketch below).
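
A minimal sketch of that convention; the column names here are placeholders, and the full pipeline example follows below:

from johnsnowlabs import medical

# Hedged sketch: every parameter is set through its camel-case setter.
# The two sentence-embedding columns and the chunk column are placeholder names.
avg_embeddings = medical.AverageEmbeddings() \
    .setInputCols(["sentence_embeddings_a", "sentence_embeddings_b", "chunk"]) \
    .setOutputCol("embeddings")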

Input Annotator Types: SENTENCE_EMBEDDINGS, SENTENCE_EMBEDDINGS, CHUNK

Output Annotator Type: EMBEDDINGS

Python API: AverageEmbeddings Scala API: AverageEmbeddings Notebook: AverageEmbeddingsNotebook
Show Example

from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F

document_assembler =  nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")\

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

doc2Chunk = nlp.Doc2Chunk() \
    .setInputCols("sentence") \
    .setOutputCol("chunk") \
    .setIsArray(True)

sbiobert_base_cased_mli = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols("sentence")\
    .setOutputCol("sbiobert_base_cased_mli")

sent_biobert_clinical_base_cased = nlp.BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en") \
    .setInputCols("sentence") \
    .setOutputCol("sent_biobert_clinical_base_cased")

avg_embeddings = medical.AverageEmbeddings()\
    .setInputCols(["sent_biobert_clinical_base_cased","sbiobert_base_cased_mli","chunk"])\
    .setOutputCol("embeddings")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        doc2Chunk,
        sbiobert_base_cased_mli,
        sent_biobert_clinical_base_cased,
        avg_embeddings
    ])

data = spark.createDataFrame([[" The patient was prescribed 1 capsule of Advil for 5 days "]]).toDF("text")

result = pipeline.fit(data).transform(data)

result_df = result.select(F.explode(F.arrays_zip(result.chunk.result,
                                                 result.chunk.metadata,
                                                 result.sentence.result,
                                                 result.embeddings.embeddings,
                                                 result.sent_biobert_clinical_base_cased.embeddings,
                                                 result.sbiobert_base_cased_mli.embeddings,)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("sentence"),
                          F.expr("cols['1']").alias("sentence_metadata"),
                          F.expr("cols['2']").alias("chunk"),
                          F.expr("cols['3']").alias("embeddings"),
                          F.expr("cols['4']").alias("sent_biobert_clinical_base_cased"),
                          F.expr("cols['5']").alias("sbiobert_base_cased_mli"))

result_df.show(50, truncate=1000)

# Result

+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                          sentence|          sentence_metadata|                                             chunk|                                        embeddings|                  sent_biobert_clinical_base_cased|                           sbiobert_base_cased_mli|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|{sentence -> 0, chunk -> 0}|The patient was prescribed 1 capsule of Advil f...|[0.32466835, 0.12497781, -0.20237188, 0.3716198...|[-0.07857181, -0.061015874, -0.020198729, 0.177...|[0.7279085, 0.3109715, -0.38454503, 0.5657965, ...|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val doc2Chunk = new Doc2Chunk()
  .setInputCols("sentence")
  .setOutputCol("chunk")
  .setIsArray(true)

val sbiobert_base_cased_mli = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
  .setInputCols("sentence")
  .setOutputCol("sbiobert_base_cased_mli")

val sent_biobert_clinical_base_cased = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased","en")
  .setInputCols("sentence")
  .setOutputCol("sent_biobert_clinical_base_cased")

val avg_embeddings = new AverageEmbeddings()
  .setInputCols(Array("sent_biobert_clinical_base_cased","sbiobert_base_cased_mli","chunk"))
  .setOutputCol("embeddings") 

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector, 
    doc2Chunk, 
    sbiobert_base_cased_mli, 
    sent_biobert_clinical_base_cased, 
    avg_embeddings)) 

val data = Seq(" The patient was prescribed 1 capsule of Advil for 5 days").toDF("text")

val result = pipeline.fit(data).transform(data)

// Show results
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                          sentence|          sentence_metadata|                                             chunk|                                        embeddings|                  sent_biobert_clinical_base_cased|                           sbiobert_base_cased_mli|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|{sentence -> 0, chunk -> 0}|The patient was prescribed 1 capsule of Advil f...|[0.32466835, 0.12497781, -0.20237188, 0.3716198...|[-0.07857181, -0.061015874, -0.020198729, 0.177...|[0.7279085, 0.3109715, -0.38454503, 0.5657965, ...|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+

BertForSequenceClassification

Model

BertForSequenceClassification can load Bert Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

Parameters:

  • batchSize: Size of every batch (default: 8).

  • coalesceSentences: Instead of producing one class per sentence (when the input column is "sentence"), output one class per document by averaging the probabilities of all sentences (default: False).

  • maxSentenceLength: Max sentence length to process (default: 128).

  • caseSensitive: Whether to ignore case in tokens for embeddings matching (default: True).

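Each of these parameters has a corresponding camel-case setter. A minimal sketch, reusing the pretrained ADE classifier from the example below:

# Note: setCoalesceSentences only has an effect when the input column is "sentence".
sequenceClassifier = medical.BertForSequenceClassification \
    .pretrained("bert_sequence_classifier_ade", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("classes") \
    .setBatchSize(8) \
    .setCoalesceSentences(True) \
    .setMaxSentenceLength(128) \
    .setCaseSensitive(True)
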
Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: BertForSequenceClassification Scala API: BertForSequenceClassification
Show Example

from johnsnowlabs import nlp, medical
 
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("classes")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])


text =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
     ["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]

data = spark.createDataFrame(text).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("text", "classes.result").show(2,truncate=100)

| text                                                                                           | result |
|------------------------------------------------------------------------------------------------|-------|
| Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair... | [False] |
| Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep... | [False] |

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("classes")

val pipeline =  new Pipeline().setStages(Array(
    document_assembler, 
    tokenizer, 
    sequenceClassifier))

val data = Seq(
  "Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH.",
  "Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."
).toDF("text")

val result = pipeline.fit(data).transform(data)

| text                                                                                           | result |
|------------------------------------------------------------------------------------------------|-------|
| Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair... | [False] |
| Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep... | [False] |

BertForTokenClassifier

Model

BertForTokenClassifier can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) for Named-Entity-Recognition (NER) tasks.

Parameters:

  • caseSensitive (Boolean): Whether to lowercase tokens or not (default: False).

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: BertForTokenClassifier Scala API: BertForTokenClassifier
Show Example

from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
 
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

tokenClassifier = medical.BertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")\
    .setInputCols("token", "sentence")\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline =  nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter
    ])


text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation .
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""

data = spark.createDataFrame([[text]]).toDF("text")

res = pipeline.fit(data).transform(data)

res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \
   .select(F.expr("cols['3']['sentence']").alias("sentence_id"),
           F.expr("cols['0']").alias("chunk"),
           F.expr("cols['2']").alias("end"),
           F.expr("cols['3']['entity']").alias("ner_label"))\
   .filter("ner_label!='O'")\
   .show(truncate=False)

## Result
+-----------+-----------------------------+---+---------+
|sentence_id|chunk                        |end|ner_label|
+-----------+-----------------------------+---+---------+
|0          |gestational diabetes mellitus|67 |PROBLEM  |
|0          |type two diabetes mellitus   |153|PROBLEM  |
|0          |T2DM                         |160|PROBLEM  |
|0          |HTG-induced pancreatitis     |209|PROBLEM  |
|0          |an acute hepatitis           |280|PROBLEM  |
|0          |obesity                      |294|PROBLEM  |
|0          |a body mass index            |317|TEST     |
|0          |BMI                          |323|TEST     |
|0          |polyuria                     |387|PROBLEM  |
|0          |polydipsia                   |400|PROBLEM  |
|0          |poor appetite                |416|PROBLEM  |
|0          |vomiting                     |431|PROBLEM  |
|1          |amoxicillin                  |521|TREATMENT|
|1          |a respiratory tract infection|555|PROBLEM  |
|2          |metformin                    |578|TREATMENT|
|2          |glipizide                    |590|TREATMENT|
|2          |dapagliflozin                |610|TREATMENT|
|2          |T2DM                         |619|PROBLEM  |
|2          |atorvastatin                 |636|TREATMENT|
|2          |gemfibrozil                  |652|TREATMENT|
+-----------+-----------------------------+---+---------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("ner")
    .setCaseSensitive(true)

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val pipeline =  new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector, 
    tokenizer, 
    tokenClassifier, 
    ner_converter))

val text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation .
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

// Result

+-----------+-----------------------------+---+---------+
|sentence_id|chunk                        |end|ner_label|
+-----------+-----------------------------+---+---------+
|0          |gestational diabetes mellitus|67 |PROBLEM  |
|0          |type two diabetes mellitus   |153|PROBLEM  |
|0          |T2DM                         |160|PROBLEM  |
|0          |HTG-induced pancreatitis     |209|PROBLEM  |
|0          |an acute hepatitis           |280|PROBLEM  |
|0          |obesity                      |294|PROBLEM  |
|0          |a body mass index            |317|TEST     |
|0          |BMI                          |323|TEST     |
|0          |polyuria                     |387|PROBLEM  |
|0          |polydipsia                   |400|PROBLEM  |
|0          |poor appetite                |416|PROBLEM  |
|0          |vomiting                     |431|PROBLEM  |
|1          |amoxicillin                  |521|TREATMENT|
|1          |a respiratory tract infection|555|PROBLEM  |
|2          |metformin                    |578|TREATMENT|
|2          |glipizide                    |590|TREATMENT|
|2          |dapagliflozin                |610|TREATMENT|
|2          |T2DM                         |619|PROBLEM  |
|2          |atorvastatin                 |636|TREATMENT|
|2          |gemfibrozil                  |652|TREATMENT|
+-----------+-----------------------------+---+---------+

BertSentenceChunkEmbeddings

Model

This annotator allows aggregating sentence embeddings with NER chunk embeddings to get more specific and accurate resolution codes. It works by averaging the sentence and chunk embeddings to add contextual information to the embedding value. Input to this annotator is the context (sentence) and the NER chunks, while the output is an embedding for each chunk that can be fed to the resolver model.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

  • chunkWeight: Relative weight of chunk embeddings in comparison to sentence embeddings. The value should be between 0 and 1. The default is 0.5, which means the chunk and sentence embeddings are given equal weight.

  • maxSentenceLength: Max sentence length to process (default: 128).

  • caseSensitive: Determines whether the definitions of the white listed entities are case sensitive.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().

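Conceptually, chunkWeight acts as the mixing coefficient between the chunk and sentence embeddings. The snippet below only illustrates that idea with made-up vectors; it is not the annotator's actual implementation:

# Hypothetical illustration of how chunkWeight blends chunk and sentence embeddings.
chunk_weight = 0.5               # 0.0 -> sentence only, 1.0 -> chunk only
sentence_emb = [0.2, -0.1, 0.7]  # made-up sentence embedding
chunk_emb = [0.9, 0.4, -0.3]     # made-up chunk embedding

blended = [chunk_weight * c + (1 - chunk_weight) * s
           for c, s in zip(chunk_emb, sentence_emb)]
print(blended)  # with 0.5 this is simply the element-wise mean of the two vectors
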
For more information and examples of the BertSentenceChunkEmbeddings annotator, you can check the Spark NLP Workshop, in particular the notebook 24.1.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb.

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: BertSentenceChunkEmbeddings Scala API: BertSentenceChunkEmbeddings Notebook: BertSentenceChunkEmbeddingsNotebook
Show Example

from johnsnowlabs import nlp, medical
# Define the pipeline

document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["document", "token"])\
      .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \
      .setInputCols(["document", "token", "word_embeddings"]) \
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['ABBR'])

sentence_chunk_embeddings = medical.BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
      .setInputCols(["document", "ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setChunkWeight(0.5)\
      .setCaseSensitive(True)
    
resolver_pipeline = nlp.Pipeline(
    stages = [
      document_assembler,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      sentence_chunk_embeddings
])


sample_text = [
"""The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991.""",
"""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. 
Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]

from pyspark.sql.types import StringType, IntegerType

df = spark.createDataFrame(sample_text, StringType()).toDF('text')
result = resolver_pipeline.fit(df).transform(df)

result.selectExpr("explode(sentence_embeddings) AS s")\
      .selectExpr("s.result", "slice(s.embeddings, 1, 5) AS averageEmbedding")\
      .show(truncate=False)

+------+--------------------------------------------------------------+
|result|averageEmbedding                                              |
+------+--------------------------------------------------------------+
|IR    |[0.11792798, 0.36022937, -1.0620842, 0.87576616, 0.5389829]   |
|CBC   |[-0.07262431, -0.671684, 0.009878114, 0.76053196, 0.4687413]  |
|AB    |[-0.2781681, -0.43619046, -0.20924012, 0.84943366, 0.40831584]|
|VDRL  |[-0.07109344, -0.20644212, 0.0367461, 0.43459156, 0.3684616]  |
|HIV   |[-0.1740405, -0.4599509, -0.041505605, 0.61368394, 0.66777927]|
+------+--------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("tokens")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
      .setInputCols(Array("document", "tokens"))
      .setOutputCol("word_embeddings")

val nerModel = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models")
      .setInputCols(Array("document", "tokens", "word_embeddings"))
      .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
      .setInputCols("document", "tokens", "ner")
      .setOutputCol("ner_chunk")
      .setWhiteList(Array("ABBR"))

val sentenceChunkEmbeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")
      .setInputCols(Array("document", "ner_chunk"))
      .setOutputCol("sentence_embeddings")
      .setChunkWeight(0.5)
      .setCaseSensitive(true)

val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      wordEmbeddings,
      nerModel,
      nerConverter,
      sentenceChunkEmbeddings))

val sampleText = "The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991." +
  " Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."

val data = Seq(sampleText).toDF("text")
val result = pipeline.fit(data).transform(data)

+------+--------------------------------------------------------------+
|result|averageEmbedding                                              |
+------+--------------------------------------------------------------+
|IR    |[0.11792798, 0.36022937, -1.0620842, 0.87576616, 0.5389829]   |
|CBC   |[-0.07262431, -0.671684, 0.009878114, 0.76053196, 0.4687413]  |
|AB    |[-0.2781681, -0.43619046, -0.20924012, 0.84943366, 0.40831584]|
|VDRL  |[-0.07109344, -0.20644212, 0.0367461, 0.43459156, 0.3684616]  |
|HIV   |[-0.1740405, -0.4599509, -0.041505605, 0.61368394, 0.66777927]|
+------+--------------------------------------------------------------+

Chunk2Token

Model

A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).

When the input is empty, an empty array is returned.

This annotator is especially convenient when using NGramGenerator annotations as inputs to WordEmbeddingsModels.

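For instance, a minimal sketch (assumed stage names, following the example below) of feeding the chunk-based n-gram tokens into a WordEmbeddingsModel:

ngram_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "ngram_tokens"]) \
    .setOutputCol("ngram_embeddings")  # one embedding per n-gram token
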
Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().

Input Annotator Types: CHUNK

Output Annotator Type: TOKEN

Python API: Chunk2Token Scala API: Chunk2Token Notebook: Chunk2TokenNotebook
Show Example

from johnsnowlabs import nlp, medical 
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ngrammer = nlp.NGramGenerator() \
    .setN(2) \
    .setEnableCumulative(False) \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") \
    .setDelimiter("_")

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = medical.Chunk2Token()\
    .setInputCols(["ngrams"])\
    .setOutputCol("ngram_tokens")

trainingPipeline = nlp.Pipeline(stages=[
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token])

data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
result = trainingPipeline.fit(data).transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, False)

+----------------------------------------------------------------+
|col                                                             |
+----------------------------------------------------------------+
|{token, 0, 12, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
|{token, 2, 16, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 14, 25, man_presents, {sentence -> 0, chunk -> 2}, []}  |
|{token, 18, 28, presents_to, {sentence -> 0, chunk -> 3}, []}   |
|{token, 27, 32, to_the, {sentence -> 0, chunk -> 4}, []}        |
+----------------------------------------------------------------+
from johnsnowlabs import nlp, finance

# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ngrammer = nlp.NGramGenerator() \
    .setN(2) \
    .setEnableCumulative(False) \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") 

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = finance.Chunk2Token()\
    .setInputCols(["ngrams"])\
    .setOutputCol("ngram_tokens")

trainingPipeline = nlp.Pipeline(stages=[
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token])

data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")

result = trainingPipeline.fit(data).transform(data)
result.selectExpr("explode(ngram_tokens)").show(5, False)

+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|{token, 0, 14, Our competitors, {sentence -> 0, chunk -> 0}, []}    |
|{token, 4, 22, competitors include, {sentence -> 0, chunk -> 1}, []}|
|{token, 16, 26, include the, {sentence -> 0, chunk -> 2}, []}       |
|{token, 24, 36, the following, {sentence -> 0, chunk -> 3}, []}     |
|{token, 28, 39, following by, {sentence -> 0, chunk -> 4}, []}      |
+--------------------------------------------------------------------+
from johnsnowlabs import nlp, legal 
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

token = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ngrammer = nlp.NGramGenerator() \
    .setN(2) \
    .setEnableCumulative(False) \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") 

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = legal.Chunk2Token()\
    .setInputCols(["ngrams"])\
    .setOutputCol("ngram_tokens")

trainingPipeline = nlp.Pipeline(stages=[
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token])

data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")
result = trainingPipeline.fit(data).transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, False)

+-----------------------------------------------------------------------+
|col                                                                    |
+-----------------------------------------------------------------------+
|{token, 0, 6, This is, {sentence -> 0, chunk -> 0}, []}                |
|{token, 5, 9, is an, {sentence -> 0, chunk -> 1}, []}                  |
|{token, 8, 22, an Intellectual, {sentence -> 0, chunk -> 2}, []}       |
|{token, 11, 31, Intellectual Property, {sentence -> 0, chunk -> 3}, []}|
|{token, 24, 41, Property Agreement, {sentence -> 0, chunk -> 4}, []}   |
+-----------------------------------------------------------------------+

import spark.implicits._

// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val ngrammer = new NGramGenerator()
    .setN(2)
    .setEnableCumulative(false)
    .setInputCols("token")
    .setOutputCol("ngrams")
    .setDelimiter("_")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
    .setInputCols("ngrams")
    .setOutputCol("ngram_tokens")

val trainingPipeline = new Pipeline().setStages(Array(
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token))

val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")

val result = trainingPipeline.fit(data).transform(data)

+----------------------------------------------------------------+
|col                                                             |
+----------------------------------------------------------------+
|{token, 3, 15, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
|{token, 5, 19, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 17, 28, man_presents, {sentence -> 0, chunk -> 2}, []}  |
|{token, 21, 31, presents_to, {sentence -> 0, chunk -> 3}, []}   |
|{token, 30, 35, to_the, {sentence -> 0, chunk -> 4}, []}        |
+----------------------------------------------------------------+
import spark.implicits._

// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val ngrammer = new NGramGenerator()
    .setN(2)
    .setEnableCumulative(false)
    .setInputCols("token")
    .setOutputCol("ngrams")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
    .setInputCols("ngrams")
    .setOutputCol("ngram_tokens")

val trainingPipeline = new Pipeline().setStages(Array(
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token))

val data = Seq(("Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.")).toDF("text")

val result = trainingPipeline.fit(data).transform(data)

+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|{token, 0, 14, Our competitors, {sentence -> 0, chunk -> 0}, []}    |
|{token, 4, 22, competitors include, {sentence -> 0, chunk -> 1}, []}|
|{token, 16, 26, include the, {sentence -> 0, chunk -> 2}, []}       |
|{token, 24, 36, the following, {sentence -> 0, chunk -> 3}, []}     |
|{token, 28, 39, following by, {sentence -> 0, chunk -> 4}, []}      |
+--------------------------------------------------------------------+
import spark.implicits._

// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val token = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val ngrammer = new NGramGenerator()
    .setN(2)
    .setEnableCumulative(false)
    .setInputCols("token")
    .setOutputCol("ngrams")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
    .setInputCols("ngrams")
    .setOutputCol("ngram_tokens")

val trainingPipeline = new Pipeline().setStages(Array(
    document, 
    sentenceDetector, 
    token, 
    ngrammer, 
    chunk2Token))

val data = Seq(("This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc.")).toDF("text")

val result = trainingPipeline.fit(data).transform(data)

+-----------------------------------------------------------------------+
|col                                                                    |
+-----------------------------------------------------------------------+
|{token, 0, 6, This is, {sentence -> 0, chunk -> 0}, []}                |
|{token, 5, 9, is an, {sentence -> 0, chunk -> 1}, []}                  |
|{token, 8, 22, an Intellectual, {sentence -> 0, chunk -> 2}, []}       |
|{token, 11, 31, Intellectual Property, {sentence -> 0, chunk -> 3}, []}|
|{token, 24, 41, Property Agreement, {sentence -> 0, chunk -> 4}, []}   |
+-----------------------------------------------------------------------+

ChunkConverter

Model

Converts chunks coming from the RegexMatcher into chunks with an entity in the metadata.

This annotator is important when the user wants to merge entities identified by NER models with the rule-based matching provided by the RegexMatcher annotator. In the subsequent steps of the pipeline, all the identified entities can then be treated in a unified field.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().

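Stripped down to the regex stages only, the hand-off from RegexMatcher to ChunkConverter looks roughly like this (a minimal Python sketch assembled from the full pipelines below; regex_rules.txt is the rules file written in those examples):

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

regex_matcher = nlp.RegexMatcher() \
    .setInputCols("document") \
    .setStrategy("MATCH_ALL") \
    .setOutputCol("regex_matches") \
    .setExternalRules(path="regex_rules.txt", delimiter=",")

# ChunkConverter promotes the regex matches to CHUNK annotations that carry the rule
# identifier (e.g. SECTION_HEADER) as the entity in the metadata, so they can later
# be merged with NER-detected chunks.
chunk_converter = medical.ChunkConverter() \
    .setInputCols("regex_matches") \
    .setOutputCol("regex_chunk")

pipeline = nlp.Pipeline(stages=[document_assembler, regex_matcher, chunk_converter])
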
Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkConverter Scala API: ChunkConverter Notebook: ChunkConverterNotebook
Show Example

from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F

# Creating the pipeline
rules = r'''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

sample_text = """
POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy.
PROCEDURE:  Excisional biopsy of right cervical lymph node.
ANESTHESIA:  General endotracheal anesthesia.
Specimen:  Right cervical lymph node.
EBL: 10 cc.
COMPLICATIONS:  None.
FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination.
FLUIDS:  Please see anesthesia report.
URINE OUTPUT:  None recorded during the case.
INDICATIONS FOR PROCEDURE:  This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient, an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node.
PROCEDURE IN DETAIL:  The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again, noted on palpation there was an enlarged level 2 cervical lymph node.A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified, and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision, and sterile bandage was applied. The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed.
"""

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models") \
    .setInputCols("sentence","token","embeddings") \
    .setOutputCol("ner")

ner_converter= medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\

regex_matcher = nlp.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='/content/regex_rules.txt', delimiter=',')

chunkConverter = medical.ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

merger= medical.ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    regex_matcher,
    chunkConverter,
    merger
])

data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result, 
                                     result.merged_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)

+----------------------------------------------+--------------+
|                                         chunk| merged_entity|
+----------------------------------------------+--------------+
|                      POSTOPERATIVE DIAGNOSIS:|SECTION_HEADER|
|                      Cervical lymphadenopathy|       PROBLEM|
|                                    PROCEDURE:|SECTION_HEADER|
|Excisional biopsy of right cervical lymph node|          TEST|
|                                   ANESTHESIA:|SECTION_HEADER|
|               General endotracheal anesthesia|     TREATMENT|
|                     Right cervical lymph node|       PROBLEM|
|                                          EBL:|SECTION_HEADER|
|                                COMPLICATIONS:|SECTION_HEADER|
|                                     FINDINGS:|SECTION_HEADER|
|                   Enlarged level 2 lymph node|       PROBLEM|
|                        pathologic examination|          TEST|
|                                       FLUIDS:|SECTION_HEADER|
|                                 URINE OUTPUT:|SECTION_HEADER|
|                    INDICATIONS FOR PROCEDURE:|SECTION_HEADER|
+----------------------------------------------+--------------+
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F

# Creating the pipeline
rules = r'''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

sample_text="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. """

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")
    # .setWhiteList(["ORG"]) # Return only ORG entities

regex_matcher = nlp.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='/content/regex_rules.txt', delimiter=',')

chunkConverter = finance.ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

merger= finance.ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    regex_matcher,
    chunkConverter,
    merger
])

data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result, 
                                     result.merged_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)

+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+
from johnsnowlabs import nlp, legal
import pyspark.sql.functions as F

# Creating the pipeline
rules = r'''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''

with open('regex_rules.txt', 'w') as f:
    f.write(rules)

sample_text="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. """

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter= nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")
    # .setWhiteList(["ORG"]) # Return only ORG entities

regex_matcher = nlp.RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='/content/regex_rules.txt', delimiter=',')

chunkConverter = legal.ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

merger= legal.ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    regex_matcher,
    chunkConverter,
    merger
])

data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result, 
                                     result.merged_chunks.metadata)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("chunk"),
                          F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)

+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+

// The file regex_rules.txt containing the rule
//   \b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
// is assumed to exist already (see the Python example above for how it is written).


import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
    .setInputCols(Array("document")) 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence")) 
    .setOutputCol("token") 

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("embeddings") 

val ner_model = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models")
    .setInputCols("sentence","token","embeddings") 
    .setOutputCol("ner") 

val ner_converter= new NerConverterInternal() 
    .setInputCols(Array("sentence","token","ner")) 
    .setOutputCol("ner_chunk") 

val regex_matcher = new RegexMatcher()
    .setInputCols("document") 
    .setStrategy("MATCH_ALL") 
    .setOutputCol("regex_matches") 
    .setExternalRules(path="/content/regex_rules.txt",delimiter=",") 

val chunkConverter = new ChunkConverter()
    .setInputCols("regex_matches") 
    .setOutputCol("regex_chunk") 

val merger= new ChunkMergeApproach() 
    .setInputCols(Array("regex_chunk","ner_chunk")) 
    .setOutputCol("merged_chunks") 
    .setMergeOverlapping(true) 
    .setChunkPrecedence("field") 

val pipeline= new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings,
    ner_model, 
    ner_converter, 
    regex_matcher, 
    chunkConverter, 
    merger )) 

val data = Seq(("POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy. PROCEDURE: Excisional biopsy of right cervical lymph node. ANESTHESIA: General endotracheal anesthesia. Specimen: Right cervical lymph node. EBL: 10 cc. COMPLICATIONS: None. FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination. FLUIDS: Please see anesthesia report. URINE OUTPUT: None recorded during the case. INDICATIONS FOR PROCEDURE: This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient,an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node. PROCEDURE IN DETAIL: The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again,noted on palpation there was an enlarged level 2 cervical lymph node.A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified,and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision,and sterile bandage was applied. The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed.")).toDF("text")

val result = pipeline.fit(data).transform(data)

+----------------------------------------------+--------------+
|                                         chunk| merged_entity|
+----------------------------------------------+--------------+
|                      POSTOPERATIVE DIAGNOSIS:|SECTION_HEADER|
|                      Cervical lymphadenopathy|       PROBLEM|
|                                    PROCEDURE:|SECTION_HEADER|
|Excisional biopsy of right cervical lymph node|          TEST|
|                                   ANESTHESIA:|SECTION_HEADER|
|               General endotracheal anesthesia|     TREATMENT|
|                     Right cervical lymph node|       PROBLEM|
|                                          EBL:|SECTION_HEADER|
|                                COMPLICATIONS:|SECTION_HEADER|
|                                     FINDINGS:|SECTION_HEADER|
|                   Enlarged level 2 lymph node|       PROBLEM|
|                        pathologic examination|          TEST|
|                                       FLUIDS:|SECTION_HEADER|
|                                 URINE OUTPUT:|SECTION_HEADER|
|                    INDICATIONS FOR PROCEDURE:|SECTION_HEADER|
+----------------------------------------------+--------------+
// The file regex_rules.txt containing the rule
//   \b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
// is assumed to exist already (see the Python example above for how it is written).

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols(Array("document")) 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence")) 
    .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence","token")) 
    .setOutputCol("embeddings") 

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
    .setInputCols(Array("sentence","token","embeddings")) 
    .setOutputCol("ner") 

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")
    // .setWhiteList(Array("ORG")) // Return only ORG entities

val regex_matcher = new RegexMatcher()
    .setInputCols("document") 
    .setStrategy("MATCH_ALL") 
    .setOutputCol("regex_matches") 
    .setExternalRules(path="/content/regex_rules.txt",delimiter=",") 

val chunkConverter = new ChunkConverter()
    .setInputCols("regex_matches") 
    .setOutputCol("regex_chunk") 
 
val merger= new ChunkMergeApproach() 
    .setInputCols(Array("regex_chunk","ner_chunk")) 
    .setOutputCol("merged_chunks") 
    .setMergeOverlapping(true) 
    .setChunkPrecedence("field") 

val pipeline= new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    ner_model, 
    ner_converter, 
    regex_matcher, 
    chunkConverter, 
    merger )) 
    
val data = Seq(("AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price.")).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+
// The file regex_rules.txt containing the rule
//   \b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
// is assumed to exist already (see the Python example above for how it is written).

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols(Array("document")) 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence")) 
    .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence","token")) 
    .setOutputCol("embeddings") 

val ner_model = LegalNerModel.pretrained("legner_org_per_role_date", "en", "legal/models")
    .setInputCols(Array("sentence","token","embeddings")) 
    .setOutputCol("ner") 

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")
    // .setWhiteList(Array("ORG")) // Return only ORG entities

val regex_matcher = new RegexMatcher()
    .setInputCols("document") 
    .setStrategy("MATCH_ALL") 
    .setOutputCol("regex_matches") 
    .setExternalRules(path="/content/regex_rules.txt",delimiter=",") 

val chunkConverter = new ChunkConverter()
    .setInputCols("regex_matches") 
    .setOutputCol("regex_chunk") 
 
val merger= new ChunkMergeApproach() 
    .setInputCols(Array("regex_chunk","ner_chunk")) 
    .setOutputCol("merged_chunks") 
    .setMergeOverlapping(true) 
    .setChunkPrecedence("field") 

val pipeline= new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    ner_model, 
    ner_converter, 
    regex_matcher, 
    chunkConverter, 
    merger )) 
    
val data = Seq(("AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price.")).toDF("text")

val result = pipeline.fit(data).transform(data)
                          
+--------+-------------+
|   chunk|merged_entity|
+--------+-------------+
|Group LP|          ORG|
+--------+-------------+

ChunkEntityResolver

The ChunkEntityResolverModel encompasses the functionality to produce a normalized entity from a specialized ontology or curated dataset (such as ICD-10, RxNorm, SNOMED, etc.). This model includes comprehensive parameters and methods essential for its training. It operates by transforming a dataset that incorporates two Input Annotations: TOKEN and WORD_EMBEDDINGS, sourced from tools like ChunkTokenizer and ChunkEmbeddings Annotators. Ultimately, it generates the normalized entity relevant to the specified trained ontology or curated dataset, ensuring accurate entity resolution within the given context.

For available pretrained models please see the Models Hub.

Important Note: This annotator has been deprecated.

Input Annotator Types: TOKEN, WORD_EMBEDDINGS

Output Annotator Type: ENTITY

Scala API: ChunkEntityResolverModel
Show Example

from johnsnowlabs import nlp, medical
# Using pretrained models for SNOMED
# First the prior steps of the pipeline are defined.
# Output of types TOKEN and WORD_EMBEDDINGS are needed.
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")

docAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("word_embeddings")

icdo_ner = medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("icdo_ner")

icdo_chunk = nlp.NerConverter()\
    .setInputCols(["sentence","token","icdo_ner"])\
    .setOutputCol("icdo_chunk")\
    .setWhiteList(["Cancer"])

icdo_chunk_embeddings = nlp.ChunkEmbeddings() \
    .setInputCols(["icdo_chunk", "word_embeddings"]) \
    .setOutputCol("icdo_chunk_embeddings")

icdo_chunk_resolver = medical.ChunkEntityResolverModel\
    .pretrained("chunkresolve_icdo_clinical", "en", "clinical/models") \
    .setInputCols(["token","icdo_chunk_embeddings"]) \
    .setOutputCol("tm_icdo_code")

clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

ner_chunk_tokenizer = nlp.ChunkTokenizer() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("ner_token")
    
ner_chunk_embeddings = nlp.ChunkEmbeddings() \
    .setInputCols(["ner_chunk", "word_embeddings"]) \
    .setOutputCol("ner_chunk_embeddings")

# Definition of the SNOMED Resolution
ner_snomed_resolver = medical.ChunkEntityResolverModel\
    .pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models") \
    .setInputCols(["ner_token","ner_chunk_embeddings"])\
    .setOutputCol("snomed_result")

pipelineFull = nlp.Pipeline().setStages([
      docAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,

      clinical_ner,
      ner_converter,
      ner_chunk_embeddings,
      ner_chunk_tokenizer,
      ner_snomed_resolver,

      icdo_ner,
      icdo_chunk,
      icdo_chunk_embeddings,
      icdo_chunk_resolver
])
pipelineModelFull = pipelineFull.fit(data)
result = pipelineModelFull.transform(data).cache()

# Show results
result.selectExpr("explode(snomed_result)") \
  .selectExpr(
    "col.metadata.target_text",
    "col.metadata.resolved_text",
    "col.metadata.confidence",
    "col.metadata.all_k_results",
    "col.metadata.all_k_resolutions") \
  .filter("confidence > 0.2").show(5)
+--------------------+--------------------+----------+--------------------+--------------------+
|         target_text|       resolved_text|confidence|       all_k_results|   all_k_resolutions|
+--------------------+--------------------+----------+--------------------+--------------------+
|hypercholesterolemia|Hypercholesterolemia|    0.2524|13644009:::267432...|Hypercholesterole...|
|                 CBC|             Neocyte|    0.4980|259680000:::11573...|Neocyte:::Blood g...|
|                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
|           platelets| Increased platelets|    0.5267|6631009:::2596800...|Increased platele...|
|                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
+--------------------+--------------------+----------+--------------------+--------------------+

import spark.implicits._
// Using pretrained models for SNOMED
// First the prior steps of the pipeline are defined.
// Output of types TOKEN and WORD_EMBEDDINGS are needed.
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")

val docAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("word_embeddings")

val icdo_ner = MedicalNerModel.pretrained("ner_bionlp", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "word_embeddings"))
    .setOutputCol("icdo_ner")

val icdo_chunk = new NerConverter()
    .setInputCols(Array("sentence","token","icdo_ner"))
    .setOutputCol("icdo_chunk")
    .setWhiteList("Cancer")

val icdo_chunk_embeddings = new ChunkEmbeddings()
    .setInputCols(Array("icdo_chunk", "word_embeddings"))
    .setOutputCol("icdo_chunk_embeddings")

val icdo_chunk_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical", "en", "clinical/models")
    .setInputCols(Array("token","icdo_chunk_embeddings"))
    .setOutputCol("tm_icdo_code")

val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "word_embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val ner_chunk_tokenizer = new ChunkTokenizer()
    .setInputCols("ner_chunk")
    .setOutputCol("ner_token")
  
val ner_chunk_embeddings = new ChunkEmbeddings()
    .setInputCols(Array("ner_chunk", "word_embeddings"))
    .setOutputCol("ner_chunk_embeddings")

// Definition of the SNOMED Resolution
val ner_snomed_resolver = ChunkEntityResolverModel
    .pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")
    .setInputCols(Array("ner_token","ner_chunk_embeddings"))
    .setOutputCol("snomed_result")

val pipelineFull = new Pipeline().setStages(Array(
    docAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,

    clinical_ner,
    ner_converter,
    ner_chunk_embeddings,
    ner_chunk_tokenizer,
    ner_snomed_resolver,

    icdo_ner,
    icdo_chunk,
    icdo_chunk_embeddings,
    icdo_chunk_resolver
))
val pipelineModelFull = pipelineFull.fit(data)
val result = pipelineModelFull.transform(data).cache()

// Show results
//
// result.selectExpr("explode(snomed_result)")
//   .selectExpr(
//     "col.metadata.target_text",
//     "col.metadata.resolved_text",
//     "col.metadata.confidence",
//     "col.metadata.all_k_results",
//     "col.metadata.all_k_resolutions")
//   .filter($"confidence" > 0.2).show(5)
// +--------------------+--------------------+----------+--------------------+--------------------+
// |         target_text|       resolved_text|confidence|       all_k_results|   all_k_resolutions|
// +--------------------+--------------------+----------+--------------------+--------------------+
// |hypercholesterolemia|Hypercholesterolemia|    0.2524|13644009:::267432...|Hypercholesterole...|
// |                 CBC|             Neocyte|    0.4980|259680000:::11573...|Neocyte:::Blood g...|
// |                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
// |           platelets| Increased platelets|    0.5267|6631009:::2596800...|Increased platele...|
// |                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
// +--------------------+--------------------+----------+--------------------+--------------------+
//

Contains all the parameters and methods to train a ChunkEntityResolverModel. It transforms a dataset with two Input Annotations of types TOKEN and WORD_EMBEDDINGS, coming from e.g. the ChunkTokenizer and ChunkEmbeddings Annotators, and returns the normalized entity for a particular trained ontology / curated dataset (e.g. ICD-10, RxNorm, SNOMED, etc.).

To use pretrained models please use ChunkEntityResolverModel and see the Models Hub for available models.

Input Annotator Types: TOKEN, WORD_EMBEDDINGS

Output Annotator Type: ENTITY

Scala API: ChunkEntityResolverApproach
Show Example

from johnsnowlabs import nlp, medical
# Training a SNOMED model
# Define the pre-processing pipeline for the training data. It needs to consist of columns for the normalized training data and their labels.
document = nlp.DocumentAssembler() \
    .setInputCol("normalized_text") \
    .setOutputCol("document")

chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk")

token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel\
    .pretrained("embeddings_healthcare_100d", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

chunkEmb = nlp.ChunkEmbeddings() \
    .setInputCols(["chunk", "embeddings"]) \
    .setOutputCol("chunk_embeddings")

snomedTrainingPipeline = nlp.Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    chunkEmb
])

snomedTrainingModel = snomedTrainingPipeline.fit(data)

snomedData = snomedTrainingModel.transform(data).cache()

# Then the Resolver can be trained with
snomedExtractor = medical.ChunkEntityResolverApproach() \
    .setInputCols(["token", "chunk_embeddings"]) \
    .setOutputCol("recognized") \
    .setNeighbours(1000) \
    .setAlternatives(25) \
    .setNormalizedCol("normalized_text") \
    .setLabelCol("label") \
    .setEnableWmd(True).setEnableTfidf(True).setEnableJaccard(True) \
    .setEnableSorensenDice(True).setEnableJaroWinkler(True).setEnableLevenshtein(True) \
    .setDistanceWeights([1, 2, 2, 1, 1, 1]) \
    .setAllDistancesMetadata(True) \
    .setPoolingStrategy("MAX") \
    .setThreshold(1e32)
model = snomedExtractor.fit(snomedData)

import spark.implicits._
// Training a SNOMED model
// Define the pre-processing pipeline for the training data. It needs to consist of columns for the normalized training data and their labels.
val document = new DocumentAssembler()
    .setInputCol("normalized_text")
    .setOutputCol("document")

val chunk = new Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("chunk")

val token = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel
    .pretrained("embeddings_healthcare_100d", "en", "clinical/models")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")

val chunkEmb = new ChunkEmbeddings()
    .setInputCols(Array("chunk", "embeddings"))
    .setOutputCol("chunk_embeddings")

val snomedTrainingPipeline = new Pipeline().setStages(Array(
    document,
    chunk,
    token,
    embeddings,
    chunkEmb
))

val snomedTrainingModel = snomedTrainingPipeline.fit(data)

val snomedData = snomedTrainingModel.transform(data).cache()

// Then the Resolver can be trained with
val snomedExtractor = new ChunkEntityResolverApproach()
    .setInputCols(Array("token", "chunk_embeddings"))
    .setOutputCol("recognized")
    .setNeighbours(1000)
    .setAlternatives(25)
    .setNormalizedCol("normalized_text")
    .setLabelCol("label")
    .setEnableWmd(true).setEnableTfidf(true).setEnableJaccard(true)
    .setEnableSorensenDice(true).setEnableJaroWinkler(true).setEnableLevenshtein(true)
    .setDistanceWeights(Array(1, 2, 2, 1, 1, 1))
    .setAllDistancesMetadata(true)
    .setPoolingStrategy("MAX")
    .setThreshold(1e32)
val model = snomedExtractor.fit(snomedData)

ChunkFilterer

Model

Filters entities coming from CHUNK annotations. Filters can be set via a white list of terms or a regular expression. White list criteria is enabled by default. To use regex, criteria has to be set to regex.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the column containing the output annotations. Only one column can be specified here.

  • criteria: The criteria used to filter the chunks. Possible values are: isIn (filter by the chunk text) or regex (filter using a regular expression).

  • whiteList: If defined, list of entities to process. The rest will be ignored.

  • blackList: If defined, list of entities to ignore. The rest will be processed.

  • regex: If defined, list of regex to process the chunks (Default: []).

  • filterEntity: If equal to “entity”, use the ner label to filter. If set to “result”, use the result attribute of the annotation to filter.

  • entitiesConfidence: Path to csv with pairs (entity,confidenceThreshold). Filter the chunks with entities which have confidence lower than the confidence threshold.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
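
For instance, a minimal sketch of switching the filter from a white list to regex criteria (the camel-case setRegex setter follows the parameter list above; the pattern and column names are illustrative):

from johnsnowlabs import medical

# Minimal sketch (illustrative values): keep only chunks whose text matches a regex
chunkerFilterRegex = medical.ChunkFilterer() \
    .setInputCols(["sentence", "chunk"]) \
    .setOutputCol("filtered") \
    .setCriteria("regex") \
    .setRegex(["gastro.*"])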

Input Annotator Types: DOCUMENT,CHUNK

Output Annotator Type: CHUNK

Python API: ChunkFilterer Scala API: ChunkFilterer Notebook: ChunkFiltererNotebook
Show Example

from johnsnowlabs import nlp, medical

# Filtering POS tags
# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = medical.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["gastroenteritis"])

pipeline = nlp.Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

data = spark.createDataFrame([["Has a past history of gastroenteritis and stomach pain, however patient ..."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+---------------------------------------------------------------------------------+
|col                                                                              |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
+---------------------------------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

# Filtering POS tags
# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only the term "rate" remains.
chunkerFilter = finance.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["rate"])

pipeline = nlp.Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

data = spark.createDataFrame([["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
from johnsnowlabs import nlp, legal

# Filtering POS tags
# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only the term "rate" remains.
chunkerFilter = legal.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["rate"])

pipeline = nlp.Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

data = spark.createDataFrame([["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

// Filtering POS tags
// First pipeline stages to extract the POS tags are defined

import spark.implicits._

val docAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val posTagger = PerceptronModel.pretrained()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("pos") 

val chunker = new Chunker()
 .setInputCols(Array("pos","sentence")) 
 .setOutputCol("chunk") 
 .setRegexParsers(Array("(<NN>)+")) 

val chunkerFilter = new ChunkFilterer()
 .setInputCols(Array("sentence","chunk")) 
 .setOutputCol("filtered") 
 .setCriteria("isin") 
 .setWhiteList(Array("gastroenteritis"))

val pipeline = new Pipeline().setStages(Array(
  docAssembler, 
  sentenceDetector, 
  tokenizer, 
  posTagger, 
  chunker, 
  chunkerFilter)) 

val text ="""Has a past history of gastroenteritis and stomach pain, however patient ..."""

val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)

// result.selectExpr("explode(chunk)").show(truncate=false)
+---------------------------------------------------------------------------------+
|col                                                                              |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
+---------------------------------------------------------------------------------+

// result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
import spark.implicits._

val docAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val posTagger = PerceptronModel.pretrained()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("pos") 

val chunker = new Chunker()
 .setInputCols(Array("pos","sentence")) 
 .setOutputCol("chunk") 
 .setRegexParsers(Array("(<NN>)+")) 

val chunkerFilter = new ChunkFilterer()
 .setInputCols(Array("sentence","chunk")) 
 .setOutputCol("filtered") 
 .setCriteria("isin") 
 .setWhiteList(Array("rate"))

val pipeline = new Pipeline().setStages(Array(
  docAssembler, 
  sentenceDetector, 
  tokenizer, 
  posTagger, 
  chunker, 
  chunkerFilter)) 

val text ="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."""

val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)

// result.selectExpr("explode(chunk)").show(truncate=false)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

// result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
import spark.implicits._

val docAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val posTagger = PerceptronModel.pretrained()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("pos") 

val chunker = new Chunker()
 .setInputCols(Array("pos","sentence")) 
 .setOutputCol("chunk") 
 .setRegexParsers(Array("(<NN>)+")) 

val chunkerFilter = new ChunkFilterer()
 .setInputCols(Array("sentence","chunk")) 
 .setOutputCol("filtered") 
 .setCriteria("isin") 
 .setWhiteList(Array("rate"))

val pipeline = new Pipeline().setStages(Array(
  docAssembler, 
  sentenceDetector, 
  tokenizer, 
  posTagger, 
  chunker, 
  chunkerFilter)) 

val text ="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."""

val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)

// result.selectExpr("explode(chunk)").show(truncate=false)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

// result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------+
|col                                                    |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+

ChunkKeyPhraseExtraction

Model

ChunkKeyPhraseExtraction uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text. The input to the model consists of chunk annotations and sentence or document annotations. The model compares the chunks against the corresponding sentences/documents and selects the chunks which are most representative of the broader text context (i.e. the document or the sentence they belong to). The key phrase candidates (i.e. the input chunks) can be generated in various ways, e.g. by NGramGenerator, TextMatcher or NerConverter. The model operates either at sentence level (selecting the most descriptive chunks from the sentence they belong to) or at document level. In the latter case, the key phrases are selected to represent all the input document annotations.

Parameters:

  • setConcatenateSentences(value: Boolean): Concatenate the input sentence/document annotations before computing their embedding. Default value is ‘true’.

  • setDivergence(value: Float): Set the level of divergence of the extracted key phrases.

  • setDocumentLevelProcessing(value: Boolean): Extract key phrases from the whole document (true) or from the particular sentences which the chunks refer to (false). Default value is ‘true’.

  • setDropPunctuation(value: Boolean): Remove punctuation marks from input chunks.

  • setSelectMostDifferent(value: Boolean): Let the model return the top N key phrases which are the most different from each other.

  • setTopN(value: Int): Set the number of key phrases to extract.

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at the Models Hub.
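
Since the candidate chunks can come from any CHUNK-producing annotator, the following minimal sketch ranks n-gram candidates at document level (the NGramGenerator stage, column names and parameter values are illustrative assumptions, not the only way to produce candidates):

from johnsnowlabs import nlp, medical

# Minimal sketch (illustrative column names): use n-grams as key phrase candidates.
# NGramGenerator emits CHUNK annotations from tokens; the extractor then scores them
# against the whole document.
ngrams = nlp.NGramGenerator() \
    .setInputCols(["tokens"]) \
    .setOutputCol("ngrams") \
    .setN(3) \
    .setEnableCumulative(True)

key_phrases = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setInputCols(["document", "ngrams"]) \
    .setOutputCol("key_phrases") \
    .setTopN(3) \
    .setSelectMostDifferent(True) \
    .setDocumentLevelProcessing(True)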

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkKeyPhraseExtraction Scala API: ChunkKeyPhraseExtraction Notebook: ChunkKeyPhraseExtractionNotebook
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens")

embeddings = nlp.WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = medical.NerModel().pretrained("ner_jsl_slim", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = nlp.NerConverter()\
    .setInputCols("sentences", "tokens", "ner_tags")\
    .setOutputCol("ner_chunks")

key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ner_chunks"])\
    .setOutputCol("ner_chunk_key_phrases")

pipeline = nlp.Pipeline(stages=[
    documenter, 
    sentencer, 
    tokenizer, 
    embeddings, 
    ner_tagger, 
    ner_converter,
    key_phrase_extractor])

data = spark.createDataFrame([["Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."]]).toDF("text")
results = pipeline.fit(data).transform(data)

results.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
       .selectExpr("key_phrase.result",
                   "key_phrase.metadata.entity",
                   "key_phrase.metadata.DocumentSimilarity",
                   "key_phrase.metadata.MMRScore").show(truncate=False)

+--------+-------------------------+------------------+-----------------+
|result  |entity                   |DocumentSimilarity|MMRScore         |
+--------+-------------------------+------------------+-----------------+
|Diabetes|Disease_Syndrome_Disorder|0.66827321499841  |0.400963944931921|
+--------+-------------------------+------------------+-----------------+
from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

key_phrase_extractor = finance.ChunkKeyPhraseExtraction\
    .pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentence", "ner_chunk"])\
    .setOutputCol("ner_chunk_key_phrases")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    key_phrase_extractor])

text = ["""In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)")."""]

data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

result.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
      .selectExpr("key_phrase.result",
                  "key_phrase.metadata.entity",
                  "key_phrase.metadata.DocumentSimilarity",
                  "key_phrase.metadata.MMRScore").show(truncate=False)

+------------------------------+-------+------------------+-------------------+
|result                        |entity |DocumentSimilarity|MMRScore           |
+------------------------------+-------+------------------+-------------------+
|Policy Compliance             |PRODUCT|0.6446724461374882|0.38680348305268175|
|Spell Security Private Limited|ORG    |0.6282153013401193|0.3769291957818915 |
+------------------------------+-------+------------------+-------------------+
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias","en","legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

key_phrase_extractor = legal.ChunkKeyPhraseExtraction\
    .pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentence", "ner_chunk"])\
    .setOutputCol("ner_chunk_key_phrases")

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    key_phrase_extractor])

text = ["""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
"""]

data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

result.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
      .selectExpr("key_phrase.result",
                  "key_phrase.metadata.entity",
                  "key_phrase.metadata.DocumentSimilarity",
                  "key_phrase.metadata.MMRScore").show(truncate=False)

+--------------+------+------------------+-------------------+
|result        |entity|DocumentSimilarity|MMRScore           |
+--------------+------+------------------+-------------------+
|Buyer Entities|ALIAS |0.5680936022739617|0.34085617490878395|
+--------------+------+------------------+-------------------+

import spark.implicits._

val documenter = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentencer = new SentenceDetector()
  .setInputCols(Array("document")) 
  .setOutputCol("sentences") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("document")) 
  .setOutputCol("tokens") 

val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical","en","clinical/models") 
  .setInputCols(Array("document","tokens")) 
  .setOutputCol("embeddings") 

val ner_tagger = MedicalNerModel.pretrained("ner_jsl_slim","en","clinical/models") 
  .setInputCols(Array("sentences","tokens","embeddings")) 
  .setOutputCol("ner_tags") 
 
val ner_converter = new NerConverter()
  .setInputCols("sentences","tokens","ner_tags") 
  .setOutputCol("ner_chunks") 

val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()
  .setTopN(1) 
  .setDocumentLevelProcessing(false) 
  .setDivergence(0.4) 
  .setInputCols(Array("sentences","ner_chunks")) 
  .setOutputCol("ner_chunk_key_phrases") 

val pipeline = new Pipeline().setStages(Array( 
  documenter, 
  sentencer, 
  tokenizer, 
  embeddings, 
  ner_tagger, 
  ner_converter, 
  key_phrase_extractor)) 

val text ="""Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."""
val data = Seq(text).toDF("text")

val results = pipeline.fit(data).transform(data)

+--------+-------------------------+------------------+-----------------+
|result  |entity                   |DocumentSimilarity|MMRScore         |
+--------+-------------------------+------------------+-----------------+
|Diabetes|Disease_Syndrome_Disorder|0.66827321499841  |0.400963944931921|
+--------+-------------------------+------------------+-----------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
  .setInputCols(Array("document") ) 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence")) 
  .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embeddings") 

val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
  .setInputCols(Array("sentence","token","embeddings")) 
  .setOutputCol("ner") 

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 

val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained() 
  .setTopN(1) 
  .setDocumentLevelProcessing(false) 
  .setDivergence(0.4) 
  .setInputCols(Array("sentence","ner_chunk")) 
  .setOutputCol("ner_chunk_key_phrases")

val nlpPipeline = new Pipeline().setStages(Array( 
  documentAssembler, 
  sentenceDetector, 
  tokenizer, 
  embeddings, 
  ner_model, 
  ner_converter, 
  key_phrase_extractor)) 

val text ="""In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)."""
val data = Seq(text).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+------------------------------+-------+------------------+-------------------+
|result                        |entity |DocumentSimilarity|MMRScore           |
+------------------------------+-------+------------------+-------------------+
|Policy Compliance             |PRODUCT|0.6446724461374882|0.38680348305268175|
|Spell Security Private Limited|ORG    |0.6282153013401193|0.3769291957818915 |
+------------------------------+-------+------------------+-------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
  .setInputCols(Array("document") ) 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence")) 
  .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embeddings") 

val ner_model = LegalNerModel.pretrained("legner_orgs_prods_alias","en","legal/models")
  .setInputCols(Array("sentence","token","embeddings")) 
  .setOutputCol("ner") 

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 

val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained() 
  .setTopN(1) 
  .setDocumentLevelProcessing(false) 
  .setDivergence(0.4) 
  .setInputCols(Array("sentence","ner_chunk")) 
  .setOutputCol("ner_chunk_key_phrases")

val nlpPipeline = new Pipeline().setStages(Array( 
  documentAssembler, 
  sentenceDetector, 
  tokenizer, 
  embeddings, 
  ner_model, 
  ner_converter, 
  key_phrase_extractor)) 

val text ="""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""
val data = Seq(text).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+--------------+------+------------------+-------------------+
|result        |entity|DocumentSimilarity|MMRScore           |
+--------------+------+------------------+-------------------+
|Buyer Entities|ALIAS |0.5680936022739617|0.34085617490878395|
+--------------+------+------------------+-------------------+

ChunkMapper

We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.

This is the AnnotatorModel of the ChunkMapper, which can be used to access pretrained models with the .pretrained() or .load() methods. To train a new model, check the documentation of the ChunkMapperApproach annotator.

The annotator also allows fuzzy matching, which can take into consideration parts of the tokens (so chunks can be mapped even when the word order is different), character n-grams (so chunks can be mapped even when there are typos), and fuzzy distance metrics (Jaccard, Levenshtein, etc.).

Parameters:

  • setRels (List[str]): Relations to use when mapping the chunks.

  • setLowerCase (Boolean): Whether to map the chunks in lower case or not (Default: True).

  • setAllowMultiTokenChunk (Boolean): Whether to skip relations with multi-token chunks (Default: True).

  • setMultivaluesRelations (Boolean): Whether to return all values in a relation together or separately (Default: False).

Example usage and more details can be found in the Spark NLP Workshop repository on GitHub, for example in the Healthcare Chunk Mapping notebook.
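
For instance, a minimal sketch of configuring these parameters on a pretrained mapper (the model, column and relation names are taken from the medical example below; the remaining values are illustrative):

from johnsnowlabs import medical

# Minimal sketch: set the relations to map and the matching behaviour
chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper", "en", "clinical/models") \
    .setInputCols(["rxnorm_chunk"]) \
    .setOutputCol("mappings") \
    .setRels(["rxnorm_brandname"]) \
    .setLowerCase(True) \
    .setAllowMultiTokenChunk(True) \
    .setMultivaluesRelations(False)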

Input Annotator Types: CHUNK

Output Annotator Type: LABEL_DEPENDENCY

Python API: ChunkMapperModel Scala API: ChunkMapperModel Notebook: ChunkMapperModelNotebook
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentences"])\
  .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
  .pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentences", "tokens"])\
  .setOutputCol("embeddings")

ner_tagger = medical.NerModel()\
  .pretrained("ner_posology", "en", "clinical/models")\
  .setInputCols("sentences", "tokens", "embeddings")\
  .setOutputCol("ner_tags")

ner_converter = medical.NerConverterInternal()\
  .setInputCols(["sentences", "tokens", "ner_tags"])\
  .setOutputCol("ner_chunks")\
  .setWhiteList(["DRUG"])

chunkToDoc = nlp.Chunk2Doc()\
  .setInputCols("ner_chunks")\
  .setOutputCol("ner_chunks_doc")

sbert_embedder = nlp.BertSentenceEmbeddings\
  .pretrained("sbiobert_base_cased_mli", "en","clinical/models")\
  .setInputCols(["ner_chunks_doc"])\
  .setOutputCol("sbert_embeddings")\
  .setCaseSensitive(False)

rxnorm_resolver = medical.SentenceEntityResolverModel\
  .pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\
  .setInputCols(["sbert_embeddings"])\
  .setOutputCol("rxnorm_code")\
  .setDistanceFunction("EUCLIDEAN")

resolver2chunk = medical.Resolution2Chunk()\
  .setInputCols(["rxnorm_code"]) \
  .setOutputCol("rxnorm_chunk")

chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper", "en", "clinical/models")\
  .setInputCols(["rxnorm_chunk"])\
  .setOutputCol("rxnorm_drug_brandname_mapper")\
  .setRels(["rxnorm_brandname"])

pipeline = nlp.Pipeline(
    stages = [
        documenter,
        sentencer,
        tokenizer,
        words_embedder,
        ner_tagger,
        ner_converter,
        chunkToDoc,
        sbert_embedder,
        rxnorm_resolver,
        resolver2chunk,
        chunkerMapper
        ])


data = spark.createDataFrame([["The doctor prescribed Sinequan 150 MG for depression and Zonalon 50 mg for managing skin itching"]]).toDF("text")

result= pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.ner_chunks.result,
                                     result.rxnorm_code.result,
                                     result.rxnorm_drug_brandname_mapper.result)).alias("cols"))\
      .select(F.expr("cols['0']").alias("ner_chunks"),
              F.expr("cols['1']").alias("rxnorm_code"),
              F.expr("cols['2']").alias("rxnorm_drug_brandname_mapper")).show(15, truncate=100)

+----------+-----------+----------------------------+
|ner_chunks|rxnorm_code|rxnorm_drug_brandname_mapper|
+----------+-----------+----------------------------+
|  Sinequan|     224915|         Sinequan (Sinequan)|
|   Zonalon|       9801|           Zonalon (Zonalon)|
+----------+-----------+----------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

tokenizer = nlp.Tokenizer()\
  .setInputCols("document")\
  .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["document", "token"]) \
  .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\
  .setInputCols(["document", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_chunk")

CM = finance.ChunkMapperModel.pretrained('finmapper_nasdaq_ticker_stock_screener', 'en', 'finance/models')\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  tokenizer, 
  embeddings,
  ner_model, 
  ner_converter, 
  CM])
                                 
text = ["""There are some serious purchases and sales of AMZN stock today."""]

data = spark.createDataFrame([text]).toDF("text")

result = pipeline.fit(data).transform(data)

+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result|result                                                                                                                                                             |
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[AMZN]|[AMZN, Amazon.com Inc. Common Stock, $98.12, 2.85, 2.991%, 9.98556270184E11, United States, 1997, 85412563, Consumer Discretionary, Catalog/Specialty Distribution]|
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

tokenizer = nlp.Tokenizer()\
  .setInputCols("document")\
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \
  .setInputCols(['document', 'token']) \
  .setOutputCol('embeddings')

ner_model = nlp.NerDLModel.pretrained("onto_100", "en") \
  .setInputCols(["document", "token", "embeddings"]) \
  .setOutputCol("ner")
 
ner_converter = nlp.NerConverter()\
  .setInputCols(["document", "token", "ner"])\
  .setOutputCol("ner_chunk")\
  .setWhiteList(["CARDINAL"])

CM = legal.ChunkMapperModel().pretrained("legmapper_edgar_irs", "en", "legal/models")\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  tokenizer, 
  embeddings,
  ner_model, 
  ner_converter, 
  CM])

text = ["""873474341 is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services"""]

data = spark.createDataFrame([text]).toDF("text")

result= pipeline.fit(data).transform(data)

+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result     |result                                                                                                                                                               |
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[873474341]|[Masterworks 096, LLC, RETAIL-RETAIL STORES, NEC [5990], 5990, 873474341, 1231, NY, DE, 225 LIBERTY STREET, NEW YORK, NY, 10281, 2035185172, , , 2022-01-10, 1894064]|
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documenter = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentencer = new SentenceDetector()
 .setInputCols("document")
 .setOutputCol("sentences") 

val tokenizer = new Tokenizer()
 .setInputCols("sentences")
 .setOutputCol("tokens") 

val words_embedder = WordEmbeddingsModel
 .pretrained("embeddings_clinical","en","clinical/models") 
 .setInputCols(Array("sentences","tokens")) 
 .setOutputCol("embeddings") 

val ner_tagger = MedicalNerModel
 .pretrained("ner_posology","en","clinical/models") 
 .setInputCols(Array("sentences","tokens","embeddings"))
 .setOutputCol("ner_tags") 

val ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentences","tokens","ner_tags")) 
 .setOutputCol("ner_chunks") 
 .setWhiteList("DRUG") 

val chunkToDoc = new Chunk2Doc()
 .setInputCols("ner_chunks") 
 .setOutputCol("ner_chunks_doc") 

val sbert_embedder = BertSentenceEmbeddings
 .pretrained("sbiobert_base_cased_mli","en","clinical/models") 
 .setInputCols("ner_chunks_doc")
 .setOutputCol("sbert_embeddings") 
 .setCaseSensitive(false) 

val rxnorm_resolver = SentenceEntityResolverModel
 .pretrained("sbiobertresolve_rxnorm_augmented","en","clinical/models") 
 .setInputCols("sbert_embeddings")
 .setOutputCol("rxnorm_code") 
 .setDistanceFunction("EUCLIDEAN") 

val resolver2chunk = new Resolution2Chunk()
 .setInputCols("rxnorm_code")
 .setOutputCol("rxnorm_chunk") 

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper","en","clinical/models")
 .setInputCols("rxnorm_chunk")
 .setOutputCol("rxnorm_drug_brandname_mapper") 
 .setRels(Array("rxnorm_brandname")) 

val pipeline = new Pipeline().setStages(Array(
 documenter, 
 sentencer, 
 tokenizer, 
 words_embedder, 
 ner_tagger, 
 ner_converter, 
 chunkToDoc, 
 sbert_embedder, 
 rxnorm_resolver, 
 resolver2chunk, 
 chunkerMapper)) 

val text ="""The doctor prescribed Sinequan 150 MG for depression and Zonalon 50 mg for managing skin itching"""
val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+----------+-----------+----------------------------+
|ner_chunks|rxnorm_code|rxnorm_drug_brandname_mapper|
+----------+-----------+----------------------------+
|  Sinequan|     224915|         Sinequan (Sinequan)|
|   Zonalon|       9801|           Zonalon (Zonalon)|
+----------+-----------+----------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val tokenizer = new Tokenizer()
 .setInputCols("document") 
 .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("document","token")) 
 .setOutputCol("embeddings") 

val ner_model = FinanceNerModel.pretrained("finner_ticker","en","finance/models")
 .setInputCols(Array("document","token","embeddings")) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("document","token","ner")) 
 .setOutputCol("ner_chunk") 

val CM = ChunkMapperModel.pretrained("finmapper_nasdaq_ticker_stock_screener","en","finance/models")
 .setInputCols("ner_chunk")
 .setOutputCol("mappings") 

val pipeline = new Pipeline().setStages(Array( 
  document_assembler, 
  tokenizer, 
  embeddings, 
  ner_model, 
  ner_converter, 
  CM) ) 
 
val text ="""There are some serious purchases and sales of AMZN stock today."""
val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result|result                                                                                                                                                             |
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[AMZN]|[AMZN, Amazon.com Inc. Common Stock, $98.12, 2.85, 2.991%, 9.98556270184E11, United States, 1997, 85412563, Consumer Discretionary, Catalog/Specialty Distribution]|
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val tokenizer = new Tokenizer()
 .setInputCols("document") 
 .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("glove_100d")
 .setInputCols(Array("document","token")) 
 .setOutputCol("embeddings") 

val ner_model = NerDLModel.pretrained("onto_100","en")
 .setInputCols(Array("document","token","embeddings")) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("document","token","ner")) 
 .setOutputCol("ner_chunk") 
 .setWhiteList(Array("CARDINAL")) 

val CM = ChunkMapperModel.pretrained("legmapper_edgar_irs","en","legal/models")
 .setInputCols("ner_chunk")
 .setOutputCol("mappings") 

val pipeline = new Pipeline().setStages(Array( 
  document_assembler, 
  tokenizer, 
  embeddings, 
  ner_model, 
  ner_converter, 
  CM) ) 

val text ="""873474341 is an American multinational corporation that is engaged in the design,development,manufacturing,and worldwide marketing and sales of footwear,apparel,equipment,accessories,and services"""
val data = Seq(text).toDF("text")

val result= pipeline.fit(data).transform(data)

+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result     |result                                                                                                                                                               |
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[873474341]|[Masterworks 096, LLC, RETAIL-RETAIL STORES, NEC [5990], 5990, 873474341, 1231, NY, DE, 225 LIBERTY STREET, NEW YORK, NY, 10281, 2035185172, , , 2022-01-10, 1894064]|
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+

We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.

This is the AnnotatorApproach of the ChunkMapper, which can be used to train ChunkMapper models by providing a custom mapping dictionary. To use pretrained models, check the documentation of the ChunkMapperModel annotator.

The annotator also allows fuzzy matching, which can take into consideration parts of the tokens (so chunks can be mapped even when the word order is different), character n-grams (so chunks can be mapped even when there are typos), and fuzzy distance metrics (Jaccard, Levenshtein, etc.).

Example usage and more details can be found in the Spark NLP Workshop repository on GitHub, for example in the Healthcare Chunk Mapping notebook.
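
Once trained and saved (as the examples below do with write().save()), the fitted mapper can be reloaded for inference through ChunkMapperModel; a minimal sketch, assuming a model saved at models/drug_mapper as in the medical example:

from johnsnowlabs import medical

# Minimal sketch (path, columns and relation follow the medical example below):
# reload a mapper trained with ChunkMapperApproach and use it as a pipeline stage.
loadedMapper = medical.ChunkMapperModel.load("models/drug_mapper") \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("mappings") \
    .setRels(["action"])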

Input Annotator Types: CHUNK

Output Annotator Type: LABEL_DEPENDENCY

Python API: ChunkMapperApproach Scala API: ChunkMapperApproach Notebook: ChunkMapperApproachModelNotebook
Show Example

from johnsnowlabs import nlp, medical

# First, create a dictionary in JSON format following this schema:
import json
data_set= {
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }
      ]
    }
  ]
}

with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)


# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

#NER model to detect drug in the text
clinical_ner =  medical.NerModel.pretrained("ner_posology_small","en","clinical/models")\
    .setInputCols(["sentence","token","embeddings"])\
    .setOutputCol("ner")\
    .setLabelCasing("upper")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DRUG"])

chunkerMapper =  medical.ChunkMapperApproach()\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setDictionary("/content/sample_drug.json")\
    .setRels(["action"]) #or treatment

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    chunkerMapper])

text = ["The patient was given 1 unit of metformin daily."]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

model.stages[-1].write().save("models/drug_mapper")
from johnsnowlabs import nlp, finance

# First, create a dictionary in JSON format following this schema:
import json

data_set= {
  "mappings": [
    {
      "key": "Rayton Solar Inc.",
      "relations": [
        {
          "key": "name",
          "values" : ['Rayton Solar Inc.']
        },
        {
          "key": "sic",
          "values" : ['SEMICONDUCTORS & RELATED DEVICES [3674]']
        }]
    }]
}

with open('sample_finance.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")

finance_ner = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
  .setInputCols(["sentence","token","ner"])\
  .setOutputCol("ner_chunk")\
  .setWhiteList(["ORG"]) # Return only ORG entities

all_rels = ["name", "sic"]  # relation keys defined in the mapping dictionary

chunkerMapper = finance.ChunkMapperApproach()\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")\
  .setDictionary("/content/sample_finance.json")\
  .setRels(all_rels)

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  sentence_detector,
  tokenizer,
  word_embeddings,
  finance_ner,
  ner_converter,
  chunkerMapper])

text = ["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. "]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

model.stages[-1].write().save("models/finance_mapper")
from johnsnowlabs import nlp, legal

# First, create a dictionary in JSON format following this schema:
import json

data_set= {
  "mappings": [
    {
      "key": "Rayton Solar Inc.",
      "relations": [
        {
          "key": "name",
          "values" : ['Rayton Solar Inc.']
        },
        {
          "key": "sic",
          "values" : ['SEMICONDUCTORS & RELATED DEVICES [3674]']
        }]
    }]
}

with open('sample_legal.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
  .setInputCol('text')\
  .setOutputCol('document')

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
  .setInputCols(["sentence","token","ner"])\
  .setOutputCol("ner_chunk")\
  .setWhiteList(["ORG"]) # Return only ORG entities

all_rels = ["name", "sic"]  # relation keys defined in the mapping dictionary

chunkerMapper = legal.ChunkMapperApproach()\
  .setInputCols(["ner_chunk"])\
  .setOutputCol("mappings")\
  .setDictionary("/content/sample_legal.json")\
  .setRels(all_rels)

pipeline = nlp.Pipeline().setStages([
  document_assembler,
  sentence_detector,
  tokenizer,
  word_embeddings,
  legal_ner,
  ner_converter,
  chunkerMapper])

text = ["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. "]

test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
res= model.transform(test_data)

model.stages[-1].write().save("models/legal_mapper")

import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
 .setInputCols("document")
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token"))
 .setOutputCol("embeddings")

// NER model to detect drugs in the text

val clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 
 .setLabelCasing("upper") 

val ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 
 .setWhiteList(Array("DRUG")) 

val chunkerMapper = new ChunkMapperApproach()
 .setInputCols("ner_chunk") 
 .setOutputCol("mappings") 
 .setDictionary("/content/sample_drug.json") 
 .setRels(Array("action") ) //or treatment 

val pipeline = new Pipeline()
 .setStages(Array(
  document_assembler, 
  sentence_detector, 
  tokenizer, 
  word_embeddings, 
  clinical_ner, 
  ner_converter, 
  chunkerMapper) ) 
val text = "The patient was given 1 unit of metformin daily."

val test_data = Seq(text).toDF("text")
val model = pipeline.fit(test_data)

val res = model.transform(test_data)
model.stages.last.asInstanceOf[ChunkMapperModel].write.save("models/drug_mapper")
import spark.implicits._

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
 .setInputCols("document")
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("sentence","token") ) 
 .setOutputCol("embeddings") 

val finance_ner = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
 .setInputCols(Array("sentence","token","embeddings") ) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("sentence","token","ner") ) 
 .setOutputCol("ner_chunk") 
 .setWhiteList(Array("ORG") ) // Return only ORG entities 

// relation keys defined in the mapping dictionary
val all_rels = Array("name", "sic")

val chunkerMapper = new ChunkMapperApproach()
 .setInputCols("ner_chunk")
 .setOutputCol("mappings")
 .setDictionary("/content/sample_finance.json")
 .setRels(all_rels)

val pipeline = new Pipeline()
 .setStages(Array( 
  document_assembler, 
  sentence_detector, 
  tokenizer, 
  word_embeddings, 
  finance_ner, 
  ner_converter, 
  chunkerMapper) ) 

val text = "AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."

val test_data = Seq(text).toDF("text")

val model = pipeline.fit(test_data)
val res = model.transform(test_data)

model.stages.last.asInstanceOf[ChunkMapperModel].write.save("models/finance_mapper")
import spark.implicits._
 
val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
 .setInputCols("document")
 .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 

val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 

val legal_ner = LegalNerModel.pretrained("legner_org_per_role_date","en","legal/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 

val ner_converter = new NerConverter()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 
 .setWhiteList("ORG") // Return only ORG entities 

// relation keys defined in the mapping dictionary
val all_rels = Array("name", "sic")

val chunkerMapper = new ChunkMapperApproach()
 .setInputCols("ner_chunk")
 .setOutputCol("mappings")
 .setDictionary("/content/sample_legal.json")
 .setRels(all_rels)

val pipeline = new Pipeline()
 .setStages(Array( 
  document_assembler, 
  sentence_detector, 
  tokenizer, 
  word_embeddings, 
  legal_ner, 
  ner_converter, 
  chunkerMapper) ) 

val text = "AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."
val test_data = Seq(text).toDF("text")
val model = pipeline.fit(test_data)

val res = model.transform(test_data)
model.stages.last.asInstanceOf[ChunkMapperModel].write.save("models/legal_mapper")
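
After training, the fitted mapper saved above can be loaded back as a ChunkMapperModel for inference. The snippet below is a minimal sketch based on the Healthcare example above: it assumes the same upstream stages (document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter) and the models/drug_mapper path used there.

from johnsnowlabs import nlp, medical

# Load the ChunkMapperModel trained and saved in the example above
loaded_mapper = medical.ChunkMapperModel.load("models/drug_mapper")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["action"])

# Reuse the upstream stages from the training pipeline
inference_pipeline = nlp.Pipeline().setStages([
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    loaded_mapper])

result = inference_pipeline.fit(test_data).transform(test_data)
result.selectExpr("explode(mappings) as mapping").show(truncate=False)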

ChunkMapperFilterer

Model

ChunkMapperFilterer is an annotator to be used after ChunkMapper that allows filtering chunks based on the result of the mapping, whether it was successful or failed.

Parameters:

  • ReturnCriteria (String): Has two possible values: “success” or “fail”. If “fail” (default), returns the chunks that are not in the label dependencies; if “success”, returns the labels that were successfully mapped by the ChunkMapperModel annotator.
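
For instance, setting the criteria to “success” keeps only the chunks that were mapped. A minimal sketch, assuming the chunk and mapper output columns used in the example below:

from johnsnowlabs import medical

# Keep only the chunks that the ChunkMapperModel mapped successfully
chunk_mapper_success = medical.ChunkMapperFilterer()\
    .setInputCols(["chunk", "RxNorm_Mapper"])\
    .setOutputCol("chunks_mapped")\
    .setReturnCriteria("success")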

Example usage and more details can be found in the Spark NLP Workshop repository on GitHub, for example the Healthcare Chunk Mapping notebook.

Input Annotator Types: CHUNK, LABEL_DEPENDENCY

Output Annotator Type: CHUNK

Python API: ChunkMapperFilterer Scala API: ChunkMapperFilterer Notebook: ChunkMapperFiltererNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
      .setInputCols(["sentence", "token", "embeddings"])\
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
      .setInputCols("sentence", "token", "ner")\
      .setOutputCol("chunk")

chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
      .setInputCols(["chunk"])\
      .setOutputCol("RxNorm_Mapper")\
      .setRels(["rxnorm_code"])

chunk_mapper_filterer = medical.ChunkMapperFilterer() \
      .setInputCols(["chunk", "RxNorm_Mapper"]) \
      .setOutputCol("chunks_fail") \
      .setReturnCriteria("fail")

mapper_pipeline = nlp.Pipeline(
      stages = [
          document_assembler,
          sentence_detector,
          tokenizer,
          word_embeddings,
          ner_model,
          ner_converter,
          chunkerMapper,
          chunk_mapper_filterer
      ])

samples = [["The patient was given Adapin 10 MG, coumadn 5 mg"],
           ["The patient was given Avandia 4 mg, Tegretol, zitiga"] ]
           
data = spark.createDataFrame(samples).toDF("text")

result = mapper_pipeline.fit(data).transform(data)

result.selectExpr("chunk.result as chunk", 
                  "RxNorm_Mapper.result as RxNorm_Mapper", 
                  "chunks_fail.result as chunks_fail").show(truncate = False)

+--------------------------------+----------------------+--------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |
+--------------------------------+----------------------+--------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |
+--------------------------------+----------------------+--------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
 
val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols("sentence", "token")
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")
    .setInputCols("sentence", "token", "embeddings")
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols("sentence", "token", "ner")
    .setOutputCol("chunk")

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")
    .setInputCols("chunk")
    .setOutputCol("RxNorm_Mapper")
    .setRels(Array("rxnorm_code"))

val chunk_mapper_filterer = new ChunkMapperFilterer()
    .setInputCols("chunk", "RxNorm_Mapper")
    .setOutputCol("chunks_fail")
    .setReturnCriteria("fail")

val mapper_pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner_model,
    ner_converter,
    chunkerMapper,
    chunk_mapper_filterer
    ))


val data = Seq("The patient was given Adapin 10 MG, coumadn 5 mg",
"The patient was given Avandia 4 mg, Tegretol, zitiga").toDF("text")

val result = mapper_pipeline.fit(data).transform(data)

+--------------------------------+----------------------+--------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |
+--------------------------------+----------------------+--------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |
+--------------------------------+----------------------+--------------+

ChunkMerge

Approach

Merges two chunk columns coming from two annotators (NER, ContextualParser, or any other annotator producing chunks). The merger of the two chunk columns is performed by selecting one chunk from one of the columns according to certain criteria. The decision on which chunk to select is made according to the chunk indices in the source document (chunks with longer lengths and the highest information will be kept from each source). Labels can be changed by setReplaceDictResource.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.
  • mergeOverlapping: (Boolean) Sets whether to merge overlapping matched chunks. Default True.
  • falsePositivesResource: Sets file with false positive pairs
  • replaceDictResource: Sets replace dictionary pairs for NER labels
  • blackList: (String List) If defined, list of entities to ignore. The rest will be processed.
  • whiteList: (String List) If defined, list of entities to accept.
  • selectionStrategy: (String) Sets whether to select annotations sequentially based on annotation order (Sequential) or using another available strategy; currently only Sequential and DiverseLonger are available. Default Sequential.
  • orderingFeatures: (String List) The ordering features to use for overlapping entities. Possible values are ChunkBegin, ChunkLength, ChunkPrecedence, ChunkConfidence.
  • defaultConfidence: (Float) Confidence score to use when the ChunkConfidence ordering feature is included and a given annotation does not have a confidence score of its own.
  • chunkPrecedence: (String List) Sets the precedence order to apply when a chunk is labeled by two models.
  • chunkPrecedenceValuePrioritization: (String List) Used when the ChunkPrecedence ordering feature is enabled. Contains an Array of comma-separated values representing the desired prioritization order for the values in the metadata fields included in chunkPrecedence.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
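
As an illustration, here is a minimal sketch of a merger that resolves overlaps by preferring longer, higher-confidence chunks. The setter names are derived from the parameters listed above (in camel case), and the two input chunk columns (ner_chunk_1, ner_chunk_2) are assumed to come from earlier pipeline stages.

from johnsnowlabs import medical

# Hypothetical ChunkMergeApproach configuration using the parameters above;
# "ner_chunk_1" and "ner_chunk_2" are assumed chunk columns from earlier stages
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols(["ner_chunk_1", "ner_chunk_2"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setSelectionStrategy("DiverseLonger")\
    .setOrderingFeatures(["ChunkLength", "ChunkConfidence"])\
    .setDefaultConfidence(0.5)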

Input Annotator Types: CHUNK, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkMergeApproach Scala API: ChunkMergeApproach
Show Example

from johnsnowlabs import nlp, medical

# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# 1- ner_clinical model
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("clinical_ner")

clinical_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "clinical_ner"]) \
    .setOutputCol("clinical_ner_chunk")

# 2- posology ner model
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("posology_ner")

posology_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "posology_ner"]) \
    .setOutputCol("posology_ner_chunk")

# 3- generate a text matcher annotator that extracts female related entities
entities = ['she', 'her', 'girl', 'woman', 'women', 'womanish', 'womanlike', 'womanly', 'madam', 'madame', 'senora', 'lady', 'miss', 'girlfriend', 'wife', 'bride', 'misses', 'mrs.', 'female']
with open ('female_entities.txt', 'w') as f:
    for i in entities:
        f.write(i+'\n')

# Find female entities using TextMatcher
female_entity_extractor = nlp.TextMatcher() \
    .setInputCols(["sentence",'token'])\
    .setOutputCol("female_entities")\
    .setEntities("female_entities.txt")\
    .setCaseSensitive(False)\
    .setEntityValue('female_entity')

# Chunk Merge annotator is used to merge columns
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("posology_ner_chunk", 'clinical_ner_chunk', "female_entities")\
    .setOutputCol('merged_ner_chunk')

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    clinical_ner_converter,
    posology_ner,
    posology_ner_converter,
    female_entity_extractor,
    chunk_merger])

sample_text = """The lady was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."""


data = spark.createDataFrame([[sample_text]]).toDF("text")
model = nlpPipeline.fit(data).transform(data)

# Show results
model.selectExpr("explode(merged_ner_chunk) as a") \
  .selectExpr("a.begin","a.end","a.result as chunk","a.metadata.entity as entity") \
  .show(10, False)
+-----+---+-----------------------------+-------------+
|begin|end|chunk                        |entity       |
+-----+---+-----------------------------+-------------+
|4    |7  |lady                         |female_entity|
|47   |57 |amoxicillin                  |DRUG         |
|63   |91 |a respiratory tract infection|PROBLEM      |
|95   |97 |She                          |female_entity|
|106  |114|metformin                    |DRUG         |
|118  |126|glipizide                    |TREATMENT    |
|134  |146|dapagliflozin                |TREATMENT    |
|152  |155|T2DM                         |PROBLEM      |
|161  |172|atorvastatin                 |DRUG         |
|178  |188|gemfibrozil                  |TREATMENT    |
+-----+---+-----------------------------+-------------+
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert_embeddings")

fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter =  finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"

ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "bert_embeddings"]) \
    .setOutputCol("ner_finner") 
    #.setLabelCasing("upper")

ner_converter_finner = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") \
    .setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER

chunk_merge =  finance.ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge])

data = spark.createDataFrame([["Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"]]).toDF("text")

# Show results
result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result, 
                                     result.deid_merged_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|founder              |ROLE     |
|CEO                  |ROLE     |
|Amazon               |PARTY    |
+---------------------+---------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"})

ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_signers") 
    #.setLabelCasing("upper")

ner_converter_signers = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_signers"]) \
    .setOutputCol("ner_signer_chunk")

chunk_merge = legal.ChunkMergeApproach()\
    .setInputCols("ner_signer_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge])


data = spark.createDataFrame([["""ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.

2THEMART.COM, INC.:  I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99    Date: 6/11/99 """]]).toDF("text")

# Show results
result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result, 
                                     result.deid_merged_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+-----------------------+--------------+
|chunk                  |ner_label     |
+-----------------------+--------------+
|ENTIRE AGREEMENT       |DOC           |
|INC                    |PARTY         |
|J. Magliarditi         |SIGNING_PERSON|
|Bajaj                  |SIGNING_PERSON|
|Dominic J. Magliarditi |SIGNING_PERSON|
|Sanjay Bajaj           |SIGNING_PERSON|
|President              |SIGNING_TITLE |
|VP Business Development|SIGNING_TITLE |
+-----------------------+--------------+

import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP 
val documentAssembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 
 
// Sentence Detector annotator,processes various sentences per line 
val sentenceDetector = new SentenceDetector()
 .setInputCols("document")
 .setOutputCol("sentence") 
 
// Tokenizer splits words in a relevant format for NLP 
val tokenizer = new Tokenizer()
 .setInputCols("sentence") 
 .setOutputCol("token") 
 
// Clinical word embeddings trained on PubMED dataset 
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 
 
// 1- ner_clinical model 
val clinical_ner = MedicalNerModel.pretrained("ner_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("clinical_ner") 

val clinical_ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","clinical_ner")) 
 .setOutputCol("clinical_ner_chunk") 
 
// 2- posology ner model 
val posology_ner = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("posology_ner") 

val posology_ner_converter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","posology_ner")) 
 .setOutputCol("posology_ner_chunk") 
 
// 3- generate a text matcher annotator that extracts female related entities 
val entities = Array("she","her","girl","woman","women","womanish","womanlike","womanly","madam","madame","senora","lady","miss","girlfriend","wife","bride","misses","mrs.","female")

// Write the entities to a file so that TextMatcher can read them
import java.io.PrintWriter
val entitiesWriter = new PrintWriter("female_entities.txt")
entities.foreach(e => entitiesWriter.println(e))
entitiesWriter.close()
 
// Find female entities using TextMatcher 
val female_entity_extractor = new TextMatcher()
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("female_entities") 
 .setEntities("female_entities.txt") 
 .setCaseSensitive(false) 
 .setEntityValue("female_entity") 
 
// Chunk Merge annotator is used to merge columns 
val chunk_merger = new ChunkMergeApproach()
 .setInputCols(Array("posology_ner_chunk","clinical_ner_chunk","female_entities"))
 .setOutputCol("merged_ner_chunk") 

val nlpPipeline = new Pipeline().setStages(Array( 
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    clinical_ner, 
    clinical_ner_converter, 
    posology_ner, 
    posology_ner_converter, 
    female_entity_extractor, 
    chunk_merger)) 

val text ="""The lady was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."""
val data = Seq(text).toDF("text")

val model = nlpPipeline.fit(data).transform(data)

+-----+---+-----------------------------+-------------+
|begin|end|chunk                        |entity       |
+-----+---+-----------------------------+-------------+
|4    |7  |lady                         |female_entity|
|47   |57 |amoxicillin                  |DRUG         |
|63   |91 |a respiratory tract infection|PROBLEM      |
|95   |97 |She                          |female_entity|
|106  |114|metformin                    |DRUG         |
|118  |126|glipizide                    |TREATMENT    |
|134  |146|dapagliflozin                |TREATMENT    |
|152  |155|T2DM                         |PROBLEM      |
|161  |172|atorvastatin                 |DRUG         |
|178  |188|gemfibrozil                  |TREATMENT    |
+-----+---+-----------------------------+-------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val bert_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("bert_embeddings")

val fin_ner = FinanceNerModel.pretrained("finner_deid", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    //.setLabelCasing("upper")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels(Map("ORG" -> "PARTY")) // Replace "ORG" entity with "PARTY"

val ner_finner = FinanceNerModel.pretrained("finner_org_per_role_date", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "bert_embeddings"))
    .setOutputCol("ner_finner")
    //.setLabelCasing("upper")

val ner_converter_finner = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner_finner"))
    .setOutputCol("ner_finner_chunk")
    .setWhiteList(Array("ROLE")) // Keep only "ROLE" entities from this NER

val chunk_merge =  new ChunkMergeApproach()
    .setInputCols(Array("ner_finner_chunk", "ner_chunk"))
    .setOutputCol("deid_merged_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge))

val data = Seq(("Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon")).toDF("text")

// Show results
val result = nlpPipeline.fit(data).transform(data)

+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|founder              |ROLE     |
|CEO                  |ROLE     |
|Amazon               |PARTY    |
+---------------------+---------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val legal_ner = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    //.setLabelCasing("upper")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels(Map("ALIAS" -> "PARTY"))

val ner_signers = LegalNerModel.pretrained("legner_signers", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner_signers")
    //.setLabelCasing("upper")

val ner_converter_signers = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner_signers"))
    .setOutputCol("ner_signer_chunk")

val chunk_merge = new ChunkMergeApproach()
    .setInputCols(Array("ner_signer_chunk", "ner_chunk"))
    .setOutputCol("deid_merged_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge))

val data = Seq("""ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.
2THEMART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99 Date: 6/11/99 """).toDF("text")

// Show results
val result = nlpPipeline.fit(data).transform(data)

+-----------------------+--------------+
|chunk                  |ner_label     |
+-----------------------+--------------+
|ENTIRE AGREEMENT       |DOC           |
|INC                    |PARTY         |
|J. Magliarditi         |SIGNING_PERSON|
|Bajaj                  |SIGNING_PERSON|
|Dominic J. Magliarditi |SIGNING_PERSON|
|Sanjay Bajaj           |SIGNING_PERSON|
|President              |SIGNING_TITLE |
|VP Business Development|SIGNING_TITLE |
+-----------------------+--------------+

ChunkSentenceSplitter

Model

ChunkSentenceSplitter annotator can split documents into chunks according to separators given as a CHUNK column. It is useful when you need to apply different models or analyses to different sections of your document (for example, different headers, clauses, items, etc.). The given separator chunk can be the output of, for example, a RegexMatcher or a NerModel.

Parameters:

  • GroupBySentences: (Boolean) Sets whether to split the paragraphs by grouping the chunks at the sentence level.

  • InsertChunk: (boolean) Whether to insert the chunk in the paragraph or not.

  • DefaultEntity: (str) Sets the key in the metadata dictionary that you want to filter (by default ‘entity’)

For detailed usage of this annotator, visit this notebook from our Spark NLP Workshop.
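
As an alternative to NER-based headers, the separator chunks can also come from a RegexMatcher. The sketch below is illustrative only: the rules file section_rules.txt, its regex, and the column names are assumptions, and the splitter setters follow the parameters listed above.

from johnsnowlabs import nlp, medical

# Hypothetical rules file: one "regex~identifier" pair per line
with open("section_rules.txt", "w") as f:
    f.write("[A-Z ]+:~section_header\n")

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# RegexMatcher produces the CHUNK annotations used as separators
regex_matcher = nlp.RegexMatcher()\
    .setInputCols(["document"])\
    .setOutputCol("section_chunk")\
    .setExternalRules("section_rules.txt", delimiter="~")\
    .setStrategy("MATCH_ALL")

chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
    .setInputCols(["document", "section_chunk"])\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)\
    .setInsertChunk(True)\
    .setDefaultEntity("entity")

pipeline = nlp.Pipeline(stages=[document_assembler, regex_matcher, chunkSentenceSplitter])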

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: DOCUMENT

Python API: ChunkSentenceSplitter Scala API: ChunkSentenceSplitter Notebook: ChunkSentenceSplitterNotebook
Show Example

from johnsnowlabs import nlp, medical

# Defining the pipeline
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl_slim", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Header"])

#applying ChunkSentenceSplitter
chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)

pipeline_model = nlp.Pipeline(
  stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    chunkSentenceSplitter
  ])


sentences = [["""Sample Name: Mesothelioma - Pleural Biopsy
Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)
PREOPERATIVE DIAGNOSIS:  Right pleural effusion and suspected malignant mesothelioma.
POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.
ANESTHESIA: General double-lumen endotracheal.
DESCRIPTION OF FINDINGS:  Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
SPECIMEN:  Pleural biopsies for pathology and microbiology.
INDICATIONS:  Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed.
Dr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case."""]]

df = spark.createDataFrame(sentences).toDF("text")
paragraphs = pipeline_model.fit(df).transform(df)

paragraphs.selectExpr("explode(paragraphs) as result")\
          .selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").show(truncate=80)

+--------------------------------------------------------------------------------+------------+------------------------+
|                                                                          result|      entity|          splitter_chunk|
+--------------------------------------------------------------------------------+------------+------------------------+
|                                     Sample Name: Mesothelioma - Pleural Biopsy |introduction|                     UNK|
|Description: Right pleural effusion and suspected malignant mesothelioma. (Me...|      Header|            Description:|
|PREOPERATIVE DIAGNOSIS:  Right pleural effusion and suspected malignant mesot...|      Header| PREOPERATIVE DIAGNOSIS:|
|POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothel...|      Header|POSTOPERATIVE DIAGNOSIS:|
|                                 ANESTHESIA: General double-lumen endotracheal. |      Header|             ANESTHESIA:|
|DESCRIPTION OF FINDINGS:  Right pleural effusion, firm nodules, diffuse scatt...|      Header|DESCRIPTION OF FINDINGS:|
|                    SPECIMEN:  Pleural biopsies for pathology and microbiology. |      Header|               SPECIMEN:|
|INDICATIONS:  Briefly, this is a 66-year-old gentleman who has been transferr...|      Header|            INDICATIONS:|
+--------------------------------------------------------------------------------+------------+------------------------+
from johnsnowlabs import nlp, finance, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_headers", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

chunkSentenceSplitter = legal.ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)
    
nlp_pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    chunkSentenceSplitter])


text = """2. DEFINITION. 

For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1 and 2 Appointment as Reseller.

2.1 Appointment. 

The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6

2.2 Customer Agreements."""

sdf = spark.createDataFrame([[text]]).toDF("text")
paragraphs = nlp_pipeline.fit(sdf).transform(sdf)

paragraphs.selectExpr("explode(paragraphs) as result")\
          .selectExpr("result.result","result.metadata.entity").show(truncate=50)

+--------------------------------------------------+---------+
|                                            result|   entity|
+--------------------------------------------------+---------+
|                                               2. |   HEADER|
|DEFINITION.   For purposes of this Agreement, t...|SUBHEADER|
|                               2.1 Appointment.   |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
|                        6  2.2 Customer Agreements|   HEADER|
+--------------------------------------------------+---------+
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_headers", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

chunkSentenceSplitter = legal.ChunkSentenceSplitter()\
    .setInputCols("document","ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)
    
nlp_pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    chunkSentenceSplitter])


text = """AGREEMENT

NOW, THEREFORE, for good and valuable consideration, and in consideration of the mutual covenants and conditions herein contained, the Parties agree as follows:

2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller.

2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6

2.2 Customer Agreements.

a) Subscriptions. Allscripts and its Affiliates may sell Subscriptions for terms no less than one year and no greater than four (4) years on a subscription basis to Persons who subsequently execute a Customer Agreement, provided that Allscripts may enter into Customer Agreements with terms longer than four (4) years with large organizations, provided that Phreesia consents in each instance in writing in advance, which consent will not be unreasonably withheld."""

sdf = spark.createDataFrame([[text]]).toDF("text")
paragraphs = nlp_pipeline.fit(sdf).transform(sdf)

paragraphs.selectExpr("explode(paragraphs) as result")\
          .selectExpr("result.result","result.metadata.entity").show(truncate=50)

+--------------------------------------------------+---------+
|                                            result|   entity|
+--------------------------------------------------+---------+
|AGREEMENT NOW, THEREFORE, for good and valuabl... |SUBHEADER|
|                        Appointment as Reseller.  |SUBHEADER|
|                                 2.1 Appointment. |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
|                     6 2.2 Customer Agreements.   |   HEADER|
|a) Subscriptions. Allscripts and its Affiliates...|SUBHEADER|
+--------------------------------------------------+---------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList("Header")

// applying ChunkSentenceSplitter
val chunkSentenceSplitter = new ChunkSentenceSplitter()
    .setInputCols(Array("document","ner_chunk"))
    .setOutputCol("paragraphs")
    .setGroupBySentences(false)

val pipeline_model = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    chunkSentenceSplitter
))


val sentences = ("""Sample Name: Mesothelioma - Pleural Biopsy
Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)
PREOPERATIVE DIAGNOSIS:  Right pleural effusion and suspected malignant mesothelioma.
POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.
ANESTHESIA: General double-lumen endotracheal.
DESCRIPTION OF FINDINGS:  Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
SPECIMEN:  Pleural biopsies for pathology and microbiology.
INDICATIONS:  Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed.
Dr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case.""")

val data = Seq(sentences).toDF("text")
val paragraphs = pipeline_model.fit(data).transform(data)


+--------------------------------------------------------------------------------+------------+------------------------+
|                                                                          result|      entity|          splitter_chunk|
+--------------------------------------------------------------------------------+------------+------------------------+
|                                     Sample Name: Mesothelioma - Pleural Biopsy |introduction|                     UNK|
|Description: Right pleural effusion and suspected malignant mesothelioma. (Me...|      Header|            Description:|
|PREOPERATIVE DIAGNOSIS:  Right pleural effusion and suspected malignant mesot...|      Header| PREOPERATIVE DIAGNOSIS:|
|POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothel...|      Header|POSTOPERATIVE DIAGNOSIS:|
|                                 ANESTHESIA: General double-lumen endotracheal. |      Header|             ANESTHESIA:|
|DESCRIPTION OF FINDINGS:  Right pleural effusion, firm nodules, diffuse scatt...|      Header|DESCRIPTION OF FINDINGS:|
|                    SPECIMEN:  Pleural biopsies for pathology and microbiology. |      Header|               SPECIMEN:|
|INDICATIONS:  Briefly, this is a 66-year-old gentleman who has been transferr...|      Header|            INDICATIONS:|
+--------------------------------------------------------------------------------+------------+------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
        
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = FinanceNerModel.pretrained("finner_headers", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val chunkSentenceSplitter = new ChunkSentenceSplitter()
    .setInputCols(Array("document","ner_chunk"))
    .setOutputCol("paragraphs")
    .setGroupBySentences(false)

val nlp_pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    chunkSentenceSplitter))


val text = """2. DEFINITION. 

For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1 and 2 Appointment as Reseller.

2.1 Appointment. 

The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6

2.2 Customer Agreements."""

val data = Seq(text).toDF("text")
val paragraphs = nlp_pipeline.fit(data).transform(data)


+--------------------------------------------------+---------+
|                                            result|   entity|
+--------------------------------------------------+---------+
|                                               2. |   HEADER|
|DEFINITION.   For purposes of this Agreement, t...|SUBHEADER|
|                               2.1 Appointment.   |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
|                        6  2.2 Customer Agreements|   HEADER|
+--------------------------------------------------+---------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
        
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = LegalNerModel.pretrained("legner_headers", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val chunkSentenceSplitter = new ChunkSentenceSplitter()
    .setInputCols(Array("document","ner_chunk"))
    .setOutputCol("paragraphs")
    .setGroupBySentences(false)
    
val nlp_pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    chunkSentenceSplitter))


val text = """AGREEMENT

NOW, THEREFORE, for good and valuable consideration, and in consideration of the mutual covenants and conditions herein contained, the Parties agree as follows:

2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller.

2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6

2.2 Customer Agreements.

a) Subscriptions. Allscripts and its Affiliates may sell Subscriptions for terms no less than one year and no greater than four (4) years on a subscription basis to Persons who subsequently execute a Customer Agreement, provided that Allscripts may enter into Customer Agreements with terms longer than four (4) years with large organizations, provided that Phreesia consents in each instance in writing in advance, which consent will not be unreasonably withheld."""

val data = Seq(text).toDF("text")
val paragraphs = nlp_pipeline.fit(data).transform(data)

+--------------------------------------------------+---------+
|                                            result|   entity|
+--------------------------------------------------+---------+
|AGREEMENT NOW, THEREFORE, for good and valuabl... |SUBHEADER|
|                        Appointment as Reseller.  |SUBHEADER|
|                                 2.1 Appointment. |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
|                     6 2.2 Customer Agreements.   |   HEADER|
|a) Subscriptions. Allscripts and its Affiliates...|SUBHEADER|
+--------------------------------------------------+---------+

ContextualParser

Approach

Creates a model that extracts entities from a document based on user-defined rules. Rule matching is based on a RegexMatcher defined in a JSON file, which is set through the setJsonPath() parameter. In this JSON file you define the regex you want to match along with the information that will be output in the metadata field. Additionally, a dictionary can be provided with setDictionary to map extracted entities to a unified representation. The first column of the dictionary file should be the representation, with the following columns listing the possible matches.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.
  • jsonPath: Path to json file containing regex patterns and rules to match the entities.
  • dictionary: Path to dictionary file in tsv or csv format.
  • caseSensitive: Whether to use case-sensitive matching of values.
  • prefixAndSuffixMatch: Whether to match both prefix and suffix to annotate the match.
  • optionalContextRules: When set to true, it will output regex matches regardless of context matches.
  • shortestContextMatch: When set to true, it will stop searching for matches as soon as prefix/suffix data is found in the text.
  • completeContextMatch: Whether to do an exact match of prefix and suffix.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
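
For illustration, here is a minimal sketch of a dictionary file usable with setDictionary (the file name gender_dict.tsv and its contents are made up): the first column holds the unified representation, and the remaining tab-separated columns list the possible matches.

from johnsnowlabs import medical

# Hypothetical dictionary file: first column is the normalized form,
# the remaining tab-separated columns are the possible matches
with open("gender_dict.tsv", "w") as f:
    f.write("female\tshe\ther\twoman\n")
    f.write("male\the\this\tman\n")

contextualParser = medical.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setJsonPath("/path/to/regex_token.json") \
    .setDictionary("gender_dict.tsv") \
    .setCaseSensitive(False)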

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Python API: ContextualParserApproach Scala API: ContextualParserApproach Notebook: ContextualParserApproachNotebook
Show Example

from johnsnowlabs import nlp, medical

# An example JSON file `regex_token.json` can look like this:
#
# {
#    "entity": "Stage",
#    "ruleScope": "sentence",
#    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
#    "matchScope": "token"
#  }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this:

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

contextualParser = medical.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setJsonPath("/path/to/regex_token.json") \
    .setCaseSensitive(True) \
    .setContextMatch(False)

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ])

# Define the parser (json file needs to be provided)
data = spark.createDataFrame([["A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... "]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)
+-------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                      |
+-------------------------------------------------------------------------------------------------------------------------+
|{chunk, 32, 39, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}   |
|{chunk, 49, 50, T5, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}         |
|{chunk, 148, 156, cT4bcN2M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 1}, []}|
|{chunk, 189, 194, T?N3M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 2}, []}   |
|{chunk, 316, 323, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 3}, []} |
+-------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

# An example JSON file `regex_token.json` can look like this:
#
# {
#    "entity": "Stage",
#    "ruleScope": "sentence",
#    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
#    "matchScope": "token"
#  }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Define the parser (json file needs to be provided)

contextualParser = finance.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setJsonPath("/path/to/regex_token.json") \
    .setCaseSensitive(True) \
    .setContextMatch(False)

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ])

# Define the parser (json file needs to be provided)
data = spark.createDataFrame([["Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City ."]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)

+---------------------------------------------------------------+
|result                                                         |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+
from johnsnowlabs import nlp, legal

# An example JSON file `regex_token.json` can look like this:
#
# {
#    "entity": "Stage",
#    "ruleScope": "sentence",
#    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
#    "matchScope": "token"
#  }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

contextualParser = legal.ContextualParserApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity") \
    .setJsonPath("/path/to/regex_token.json") \
    .setCaseSensitive(True) \
    .setContextMatch(False)

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ])

# Define the parser (json file needs to be provided)
data = spark.createDataFrame([["Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City ."]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)

+---------------------------------------------------------------+
|result                                                         |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+

import spark.implicits._

// An example JSON file `regex_token.json` can look like this:
//
// {
//    "entity": "Stage",
//    "ruleScope": "sentence",
//    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
//    "matchScope": "token"
//  }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val contextualParser = new ContextualParserApproach()
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("entity")
    .setJsonPath("/path/to/regex_token.json")
    .setCaseSensitive(true)
    .setContextMatch(false)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ))

// Define the parser (json file needs to be provided)
val data = Seq("A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... ").toDF("text")

val result = pipeline.fit(data).transform(data)

// Show Results
//
// result.selectExpr("explode(entity)").show(5, truncate=false)
// +-------------------------------------------------------------------------------------------------------------------------+
// |col                                                                                                                      |
// +-------------------------------------------------------------------------------------------------------------------------+
// |{chunk, 32, 39, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}   |
// |{chunk, 49, 50, T5, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}         |
// |{chunk, 148, 156, cT4bcN2M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 1}, []}|
// |{chunk, 189, 194, T?N3M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 2}, []}   |
// |{chunk, 316, 323, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 3}, []} |
// +-------------------------------------------------------------------------------------------------------------------------+
//
import spark.implicits._

// An example JSON file `regex_token.json` can look like this:
//
// {
//    "entity": "Stage",
//    "ruleScope": "sentence",
//    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
//    "matchScope": "token"
//  }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val contextualParser = new ContextualParserApproach()
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("entity")
    .setJsonPath("/path/to/regex_token.json")
    .setCaseSensitive(true)
    .setContextMatch(false)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ))

// Define the parser (json file needs to be provided)
val data = Seq("Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City .").toDF("text")

val result = pipeline.fit(data).transform(data)

// Show Results

+---------------------------------------------------------------+
|result                                                         |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+
import spark.implicits._

// An example JSON file `regex_token.json` can look like this:
//
// {
//    "entity": "Stage",
//    "ruleScope": "sentence",
//    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
//    "matchScope": "token"
//  }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val contextualParser = new ContextualParserApproach()
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("entity")
    .setJsonPath("/path/to/regex_token.json")
    .setCaseSensitive(true)
    .setContextMatch(false)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ))

// Define the parser (json file needs to be provided)
val data = Seq("Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City .").toDF("text")

val result = pipeline.fit(data).transform(data)

// Show Results
+---------------------------------------------------------------+
|result                                                         |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+

DateNormalizer

Model

This annotator transforms date mentions into a common standard format: YYYY/MM/DD. It is useful when working with data from different sources, sometimes from different countries, that use different formats to represent dates.

For relative dates (next year, past month, etc.), you can define an anchor date to create the normalized date by setting the parameters anchorDateYear, anchorDateMonth, and anchorDateDay.

The resulting date chunk will contain a metadata field indicating whether the normalization was successful or not (true / false).
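
For instance, that flag can be used to surface mentions that could not be normalized. A minimal sketch, assuming a result DataFrame shaped like the ones in the examples below (output column date, input column original_date):

# list only the date mentions whose normalization failed
result.selectExpr(
    "original_date",
    "date.result as normalized_date",
    "date.metadata[0].normalized as normalization_status",
).where("normalization_status = 'false'").show()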

Parameters:

  • anchorDateYear: (Int) Sets an anchor year for the relative dates such as a day after tomorrow. If not set it will use the current year.

  • anchorDateMonth: (Int) Sets an anchor month for the relative dates such as a day after tomorrow. If not set it will use the current month.

  • anchorDateDay: (Int) Sets an anchor day of the month for the relative dates such as a day after tomorrow. If not set, it will use the current day.

  • outputDateformat: (string) Selects the output format to use. If not set, the dates will be formatted as YYYY/MM/DD (see the sketch after this list). Options are:
    • eu: Format the dates as DD/MM/YYYY
    • us: Format the dates as MM/DD/YYYY
  • defaultReplacementDay: (Int) Defines which value to use for creating the Day Value when original Date-Entity has no Day Information. Defaults to 15.

  • defaultReplacementMonth: (Int) Defines which value to use for creating the Month Value when original Date-Entity has no Month Information. Defaults to 06.

  • defaultReplacementYear: (Int) Defines which value to use for creating the Year Value when original Date-Entity has no Year Information. Defaults to 2020.
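
A sketch of these output options follows. The setter names below are assumed from the camel-case convention used for the parameters above (setOutputDateformat, setDefaultReplacementDay, setDefaultReplacementMonth), so verify them against the API reference; the annotator is instantiated from the medical module as in the examples below.

# hedged sketch: format normalized dates as DD/MM/YYYY and control the values
# used when a date mention is missing its day or month component
date_normalizer_eu = medical.DateNormalizer() \
    .setInputCols("date_chunk") \
    .setOutputCol("date") \
    .setOutputDateformat("eu") \
    .setDefaultReplacementDay(1) \
    .setDefaultReplacementMonth(1)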

Input Annotator Types: CHUNK

Output Annotator Type: CHUNK

Python API: DateNormalizer Scala API: DateNormalizer Notebook: DateNormalizerNotebook
Show Example

from johnsnowlabs import nlp, medical
from pyspark.sql.types import StringType

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

dates = [
    "08/02/2018",
    "11/2018",
    "11/01/2018",
    "12Mar2021",
    "Jan 30, 2018",
    "13.04.1999",
    "3April 2020",
    "next monday",
    "today",
    "next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")

result = pipeline.fit(df).transform(df)
result.selectExpr(
    "date.result as normalized_date",
    "original_date",
    "date.metadata[0].normalized as metadata",
).show()

+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+
from johnsnowlabs import nlp, finance
from pyspark.sql.types import StringType

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = finance.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

dates = [
    "08/02/2018",
    "11/2018",
    "11/01/2018",
    "12Mar2021",
    "Jan 30, 2018",
    "13.04.1999",
    "3April 2020",
    "next monday",
    "today",
    "next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")

result = pipeline.fit(df).transform(df)
result.selectExpr(
    "date.result as normalized_date",
    "original_date",
    "date.metadata[0].normalized as metadata",
).show()

+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+
from johnsnowlabs import nlp, legal
from pyspark.sql.types import StringType

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("original_date")\
    .setOutputCol("document")

doc2chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")

date_normalizer = legal.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("date")\
    .setAnchorDateYear(2000)

pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

dates = [
    "08/02/2018",
    "11/2018",
    "11/01/2018",
    "12Mar2021",
    "Jan 30, 2018",
    "13.04.1999",
    "3April 2020",
    "next monday",
    "today",
    "next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")

result = pipeline.fit(df).transform(df)
result.selectExpr(
    "date.result as normalized_date",
    "original_date",
    "date.metadata[0].normalized as metadata",
).show()

+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("original_date")
    .setOutputCol("document")

val doc2chunk = new Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("date_chunk")

val date_normalizer = new DateNormalizer()
    .setInputCols("date_chunk")
    .setOutputCol("date")
    .setAnchorDateYear(2000)

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    doc2chunk, 
    date_normalizer
))

val df = Seq(("08/02/2018"),("11/2018"),("11/01/2018"),("next monday"),("today"),("next week")).toDF("original_date")

val result = pipeline.fit(df).transform(df)

+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("original_date")
    .setOutputCol("document")

val doc2chunk = new Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("date_chunk")

val date_normalizer = new DateNormalizer()
    .setInputCols("date_chunk")
    .setOutputCol("date")
    .setAnchorDateYear(2000)

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    doc2chunk, 
    date_normalizer
))
 
val df = Seq(("08/02/2018"),("11/2018"),("11/01/2018"),("next monday"),("today"),("next week")).toDF("original_date")

val result = pipeline.fit(df).transform(df)


+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("original_date")
    .setOutputCol("document")

val doc2chunk = new Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("date_chunk")

val date_normalizer = new DateNormalizer()
    .setInputCols("date_chunk")
    .setOutputCol("date")
    .setAnchorDateYear(2000)

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    doc2chunk, 
    date_normalizer
))
 
val df = Seq(("08/02/2018"),("11/2018"),("11/01/2018"),("next monday"),("today"),("next week")).toDF("original_date")

val result = pipeline.fit(df).transform(df)


+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/15]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/12/11]|  next monday|    true|
|   [2000/12/06]|        today|    true|
|   [2000/12/13]|    next week|    true|
+---------------+-------------+--------+

DeIdentification

Deidentification is a critical technology for facilitating the use of structured or unstructured clinical text while protecting patient privacy and confidentiality. The John Snow Labs team has invested great effort in developing methods and corpora for the deidentification of clinical text, PDFs, images, and DICOM files containing Protected Health Information (PHI):

  • the individual’s past, present, or future physical or mental health or condition.
  • the provision of health care to the individual.
  • past, present, or future payment for that health care.

Protected health information includes many common identifiers (e.g., name, address, birth date, Social Security Number) when they can be associated with the health information.

Spark NLP for Healthcare provides several techniques and strategies for deidentification; the principal ones are:

Mask:

  • entity_labels: Mask with the entity type of that chunk (default).
  • same_length_chars: Mask the deidentified entities with the same number of asterisks ( * ) as the original, with brackets ( [ , ] ) on both ends.
  • fixed_length_chars: Mask the deidentified entities with a fixed number of asterisks ( * ). The length is set using the setFixedMaskLength() method. (A short sketch of these masking policies is shown just before the parameter list below.)

Obfuscation: replace sensitive entities with random values of the same type.

Faker: allows the user to use a set of fake entities that are kept in the memory of spark-nlp-internal.

There is also an advanced option that allows deidentifying with multiple modes at the same time (Multi-Mode Deidentification). The annotator deidentifies input annotations of types DOCUMENT, TOKEN, and CHUNK by either masking or obfuscating the given CHUNKS.
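
As a quick illustration of the masking policies above, a minimal sketch that assumes a pipeline wired as in the full examples further below (with sentence, token, and ner_chunk columns already produced by a document assembler, tokenizer, and NER converter):

# mask with the entity label, e.g. "Smith" -> <NAME> (default policy)
deid_labels = medical.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("masked_with_labels") \
    .setMode("mask") \
    .setMaskingPolicy("entity_labels")

# mask with asterisks of the same length as the original, e.g. "Smith" -> [***]
deid_same_length = medical.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("masked_same_length") \
    .setMode("mask") \
    .setMaskingPolicy("same_length_chars")

# mask with a fixed number of asterisks, e.g. "Smith" -> ****
deid_fixed_length = medical.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("masked_fixed_length") \
    .setMode("mask") \
    .setMaskingPolicy("fixed_length_chars") \
    .setFixedMaskLength(4)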

Parameters:

  • ageRanges: (IntArrayParam) List of integers specifying the limits of the age groups to preserve during obfuscation.

  • blackList: (StringArrayParam) List of entities that will be ignored in the regex file.

  • consistentObfuscation: (BooleanParam) Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.

  • dateFormats: (StringArrayParam) Formats of the dates to displace.

  • dateTag: (Param[String]) Tag representing dates in the NER entities (default: DATE).

  • dateToYear: (BooleanParam) True if dates must be converted to years, false otherwise.

  • days: (IntParam) Number of days to obfuscate the dates by displacement.

  • fixedMaskLength: (IntParam) Select the fixed mask length: this is the length of the masking sequence that will be used when the ‘fixed_length_chars’ masking policy is selected.

  • ignoreRegex: (BooleanParam) Select if you want to use the regex file loaded in the model.

  • isRandomDateDisplacement: (BooleanParam) Whether to use a random number of displacement days for date entities; the random number is based on DeIdentificationParams.seed. If true, a random displacement is used; if false, DeIdentificationParams.days is used. The default value is false.

  • language: (Param[String]) The language used to select the regex file and some faker entities: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French) or ‘ro’ (Romanian).

  • mappingsColumn: (Param[String]) The mapping column that will return the annotation chunks together with the fake entities.

  • maskingPolicy: (Param[String]) Select the masking policy. same_length_chars: replace the masked entity with a sequence of asterisks surrounded by square brackets, with the same total length as the original sequence, e.g. Smith -> [***]; if the entity is shorter than 3 characters (like Jo, or 5), asterisks without brackets are returned. entity_labels: replace the values with the corresponding entity labels. fixed_length_chars: replace the masked entity with a masking sequence composed of a fixed number of asterisks.

  • minYear: (IntParam) Minimum year to use when converting date to year

  • mode: (Param[String]) Mode for the anonymizer: ‘mask’ or ‘obfuscate’.

  • obfuscateDate: (BooleanParam) When mode == ”obfuscate”, whether to obfuscate dates or not.

  • obfuscateRefFile: (Param[String]) File with the terms to be used for obfuscation.

  • obfuscateRefSource: (Param[String]) The source of the fake terms used to obfuscate the entities (‘file’, ‘faker’ or ‘both’). This does not apply to date entities.

  • outputAsDocument: (BooleanParam) Whether to return all sentences joined into a single document

  • refFileFormat: (Param[String]) Format of the reference file for obfuscation; the default value is “csv”.

  • refSep: (Param[String]) Separator character for the CSV reference file for obfuscation; the default value is “#”.

  • regexOverride: (BooleanParam) If true, prioritize the regex entities; if false, prioritize the NER entities.

  • regexPatternsDictionary: (ExternalResourceParam) Dictionary with regular expression patterns that match some protected entity. If the dictionary is not set, the default regex file is used.

  • region: (Param[String]) Region setting: usa or eu.

  • returnEntityMappings: (BooleanParam) Whether to return the mapping column.

  • sameEntityThreshold: (DoubleParam) Similarity threshold [0.0-1.0] to consider two appearances of an entity as the same (default: 0.9). This does not apply to date entities.

  • sameLengthFormattedEntities: (StringArrayParam) List of formatted entities to generate outputs of the same length as the original ones during obfuscation.

  • seed: (IntParam) The seed used to select the entities in obfuscate mode. With the seed you can replay an execution several times with the same output.

  • selectiveObfuscationModesPath: (Param[String]) Path to the JSON dictionary that contains the selective obfuscation modes.

  • unnormalizedDateMode: (Param[String]) The mode to use if the date is not formatted.

  • zipCodeTag: (Param[String]) Tag representing zip codes in the obfuscate reference file (default: ZIP).

  • metadataMaskingPolicy: (Param[String]) Options: ‘entity_labels’, ‘same_length_chars’, ‘fixed_length_chars’. If set, metadata includes the masked form of the document.

To create a configured DeIdentificationModel, please see the example of DeIdentification.

Input Annotator Types: DOCUMENT, TOKEN, CHUNK

Output Annotator Type: DOCUMENT

Python API: DeIdentificationModel Scala API: DeIdentificationModel Notebook: DeIdentificationModelNotebook
Show Example

from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_sensitive_entities = medical.NerModel \
    .pretrained("ner_deid_enriched", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

nerConverter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

deIdentification = medical.DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models") \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy","yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("both")

data = spark.createDataFrame([
    ["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
    ]).toDF("text")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    clinical_sensitive_entities,
    nerConverter,
    deIdentification
])

result = pipeline.fit(data).transform(data)
result.select(F.expr("sentence.result as Input") ,F.expr("dei.result as deidentified")).show(truncate=100)
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|                                                                                Input|                                                                            deidentified|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|[# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09.]|[# 1610960 Date : 01/18/93 PCP : Vida Rigger , 27 years-old , Record date : 2079-11-14.]|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, medical, finance, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = legal.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert_embeddings")

fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    #.setLabelCasing("upper")

ner_converter =  medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "COMPANY"}) # Replace "ORG" entity as "COMPANY"

ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "bert_embeddings"]) \
    .setOutputCol("ner_finner")
    #.setLabelCasing("upper")

ner_converter_finner = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") 
    # .setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER

chunk_merge =  medical.ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

deidentification =  finance.DeIdentification() \
    .setInputCols(["sentence", "token", "deid_merged_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setIgnoreRegex(True)

# Pipeline
nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge,
      deidentification])

data = spark.createDataFrame([
    ["Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon"]
]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
result.select("sentence.result", "deidentified.result").show(truncate = False)

+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|result                                                                                         |result                                                                     |
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|[Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon]|[<PERSON>, <DATE>, is an American entrepreneur, <ROLE> and <ROLE> of <ORG>]|
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = legal.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
    #.setLabelCasing("upper")

ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"})

ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_signers")
    #.setLabelCasing("upper")

ner_converter_signers = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_signers"]) \
    .setOutputCol("ner_signer_chunk")

chunk_merge = medical.ChunkMergeApproach()\
    .setInputCols("ner_signer_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

deidentification = legal.DeIdentification() \
    .setInputCols(["sentence", "token", "deid_merged_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setIgnoreRegex(True)

# Pipeline
nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge,
      deidentification])

data = spark.createDataFrame([["ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby,\
 supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter. THE MART.COM, INC.:                         I-ESCROW, INC.: By:Dominic J. Magliarditi               \
  By:Sanjay Bajaj Name: Dominic J. Magliarditi                Name: Sanjay Bajaj Title: President                            Title: VP Business Development Date: 6/21/2023 "]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
result.select("sentence.result", "deidentified.result").toPandas()

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence                                                                                                                                                                                                                                |deidentified                                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ENTIRE AGREEMENT.                                                                                                                                                                                                                       |<DOC>.                                                                                                                                                                                                                                  |
|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|
|THE MART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi                 By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/2023                           |<PARTY>.: <PARTY>.: By:Dominic <SIGNING_PERSON>                 By:Sanjay <SIGNING_PERSON> Name: <SIGNING_PERSON> Name: <SIGNING_PERSON> Title: <SIGNING_TITLE> Title: <SIGNING_TITLE> Date: 6/21/2023                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setUseAbbreviations(true)

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val clinicalSensitiveEntities = MedicalNerModel.pretrained("ner_deid_enriched", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val deIdentification = DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models")
  .setInputCols(Array("ner_chunk", "token", "sentence"))
  .setOutputCol("dei")
  .setMode("obfuscate")
  .setDateFormats(Array("MM/dd/yy", "yyyy-MM-dd"))
  .setObfuscateDate(true)
  .setDateTag("DATE")
  .setDays(5)
  .setObfuscateRefSource("both")

val data = Seq(
  "# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."
).toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  clinicalSensitiveEntities,
  nerConverter,
  deIdentification
))

val result = pipeline.fit(data).transform(data)

+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|                                                                                Input|                                                                            deidentified|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|[# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09.]|[# 1610960 Date : 01/18/93 PCP : Vida Rigger , 27 years-old , Record date : 2079-11-14.]|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val bertEmbeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("bert_embeddings")

val finNer = FinanceNerModel.pretrained("finner_deid", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setReplaceLabels(Map("ORG" -> "COMPANY"))

val nerFinner = FinanceNerModel.pretrained("finner_org_per_role_date", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "bert_embeddings"))
  .setOutputCol("ner_finner")

val nerConverterFinner = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_finner"))
  .setOutputCol("ner_finner_chunk")

val chunkMerge = new ChunkMergeApproach()
  .setInputCols(Array("ner_finner_chunk", "ner_chunk"))
  .setOutputCol("deid_merged_chunk")

val deidentification = new DeIdentification()
  .setInputCols(Array("sentence", "token", "deid_merged_chunk"))
  .setOutputCol("deidentified")
  .setMode("mask")
  .setIgnoreRegex(true)

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  bertEmbeddings,
  finNer,
  nerConverter,
  nerFinner,
  nerConverterFinner,
  chunkMerge,
  deidentification
))

val data = Seq(
  "Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon"
).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|result                                                                                         |result                                                                     |
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|[Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon]|[<PERSON>, <DATE>, is an American entrepreneur, <ROLE> and <ROLE> of <ORG>]|
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val legalNer = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
  .setLabelCasing("upper")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setReplaceLabels(Map("ALIAS" -> "PARTY"))

val nerSigners = LegalNerModel.pretrained("legner_signers", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_signers")
  .setLabelCasing("upper")

val nerConverterSigners = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner_signers"))
  .setOutputCol("ner_signer_chunk")

val chunkMerge = new ChunkMergeApproach()
  .setInputCols(Array("ner_signer_chunk", "ner_chunk"))
  .setOutputCol("deid_merged_chunk")

val deidentification = new DeIdentification()
  .setInputCols(Array("sentence", "token", "deid_merged_chunk"))
  .setOutputCol("deidentified")
  .setMode("mask")
  .setIgnoreRegex(true)

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  legalNer,
  nerConverter,
  nerSigners,
  nerConverterSigners,
  chunkMerge,
  deidentification
))

val data = Seq(
  "ENTIRE AGREEMENT. This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter. THE MART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/2023"
).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence                                                                                                                                                                                                                                |deidentified                                                                                                                                                                                                                            |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ENTIRE AGREEMENT.                                                                                                                                                                                                                       |<DOC>.                                                                                                                                                                                                                                  |
|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|
|THE MART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi                 By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/2023                           |<PARTY>.: <PARTY>.: By:Dominic <SIGNING_PERSON>                 By:Sanjay <SIGNING_PERSON> Name: <SIGNING_PERSON> Name: <SIGNING_PERSON> Title: <SIGNING_TITLE> Title: <SIGNING_TITLE> Date: 6/21/2023                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Contains all the methods for training a DeIdentificationModel. This module can obfuscate or mask the entities that contain personal information. These can be set with a file of regex patterns with setRegexPatternsDictionary, where each line is a mapping of entity to regex.

DATE \d{4}
AID \d{6,7}

Additionally, obfuscation strings can be defined with setObfuscateRefFile, where each line is a mapping of string to entity. The format and separator can be specified with setRefFileFormat and setRefSep.

Dr. Gregory House#DOCTOR
01010101#MEDICALRECORD

Ideally this annotator works in conjunction with demographic Named Entity Recognizers that can be trained using TextMatchers, RegexMatchers, DateMatchers, NerCRFs or NerDLs.

Input Annotator Types: DOCUMENT, TOKEN, CHUNK

Output Annotator Type: DOCUMENT

Python API: DeIdentification Scala API: DeIdentification
Show Example

from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets)
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

#deid model with "entity_labels"
deid_entity_labels= medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deid_entity_label")\
    .setMode("mask")\
    .setReturnEntityMappings(True)\
    .setMaskingPolicy("entity_labels")

obs_lines = """Marvin MARSHALL#PATIENT
Hubert GROGAN#PATIENT
ALTHEA COLBURN#PATIENT
Kalil AMIN#PATIENT
Inci FOUNTAIN#PATIENT
Ekaterina Rosa#DOCTOR
Rudiger Chao#DOCTOR
COLLETTE KOHLER#NAME
Mufi HIGGS#NAME"""

with open ('obfuscation.txt', 'w') as f:
  f.write(obs_lines)

obfuscation = medical.DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefFile('obfuscation.txt')\
    .setObfuscateRefSource("both")\
    .setGenderAwareness(True)\
    .setLanguage("en")\
    .setUnnormalizedDateMode("obfuscate")
    # obfuscateRefSource: other options are "file" or "faker"
    # unnormalizedDateMode: other options are "mask" or "skip"

deidPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      deid_entity_labels,
      obfuscation
      ])


empty_data = spark.createDataFrame([[""]]).toDF("text")


model = deidPipeline.fit(empty_data)

#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.deid_entity_label.result,
                                     result.deidentified.result,
                                     )).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"),
              F.expr("cols['1']").alias("deid_entity_label"),
              F.expr("cols['2']").alias("deidentified"),
              ).toPandas()

+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
|                                                               sentence|                                      deid_entity_label|                                                     deidentified|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
|                          Record date : 2093-01-13 , David Hale , M.D .|                  Record date : <DATE> , <NAME> , M.D .|                  Record date : 2093-01-25 , Daryl Dieter , M.D .|
|              , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .|            , Name : <NAME> , MR # <ID> Date : <DATE> .|         , Name : Langston Papas , MR # 4784828 Date : 01/25/93 .|
|             PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .|PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|PCP : Roseann Lederer , 23 years-old , Record date : 2079-11-21 .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .|            <LOCATION> , <LOCATION> , Phone <CONTACT> .|    31 North St Joseph Ave , 400 Tickle St , Phone (59) 106-048 .|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Ner entities
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

nerConverter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

# Deidentification
# setRegexPatternsDictionary: file with custom regex patterns for custom entities
# setObfuscateRefFile: file with custom obfuscator names for the entities
deIdentification = finance.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
    .setRefFileFormat("csv") \
    .setRefSep("#") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy","yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("file")

# Pipeline
pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    nerConverter,
    deIdentification
])
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Ner entities
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

nerConverter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_con")

# Deidentification
# setRegexPatternsDictionary: file with custom regex patterns for custom entities
# setObfuscateRefFile: file with custom obfuscator names for the entities
deIdentification = legal.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
    .setRefFileFormat("csv") \
    .setRefSep("#") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("file")

# Pipeline
pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    nerConverter,
    deIdentification
])

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Sentence Detector annotator, processes various sentences per line
val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

// Clinical word embeddings trained on PubMED dataset
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

// NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets)
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

//deid model with "entity_labels"
val deid_entity_labels= new DeIdentification()
    .setInputCols(Array("ner_chunk", "token", "sentence"))
    .setOutputCol("deid_entity_label")
    .setMode("mask")
    .setReturnEntityMappings(true)
    .setMaskingPolicy("entity_labels")
    
//
val obs_lines = """Marvin MARSHALL#PATIENT
Hubert GROGAN#PATIENT
ALTHEA COLBURN#PATIENT
Kalil AMIN#PATIENT
Inci FOUNTAIN#PATIENT
Ekaterina Rosa#DOCTOR
Rudiger Chao#DOCTOR
COLLETTE KOHLER#NAME
Mufi HIGGS#NAME"""
//

val obfuscation =  new DeIdentification()
    .setInputCols(Array("ner_chunk", "token", "sentence"))
    .setOutputCol("deidentified")
    .setMode("obfuscate")
    .setObfuscateDate(true)
    .setObfuscateRefFile("obfuscation.txt")
    .setObfuscateRefSource("both")       //file or faker  
    .setGenderAwareness(true)
    .setLanguage("en")
    .setUnnormalizedDateMode("obfuscate") //mask or skip


val deidPipeline = new Pipeline().setStages(Array(
                                                  documentAssembler,
                                                  sentenceDetector,
                                                  tokenizer,
                                                  word_embeddings,
                                                  clinical_ner,
                                                  ner_converter,
                                                  deid_entity_labels,
                                                  obfuscation
                                                ))

//sample data

val text =
          """
          Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old ,
          Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
          """

val data = Seq(text).toDF("text")

val result = deidPipeline.fit(data).transform(data)

+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
|                                                               sentence|                                      deid_entity_label|                                                     deidentified|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
|                          Record date : 2093-01-13 , David Hale , M.D .|                  Record date : <DATE> , <NAME> , M.D .|                  Record date : 2093-01-25 , Daryl Dieter , M.D .|
|              , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .|            , Name : <NAME> , MR # <ID> Date : <DATE> .|         , Name : Langston Papas , MR # 4784828 Date : 01/25/93 .|
|             PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .|PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|PCP : Roseann Lederer , 23 years-old , Record date : 2079-11-21 .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .|            <LOCATION> , <LOCATION> , Phone <CONTACT> .|    31 North St Joseph Ave , 400 Tickle St , Phone (59) 106-048 .|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
     .setInputCol("text")
     .setOutputCol("document")

 val sentenceDetector = new SentenceDetector()
     .setInputCols("document")
     .setOutputCol("sentence")
     .setUseAbbreviations(true)

 val tokenizer = new Tokenizer()
     .setInputCols("sentence")
     .setOutputCol("token")

 val embeddings = WordEmbeddingsModel
     .pretrained("embeddings_clinical", "en", "clinical/models")
     .setInputCols(Array("sentence", "token"))
     .setOutputCol("embeddings")

// Ner entities
val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
     .setInputCols(Array("sentence", "token", "embeddings"))
     .setOutputCol("ner")

 val nerConverter = new NerConverter()
     .setInputCols(Array("sentence", "token", "ner"))
     .setOutputCol("ner_con")

// Deidentification
val deIdentification = new DeIdentification()
     .setInputCols(Array("ner_chunk", "token", "sentence"))
     .setOutputCol("dei")
     // file with custom regex patterns for custom entities
     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
     // file with custom obfuscator names for the entities
     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
     .setRefFileFormat("csv")
     .setRefSep("#")
     .setMode("obfuscate")
     .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
     .setObfuscateDate(true)
     .setDateTag("DATE")
     .setDays(5)
     .setObfuscateRefSource("file")

// Pipeline
val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  nerConverter,
  deIdentification
))
import spark.implicits._

val documentAssembler = new DocumentAssembler()
     .setInputCol("text")
     .setOutputCol("document")

 val sentenceDetector = new SentenceDetector()
     .setInputCols("document")
     .setOutputCol("sentence")
     .setUseAbbreviations(true)

 val tokenizer = new Tokenizer()
     .setInputCols("sentence")
     .setOutputCol("token")

 val embeddings = WordEmbeddingsModel
     .pretrained("embeddings_clinical", "en", "clinical/models")
     .setInputCols(Array("sentence", "token"))
     .setOutputCol("embeddings")

// Ner entities
val ner_model = LegalNerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

 val nerConverter = new NerConverter()
     .setInputCols(Array("sentence", "token", "ner"))
     .setOutputCol("ner_con")

// Deidentification
val deIdentification = new DeIdentification()
     .setInputCols(Array("ner_chunk", "token", "sentence"))
     .setOutputCol("dei")
     // file with custom regex patterns for custom entities
     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
     // file with custom obfuscator names for the entities
     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
     .setRefFileFormat("csv")
     .setRefSep("#")
     .setMode("obfuscate")
     .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
     .setObfuscateDate(true)
     .setDateTag("DATE")
     .setDays(5)
     .setObfuscateRefSource("file")

// Pipeline

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  nerConverter,
  deIdentification
))

DistilBertForSequenceClassification

Model

DistilBertForSequenceClassification can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

Parameters:

  • batchSize: Size of every batch. Default: 8

  • coalesceSentences: Instead of one class per sentence (when inputCols is "sentence"), output one class per document by averaging the probabilities of all sentences; see the sketch below. Default: False

  • maxSentenceLength: Max sentence length to process. Default: 128

  • caseSensitive: Whether to ignore case in tokens for embeddings matching. Default: True
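
A minimal sketch (not part of the official example) showing sentence-level input with one prediction per document. It assumes the setter follows the library's camel-case convention for the coalesceSentences parameter, i.e. setCoalesceSentences:

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

sequenceClassifier = medical.DistilBertForSequenceClassification \
    .pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("classes") \
    .setCoalesceSentences(True)  # assumed setter name: average sentence probabilities into one class per document

pipeline = nlp.Pipeline(stages=[document_assembler, sentence_detector, tokenizer, sequenceClassifier])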

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: DistilBertForSequenceClassification Scala API: DistilBertForSequenceClassification
Show Example

from johnsnowlabs import nlp, medical
 
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = medical.DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("classes")

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame([["I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."],
                              ["Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("text", "classes.result").show(truncate=100)

| text                                                                                           | result |
|------------------------------------------------------------------------------------------------|-------|
| I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numb... | [True] |
| Religare Capital Ranbaxy has been accepting approval for Diovan since 2012 | [False] |

import spark.implicits._

val document_assembler = new DocumentAssembler() 
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val sequenceClassifier = MedicalDistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("classes")

val pipeline =  new Pipeline().setStages(Array(
    document_assembler, 
    tokenizer, 
    sequenceClassifier))

val data = Seq(
    "I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication.",
    "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"
).toDF("text")
val result = pipeline.fit(data).transform(data)

| text                                                                                           | result |
|------------------------------------------------------------------------------------------------|-------|
| I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numb... | [True] |
| Religare Capital Ranbaxy has been accepting approval for Diovan since 2012 | [False] |

Doc2ChunkInternal

Model

Converts DOCUMENT, TOKEN typed annotations into CHUNK type with the contents of a chunkCol. Chunk text must be contained within input DOCUMENT. May be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.

  • outputCol: The name of the CHUNK type column that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().

For more extended examples on document pre-processing see the Spark NLP Workshop.
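
A minimal sketch (illustrative data only) of the StringType variant, where the chunk column holds a single string per row and setIsArray is therefore left at False:

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

chunkAssembler = medical.Doc2ChunkInternal()\
    .setInputCols("document", "token")\
    .setChunkCol("target")\
    .setOutputCol("chunk")\
    .setIsArray(False)  # "target" is a plain String column here

data = spark.createDataFrame(
    [["Spark NLP is an open-source text processing library.", "Spark NLP"]]
).toDF("text", "target")

result = nlp.Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler]).fit(data).transform(data)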

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Python API: Doc2ChunkInternal Scala API: Doc2ChunkInternal Notebook: Doc2ChunkInternalNotebook
Show Example

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

chunkAssembler = medical.Doc2ChunkInternal()\
    .setInputCols("document", "token")\
    .setChunkCol("target")\
    .setOutputCol("chunk")\
    .setIsArray(True)

pipeline = nlp.Pipeline().setStages([
                                    documentAssembler,
                                    tokenizer, 
                                    chunkAssembler])

data = spark.createDataFrame(
    [
        [
            "Spark NLP is an open-source text processing library for advanced natural language processing.",
            ["Spark NLP", "text processing library", "natural language processing"],
        ]
    ]
).toDF("text", "target")


result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

chunkAssembler = finance.Doc2ChunkInternal()\
    .setInputCols("document", "token")\
    .setChunkCol("target")\
    .setOutputCol("chunk")\
    .setIsArray(True)

pipeline = nlp.Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler])

data = spark.createDataFrame(
    [
        [
            "Spark NLP is an open-source text processing library for advanced natural language processing.",
            ["Spark NLP", "text processing library", "natural language processing"],
        ]
    ]
).toDF("text", "target")


result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

chunkAssembler = legal.Doc2ChunkInternal()\
    .setInputCols("document", "token")\
    .setChunkCol("target")\
    .setOutputCol("chunk")\
    .setIsArray(True)

pipeline = nlp.Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler])

data = spark.createDataFrame(
    [
        [
            "Spark NLP is an open-source text processing library for advanced natural language processing.",
            ["Spark NLP", "text processing library", "natural language processing"],
        ]
    ]
).toDF("text", "target")


result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val chunkAssembler = new Doc2ChunkInternal()
    .setInputCols(Array("document", "token"))
    .setChunkCol("target")
    .setOutputCol("chunk")
    .setIsArray(true)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    tokenizer, 
    chunkAssembler
))

val data = Seq(("Spark NLP is an open-source text processing library for advanced natural language processing.",
               "Spark NLP", "text processing library", "natural language processing")).toDF("text", "target")

val result = pipeline.fit(data).transform(data)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val chunkAssembler = new Doc2ChunkInternal()
    .setInputCols(Array("document", "token"))
    .setChunkCol("target")
    .setOutputCol("chunk")
    .setIsArray(true)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    tokenizer, 
    chunkAssembler
))

val data = Seq(("Spark NLP is an open-source text processing library for advanced natural language processing.",
               "Spark NLP", "text processing library", "natural language processing")).toDF("text", "target")

val result = pipeline.fit(data).transform(data)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val chunkAssembler = new Doc2ChunkInternal()
    .setInputCols(Array("document", "token"))
    .setChunkCol("target")
    .setOutputCol("chunk")
    .setIsArray(true)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    tokenizer, 
    chunkAssembler
))

val data = Seq(("Spark NLP is an open-source text processing library for advanced natural language processing.",
               "Spark NLP", "text processing library", "natural language processing")).toDF("text", "target")

val result = pipeline.fit(data).transform(data)

+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+

DocMapper

Model

DocMapper uses the text representation of document annotations to map clinical codes to other codes or relevant information.

Parameters:

  • setRels (List[str]): Relations that we are going to use to map the document

  • setLowerCase (Boolean): Set if we want to map the documents in lower case or not (Default: True)

  • setAllowMultiTokenChunk (Boolean): Whether to skip relations with multitokens (Default: True)

  • setMultivaluesRelations (Boolean): Whether to decide to return all values in a relation together or separately (Default: False)

  • setDoExceptionHandling: If set to True, the annotator tries to process as usual and, if exception-causing data (e.g. a corrupted record/document) is passed to the annotator, an exception warning is emitted which has the exception message.

Input Annotator Types: DOCUMENT

Output Annotator Type: LABEL_DEPENDENCY

Python API: DocMapperModel Scala API: DocMapperModel Notebook: DocMapperModelNotebook
Show Example

from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F

#ChunkMapper Pipeline
document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

#drug_action_treatment_mapper 
docMapper = medical.DocMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mappings")\
    .setRels(["action", "treatment"])

mapperPipeline = nlp.Pipeline().setStages([
    document_assembler,
    docMapper])

test_data = spark.createDataFrame([["Dermovate"], ["Aspagin"]]).toDF("text")

res = mapperPipeline.fit(test_data).transform(test_data)

# Show results
res.select(F.explode(F.arrays_zip(res.mappings.result, 
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("ner_chunk"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk|mapping_result        |relation |all_mappings                                                                                                                                                                                                           |
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Dermovate|anti-inflammatory     |action   |corticosteroids::: dermatological preparations:::very strong                                                                                                                                                           |
|Dermovate|lupus                 |treatment|discoid lupus erythematosus:::empeines:::psoriasis:::eczema                                                                                                                                                            |
|Aspagin  |analgesic             |action   |anti-inflammatory:::antipyretic                                                                                                                                                                                        |
|Aspagin  |ankylosing spondylitis|treatment|arthralgia:::pain:::bursitis:::headache:::migraine:::myositis:::neuralgia:::osteoarthritis:::gout:::rheumatoid arthritis:::spondylitis:::spondyloarthritis:::tendinitis:::tenosynovitis:::crush injury:::golfer's elbow|
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

// ChunkMapper Pipeline
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// drug_action_treatment_mapper 
val docMapper = DocMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("mappings")
    .setRels(Array("action", "treatment"))

val mapperPipeline = new Pipeline().setStages(Array(
    document_assembler,
    docMapper))


val test_data = Seq("Dermovate", "Aspagin").toDF("text")

val res = mapperPipeline.fit(test_data).transform(test_data)

// Show results

+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk|mapping_result        |relation |all_mappings                                                                                                                                                                                                           |
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Dermovate|anti-inflammatory     |action   |corticosteroids::: dermatological preparations:::very strong                                                                                                                                                           |
|Dermovate|lupus                 |treatment|discoid lupus erythematosus:::empeines:::psoriasis:::eczema                                                                                                                                                            |
|Aspagin  |analgesic             |action   |anti-inflammatory:::antipyretic                                                                                                                                                                                        |
|Aspagin  |ankylosing spondylitis|treatment|arthralgia:::pain:::bursitis:::headache:::migraine:::myositis:::neuralgia:::osteoarthritis:::gout:::rheumatoid arthritis:::spondylitis:::spondyloarthritis:::tendinitis:::tenosynovitis:::crush injury:::golfer's elbow|
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Approach

DocMapper can be used to map short strings via DocumentAssembler, without needing any other annotator in between to convert the strings to the CHUNK type that ChunkMapperModel expects.

Parameters:

  • setDictionary (Str): Path to the JSON dictionary that contains the mapping columns

  • setRels (List[str]): Relations that we are going to use to map the document

  • setLowerCase (Boolean): Set if we want to map the documents in lower case or not (Default: True)

  • setAllowMultiTokenChunk (Boolean): Whether to skip relations with multitokens (Default: True)

  • setMultivaluesRelations (Boolean): Whether to decide to return all values in a relation together or separately (Default: False)

  • setDoExceptionHandling: If set to True, the annotator tries to process as usual and, if exception-causing data (e.g. a corrupted record/document) is passed to the annotator, an exception warning is emitted which has the exception message.

Input Annotator Types: DOCUMENT

Output Annotator Type: LABEL_DEPENDENCY

Python API: DocMapperApproach Scala API: DocMapperApproach Notebook: DocMapperApproachNotebook
Show Example

from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F

data_set= {
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }
      ]
    }
  ]
}

import json
with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)

document_assembler = nlp.DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

chunkerMapper = medical.DocMapperApproach()\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setDictionary("./sample_drug.json")\
      .setRels(["action"])

pipeline = nlp.Pipeline().setStages([document_assembler,
                                     chunkerMapper])

test_data = spark.createDataFrame([["metformin"]]).toDF("text")

res = pipeline.fit(test_data).transform(test_data)


## Results
res.select(F.explode(F.arrays_zip(res.mappings.result,
                                  res.mappings.metadata)).alias("col"))\
    .select(F.expr("col['1']['entity']").alias("document"),
            F.expr("col['0']").alias("mapping_result"),
            F.expr("col['1']['relation']").alias("relation"),
            F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)

+---------+--------------+--------+----------------------+
|document |mapping_result|relation|all_mappings          |
+---------+--------------+--------+----------------------+
|metformin|hypoglycemic  |action  |Drugs Used In Diabetes|
+---------+--------------+--------+----------------------+

import spark.implicits._

/* sample_drug.json file
{
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }
      ]
    }
  ]
}
*/

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document") 

val chunkerMapper = new DocMapperApproach()
  .setInputCols("document")
  .setOutputCol("mappings")
  .setDictionary("./sample_drug.json")
  .setRels("action")

val pipeline = new Pipeline().setStages(Array(document_assembler, chunkerMapper))

val test_data = Seq("metformin").toDF("text") 

val res = pipeline.fit(test_data).transform(test_data)


// Results 

+---------+--------------+--------+----------------------+
|document |mapping_result|relation|all_mappings          |
+---------+--------------+--------+----------------------+
|metformin|hypoglycemic  |action  |Drugs Used In Diabetes|
+---------+--------------+--------+----------------------+

DocumentFiltererByClassifier

Model

The DocumentFiltererByClassifier function is designed to filter documents based on the outcomes generated by classifier annotators. It operates using a white list and a black list: the white list comprises classifier results that are allowed to pass through the filter, while the black list includes results that are blocked. The filter is case-sensitive by default; by setting caseSensitive to False it becomes case-insensitive, allowing a broader range of matches based on the specified criteria. This annotator serves as an effective tool for systematically sorting and managing documents based on specific classifier outcomes, facilitating streamlined document handling and organization.

Parameters:

  • whiteList: (list) If defined, list of entities to process. The rest will be ignored.

  • caseSensitive: (bool) Determines whether the definitions of the white-listed entities are case sensitive.

Input Annotator Types: DOCUMENT, CATEGORY

Output Annotator Type: DOCUMENT

Notebook: DocumentFiltererByClassifierNotebook
Show Example

example = """Medical Specialty:
Cardiovascular / Pulmonary

Sample Name: Aortic Valve Replacement

Description: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
(Medical Transcription Sample Report)

DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with congestive heart failure. The patient has diabetes and is morbidly obese.

PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.

ANESTHESIA: General endotracheal

INCISION: Median sternotomy

INDICATIONS: The patient presented with severe congestive heart failure associated with the patient's severe diabetes. The patient was found to have moderately stenotic aortic valve. In addition, The patient had significant coronary artery disease consisting of a chronically occluded right coronary artery but a very important large obtuse marginal artery coming off as the main circumflex system. The patient also has a left anterior descending artery which has moderate disease and this supplies quite a bit of collateral to the patient's right system. It was decided to perform a valve replacement as well as coronary artery bypass grafting procedure.

FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve leaflet is calcified and a severe restrictive leaflet motion. It is a tricuspid type of valve. The coronary artery consists of a large left anterior descending artery which is associated with 60% stenosis but a large obtuse marginal artery which has a tight proximal stenosis.

The radial artery was used for the left anterior descending artery. Flow was excellent. Looking at the targets in the posterior descending artery territory, there did not appear to be any large branches. On the angiogram these vessels appeared to be quite small. Because this is a chronically occluded vessel and the patient has limited conduit due to the patient's massive obesity, attempt to bypass to this area was not undertaken. The patient was brought to the operating room

PROCEDURE: The patient was brought to the operating room and placed in supine position. A median sternotomy incision was carried out and conduits were taken from the left arm as well as the right thigh. The patient weighs nearly three hundred pounds. There was concern as to taking down the left internal mammary artery. Because the radial artery appeared to be a good conduit The patient would have arterial graft to the left anterior descending artery territory. The patient was cannulated after the aorta and atrium were exposed and full heparinization.

The patient went on cardiopulmonary bypass and the aortic cross-clamp was applied Cardioplegia was delivered through the coronary sinuses in a retrograde manner. The patient was cooled to 32 degrees. Iced slush was applied to the heart. The aortic valve was then exposed through the aortic root by transverse incision. The valve leaflets were removed and the #23 St. Jude mechanical valve was secured into position by circumferential pledgeted sutures. At this point, aortotomy was closed.

The first obtuse marginal artery was a very large target and the vein graft to this target indeed produced an excellent amount of flow. Proximal anastomosis was then carried out to the foot of the aorta. The left anterior descending artery does not have severe disease but is also a very good target and the radial artery was anastomosed to this target in an end-to-side manner. The two proximal anastomoses were then carried out to the root of the aorta.

The patient came off cardiopulmonary bypass after aortic cross-clamp was released. The patient was adequately warmed. Protamine was given without adverse effect. Sternal closure was then done using wires. The subcutaneous layers were closed using Vicryl suture. The skin was approximated using staples.
"""

df = spark.createDataFrame([[example]]).toDF("text")

from johnsnowlabs import nlp, medical 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")\

document_splitter = medical.InternalDocumentSplitter() \
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("recursive")\
    .setChunkSize(100)\
    .setChunkOverlap(3)\
    .setExplodeSplits(True)\
    .setPatternsAreRegex(False)\
    .setSplitPatterns(["\n\n", "\n"])\
    .setKeepSeparators(False)\
    .setTrimWhitespace(True)
    #.setEnableSentenceIncrement(False)

sequenceClassifier = medical.BertForSequenceClassification\
    .pretrained('bert_sequence_classifier_clinical_sections', 'en', 'clinical/models')\
    .setInputCols(["splits", "token"])\
    .setOutputCol("prediction")\
    .setCaseSensitive(False)

document_filterer = medical.DocumentFiltererByClassifier()\
    .setInputCols(["splits", "prediction"])\
    .setOutputCol("filteredDocuments")\
    .setWhiteList(["Diagnostic and Laboratory Data"])\
    .setCaseSensitive(False)


pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    document_splitter,
    sequenceClassifier,
    #document_filterer
])

result = pipeline.fit(df).transform(df)

# before filterer result

result.selectExpr("splits.result[0] as splits",
                  "prediction.result[0] as classes"
                  ).show(truncate=80)

+--------------------------------------------------------------------------------+------------------------------+
|                                                                          splits|                       classes|
+--------------------------------------------------------------------------------+------------------------------+
|Medical Specialty:\nCardiovascular / Pulmonary\n\nSample Name: Aortic Valve R...|                       History|
|Description: Aortic valve replacement using a mechanical valve and two-vessel...|Complications and Risk Factors|
|                                           (Medical Transcription Sample Report)|Complications and Risk Factors|
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel ...|                    Procedures|
|                 ANESTHESIA: General endotracheal\n\nINCISION: Median sternotomy|                    Procedures|
|INDICATIONS: The patient presented with severe congestive heart failure assoc...|     Consultation and Referral|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|PROCEDURE: The patient was brought to the operating room and placed in supine...|                    Procedures|
|The patient went on cardiopulmonary bypass and the aortic cross-clamp was app...|                    Procedures|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
|The patient came off cardiopulmonary bypass after aortic cross-clamp was rele...|                    Procedures|
+--------------------------------------------------------------------------------+------------------------------+


# after filterer result

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    tokenizer,
    document_splitter,
    sequenceClassifier,
    document_filterer
])

result = pipeline.fit(df).transform(df)
from pyspark.sql.functions import col
result.selectExpr("filteredDocuments.result[0] as splits",
                  "filteredDocuments.metadata[0].class_label as classes")\
                  .filter(col("classes").isNotNull()).show(truncate=80)

+--------------------------------------------------------------------------------+------------------------------+
|                                                                          splits|                       classes|
+--------------------------------------------------------------------------------+------------------------------+
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
+--------------------------------------------------------------------------------+------------------------------+

import spark.implicits._
 
val example = "Medical Specialty:
Cardiovascular / Pulmonary
Sample Name: Aortic Valve Replacement
Description: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
(Medical Transcription Sample Report)
DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with congestive heart failure. The patient has diabetes and is morbidly obese.
PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
ANESTHESIA: General endotracheal
INCISION: Median sternotomy
INDICATIONS: The patient presented with severe congestive heart failure associated with the patient's severe diabetes. The patient was found to have moderately stenotic aortic valve. In addition, The patient had significant coronary artery disease consisting of a chronically occluded right coronary artery but a very important large obtuse marginal artery coming off as the main circumflex system. The patient also has a left anterior descending artery which has moderate disease and this supplies quite a bit of collateral to the patient's right system. It was decided to perform a valve replacement as well as coronary artery bypass grafting procedure.
FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve leaflet is calcified and a severe restrictive leaflet motion. It is a tricuspid type of valve. The coronary artery consists of a large left anterior descending artery which is associated with 60% stenosis but a large obtuse marginal artery which has a tight proximal stenosis.
The radial artery was used for the left anterior descending artery. Flow was excellent. Looking at the targets in the posterior descending artery territory, there did not appear to be any large branches. On the angiogram these vessels appeared to be quite small. Because this is a chronically occluded vessel and the patient has limited conduit due to the patient's massive obesity, attempt to bypass to this area was not undertaken. The patient was brought to the operating room
PROCEDURE: The patient was brought to the operating room and placed in supine position. A median sternotomy incision was carried out and conduits were taken from the left arm as well as the right thigh. The patient weighs nearly three hundred pounds. There was concern as to taking down the left internal mammary artery. Because the radial artery appeared to be a good conduit The patient would have arterial graft to the left anterior descending artery territory. The patient was cannulated after the aorta and atrium were exposed and full heparinization.
The patient went on cardiopulmonary bypass and the aortic cross-clamp was applied Cardioplegia was delivered through the coronary sinuses in a retrograde manner. The patient was cooled to 32 degrees. Iced slush was applied to the heart. The aortic valve was then exposed through the aortic root by transverse incision. The valve leaflets were removed and the #23 St. Jude mechanical valve was secured into position by circumferential pledgeted sutures. At this point, aortotomy was closed.
The first obtuse marginal artery was a very large target and the vein graft to this target indeed produced an excellent amount of flow. Proximal anastomosis was then carried out to the foot of the aorta. The left anterior descending artery does not have severe disease but is also a very good target and the radial artery was anastomosed to this target in an end-to-side manner. The two proximal anastomoses were then carried out to the root of the aorta.
The patient came off cardiopulmonary bypass after aortic cross-clamp was released. The patient was adequately warmed. Protamine was given without adverse effect. Sternal closure was then done using wires. The subcutaneous layers were closed using Vicryl suture. The skin was approximated using staples.
"

val df = Seq(example).toDF("text") 

val document_assembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("document")) 
  .setOutputCol("token")

val document_splitter = new InternalDocumentSplitter()
  .setInputCols("document") 
  .setOutputCol("splits") 
  .setSplitMode("recursive") 
  .setChunkSize(100) 
  .setChunkOverlap(3) 
  .setExplodeSplits(true) 
  .setPatternsAreRegex(false) 
  .setSplitPatterns(Array("\n\n", "\n"))
  .setKeepSeparators(false) 
  .setTrimWhitespace(true) 
  //.setEnableSentenceIncrement(false) 

val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_clinical_sections","en","clinical/models") 
  .setInputCols(Array("splits","token")) 
  .setOutputCol("prediction") 
  .setCaseSensitive(false) 

val document_filterer = new DocumentFiltererByClassifier()
  .setInputCols(Array("splits","prediction")) 
  .setOutputCol("filteredDocuments") 
  .setWhiteList(Array("Diagnostic and Laboratory Data")) 
  .setCaseSensitive(false) 

val pipeline = new Pipeline().setStages(Array(
                                              document_assembler,
                                              tokenizer,
                                              document_splitter,
                                              sequenceClassifier
                                              //document_filterer
))

val result = pipeline.fit(df).transform(df) 

// before filterer result 

+--------------------------------------------------------------------------------+------------------------------+
|                                                                          splits|                       classes|
+--------------------------------------------------------------------------------+------------------------------+
|Medical Specialty:\nCardiovascular / Pulmonary\n\nSample Name: Aortic Valve R...|                       History|
|Description: Aortic valve replacement using a mechanical valve and two-vessel...|Complications and Risk Factors|
|                                           (Medical Transcription Sample Report)|Complications and Risk Factors|
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel ...|                    Procedures|
|                 ANESTHESIA: General endotracheal\n\nINCISION: Median sternotomy|                    Procedures|
|INDICATIONS: The patient presented with severe congestive heart failure assoc...|     Consultation and Referral|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|PROCEDURE: The patient was brought to the operating room and placed in supine...|                    Procedures|
|The patient went on cardiopulmonary bypass and the aortic cross-clamp was app...|                    Procedures|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
|The patient came off cardiopulmonary bypass after aortic cross-clamp was rele...|                    Procedures|
+--------------------------------------------------------------------------------+------------------------------+


// after filterer result

val pipeline = new Pipeline().setStages(Array( 
                                              document_assembler, 
                                              tokenizer, 
                                              document_splitter, 
                                              sequenceClassifier, 
                                              document_filterer )) 

val result = pipeline.fit(df).transform(df)

+--------------------------------------------------------------------------------+------------------------------+
|                                                                          splits|                       classes|
+--------------------------------------------------------------------------------+------------------------------+
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
+--------------------------------------------------------------------------------+------------------------------+

DocumentHashCoder

Model

This annotator can replace dates in a column of DOCUMENT type according to the hash code of any other column. It uses the hash of the specified column to create a new document column containing the day-shift information. The DeIdentification annotator can then deidentify the document using the shifted date information.

If the specified column contains strings that can be parsed to integers, those numbers are used to shift the dates accordingly.

Parameters:

  • setPatientIdColumn (String): Name of the column containing the patient ID.

  • setDateShiftColumn (String): Sets the column to be used for the hash or a predefined shift (see the sketch after this list).

  • setNewDateShift (String): Sets the name of the column that will hold the newly computed day-shift values.

  • setRangeDays (int): Sets the range of dates to be sampled from.

  • setSeed (int): Sets the seed for random number generator.
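
A minimal sketch (illustrative column names and data) of the predefined-shift path: setDateShiftColumn points at a column of string-encoded integers, which drive the day shift instead of the patient-ID hash:

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")  # "10" and "30" below are parsed to integer day shifts

data = spark.createDataFrame(
    [["Chris Brown was discharged on 10/02/2022", "10"],
     ["John was discharged on 15/03/2022", "30"]]
).toDF("text", "dateshift")

result = nlp.Pipeline().setStages([documentAssembler, documentHasher]).fit(data).transform(data)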

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: DocumentHashCoder Scala API: DocumentHashCoder Notebook: DocumentHashCoderNotebook
Show Example

from johnsnowlabs import nlp, medical
import pandas as pd

data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 
                    'A003', 'A003'],
     'text' : ['Chris Brown was discharged on 10/02/2022', 
               'Mark White was discharged on 10/04/2022', 
               'John was discharged on 15/03/2022',
               'John Moore was discharged on 15/12/2022'
              ],
     'dateshift' : ['10', '10', 
                    '30', '30']
    }
)

my_input_df = spark.createDataFrame(data)

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setNewDateShift("shift_days")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

de_identification = medical.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document2"]) \
    .setOutputCol("deid_text") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")
pipeline_model = pipeline.fit(empty_data)

output = pipeline_model.transform(my_input_df)
output.select('patientID','text', 'deid_text.result').show(truncate = False)

+---------+----------------------------------------+---------------------------------------------+
|patientID|text                                    |result                                       |
+---------+----------------------------------------+---------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Aldona Bar was discharged on 05/18/2022]    |
|A001     |Mark White was discharged on 02/28/2020 |[Leta Speller was discharged on 10/14/2019]  |
|A002     |John was discharged on 03/15/2022       |[Lonia Blood was discharged on 01/19/2022]   |
|A002     |John Moore was discharged on 12/31/2022 |[Murriel Hopper was discharged on 11/06/2022]|
+---------+----------------------------------------+---------------------------------------------+
from johnsnowlabs import nlp, finance
import pandas as pd

data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 
                    'A003', 'A003'],
     'text' : ['Chris Brown was discharged on 10/02/2022', 
               'Mark White was discharged on 10/04/2022', 
               'John was discharged on 15/03/2022',
               'John Moore was discharged on 15/12/2022'
              ],
     'dateshift' : ['10', '10', 
                    '30', '30']
    }
)

my_input_df = spark.createDataFrame(data)

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = finance.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setNewDateShift("shift_days")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = finance.NerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

de_identification = finance.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document2"]) \
    .setOutputCol("deid_text") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")
pipeline_model = pipeline.fit(empty_data)

output = pipeline_model.transform(my_input_df)
output.select('patientID','text', 'deid_text.result').show(truncate = False)

+---------+----------------------------------------+----------------------------------------------+
|patientID|text                                    |result                                        |
+---------+----------------------------------------+----------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001     |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019]      |
|A002     |John was discharged on 03/15/2022       |[Lane Hacker was discharged on 02/17/2022]    |
|A002     |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
from johnsnowlabs import nlp, legal
import pandas as pd

data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 
                    'A002', 'A002'],
     'text' : ['Chris Brown was discharged on 10/02/2022', 
               'Mark White was discharged on 02/28/2020', 
               'John was discharged on 03/15/2022',
               'John Moore was discharged on 12/31/2022'
              ],
     'dateshift' : ['10', '10', 
                    '30', '30']
    }
)

my_input_df = spark.createDataFrame(data)

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = legal.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setNewDateShift("shift_days")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = legal.NerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = legal.NerConverterInternal()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

de_identification = legal.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document2"]) \
    .setOutputCol("deid_text") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)\
    .setRegion('us')

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification

])

empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")
pipeline_model = pipeline.fit(empty_data)

output = pipeline_model.transform(my_input_df)
output.select('patientID','text', 'deid_text.result').show(truncate = False)

+---------+----------------------------------------+----------------------------------------------+
|patientID|text                                    |result                                        |
+---------+----------------------------------------+----------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001     |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019]      |
|A002     |John was discharged on 03/15/2022       |[Lane Hacker was discharged on 02/17/2022]    |
|A002     |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+

import spark.implicits._
  
val data = Seq(
  ("A001", "Chris Brown was discharged on 10/02/2022"),
  ("A001", "Mark White was discharged on 02/28/2020"),
  ("A002", "John was discharged on 03/15/2022"),
  ("A002", "John Moore was discharged on 12/31/2022")
)

val columns = Seq("patientID", "text")
val myInputDF: DataFrame = spark.createDataFrame(data).toDF(columns: _*)


val my_input_df = spark.createDataFrame(data) 

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val documentHasher = new DocumentHashCoder()
    .setInputCols("document") 
    .setOutputCol("document2") 
    .setPatientIdColumn("patientID") 
    .setNewDateShift("shift_days") 

val tokenizer = new Tokenizer()
    .setInputCols("document2") 
    .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("document2","token")) 
    .setOutputCol("word_embeddings") 

val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented","en","clinical/models")
    .setInputCols(Array("document2","token","word_embeddings")) 
    .setOutputCol("ner") 

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("document2","token","ner")) 
    .setOutputCol("ner_chunk") 

val de_identification = new DeIdentification()
    .setInputCols(Array("ner_chunk","token","document2")) 
    .setOutputCol("deid_text") 
    .setMode("obfuscate") 
    .setObfuscateDate(true) 
    .setDateTag("DATE") 
    .setLanguage("en") 
    .setObfuscateRefSource("faker") 
    .setUseShifDays(true) 
    .setRegion("us") 

val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      documentHasher,
      tokenizer,
      embeddings,
      clinical_ner,
      ner_converter,
      de_identification
))

val emptyData = Seq(("", "")).toDF("text", "patientID")

val pipelineModel = pipeline.fit(emptyData)
val result = pipelineModel.transform(myInputDF)
result.select("patientID", "text", "deid_text.result").show(false)

+---------+----------------------------------------+----------------------------------------------+
|patientID|text                                    |result                                        |
+---------+----------------------------------------+----------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001     |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019]      |
|A002     |John was discharged on 03/15/2022       |[Lane Hacker was discharged on 02/17/2022]    |
|A002     |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
import spark.implicits._
  
val data = Seq(
  ("A001", "Chris Brown was discharged on 10/02/2022"),
  ("A001", "Mark White was discharged on 02/28/2020"),
  ("A002", "John was discharged on 03/15/2022"),
  ("A002", "John Moore was discharged on 12/31/2022")
)

val columns = Seq("patientID", "text")
val myInputDF = spark.createDataFrame(data).toDF(columns: _*)

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val documentHasher = new DocumentHashCoder()
    .setInputCols("document") 
    .setOutputCol("document2") 
    .setPatientIdColumn("patientID") 
    .setNewDateShift("shift_days") 

val tokenizer = new Tokenizer()
    .setInputCols("document2")
    .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("document2","token")) 
    .setOutputCol("word_embeddings") 

val clinical_ner = FinanceNerModel.pretrained("ner_deid_subentity_augmented","en","clinical/models")
    .setInputCols(Array("document2","token","word_embeddings")) 
    .setOutputCol("ner") 

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("document2","token","ner")) 
    .setOutputCol("ner_chunk") 

val de_identification = new DeIdentification()
    .setInputCols(Array("ner_chunk","token","document2")) 
    .setOutputCol("deid_text") 
    .setMode("obfuscate") 
    .setObfuscateDate(true) 
    .setDateTag("DATE") 
    .setLanguage("en") 
    .setObfuscateRefSource("faker") 
    .setUseShifDays(true) 
    .setRegion("us") 

val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      documentHasher,
      tokenizer,
      embeddings,
      clinical_ner,
      ner_converter,
      de_identification
))

val emptyData = Seq(("", "")).toDF("text", "patientID")
val pipelineModel = pipeline.fit(emptyData)
val result = pipelineModel.transform(myInputDF)
result.select("patientID", "text", "deid_text.result").show(false)

+---------+----------------------------------------+----------------------------------------------+
|patientID|text                                    |result                                        |
+---------+----------------------------------------+----------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001     |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019]      |
|A002     |John was discharged on 03/15/2022       |[Lane Hacker was discharged on 02/17/2022]    |
|A002     |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
import spark.implicits._
  
val data = Seq(
  ("A001", "Chris Brown was discharged on 10/02/2022"),
  ("A001", "Mark White was discharged on 02/28/2020"),
  ("A002", "John was discharged on 03/15/2022"),
  ("A002", "John Moore was discharged on 12/31/2022")
)

val columns = Seq("patientID", "text")
val myInputDF = spark.createDataFrame(data).toDF(columns: _*)

val documentAssembler = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val documentHasher = new DocumentHashCoder()
    .setInputCols("document") 
    .setOutputCol("document2") 
    .setPatientIdColumn("patientID") 
    .setNewDateShift("shift_days") 

val tokenizer = new Tokenizer()
    .setInputCols("document2")
    .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("document2","token")) 
    .setOutputCol("word_embeddings") 

val clinical_ner = LegalNerModel.pretrained("ner_deid_subentity_augmented","en","clinical/models")
    .setInputCols(Array("document2","token","word_embeddings")) 
    .setOutputCol("ner") 

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("document2","token","ner")) 
    .setOutputCol("ner_chunk") 

val de_identification = new DeIdentification()
    .setInputCols(Array("ner_chunk","token","document2")) 
    .setOutputCol("deid_text") 
    .setMode("obfuscate") 
    .setObfuscateDate(true) 
    .setDateTag("DATE") 
    .setLanguage("en") 
    .setObfuscateRefSource("faker") 
    .setUseShifDays(true) 
    .setRegion("us") 

val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      documentHasher,
      tokenizer,
      embeddings,
      clinical_ner,
      ner_converter,
      de_identification
))

val emptyData = Seq(("", "")).toDF("text", "patientID")

val pipelineModel = pipeline.fit(emptyData)
val result = pipelineModel.transform(myInputDF)
result.select("patientID", "text", "deid_text.result").show(false)

+---------+----------------------------------------+----------------------------------------------+
|patientID|text                                    |result                                        |
+---------+----------------------------------------+----------------------------------------------+
|A001     |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001     |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019]      |
|A002     |John was discharged on 03/15/2022       |[Lane Hacker was discharged on 02/17/2022]    |
|A002     |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+

DocumentLogRegClassifier

Classifies documents with a Logistic Regression algorithm. Currently there are no pretrained models available. Please see DocumentLogRegClassifierApproach to train your own model.

Parameters:

  • setMergeChunks: Sets whether to merge all chunks in a document or not (Default: false).

  • setLabels: Sets array to output the label in the original form.

  • setVectorizationModel: Sets a path to the vectorization model if it has already been trained.

  • setClassificationModel: Sets a path to the classification model if it has already been trained.

Please check the Models Hub for models that may become available in the future.

Input Annotator Types: TOKEN

Output Annotator Type: CATEGORY

Python API: DocumentLogRegClassifierModel Scala API: DocumentLogRegClassifierModel
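Since there are no pretrained models, a DocumentLogRegClassifierModel is obtained by fitting the Approach shown below. Once fitted, the whole pipeline (including the classifier stage) can be persisted and reloaded with the standard Spark ML utilities. A minimal sketch, assuming model is the fitted pipeline produced by the Approach example below and using an illustrative save path:

from pyspark.ml import PipelineModel

# Persist the fitted pipeline that ends with a DocumentLogRegClassifierModel stage
# (the path below is illustrative).
model.write().overwrite().save("/tmp/doc_logreg_classifier_pipeline")

# Reload it later and run inference on new documents.
loaded_model = PipelineModel.load("/tmp/doc_logreg_classifier_pipeline")
new_data = spark.createDataFrame([["Patient was advised to continue the current medication."]]).toDF("text")
loaded_model.transform(new_data).select("prediction.result").show(truncate=False)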

Trains a model to classify documents with a Logistic Regression algorithm. Training data requires columns for the text and its label. The result is a trained DocumentLogRegClassifierModel.

Parameters:

  • maxIter: Maximum number of iterations.

  • tol: Convergence tolerance after each iteration.

  • setLabels: Sets array to output the label in the original form.

  • setVectorizationModel: Sets a path to the vectorization model if it has already been trained.

  • setClassificationModel: Sets a path to the classification model if it has already been trained.

Input Annotator Types: TOKEN

Output Annotator Type: CATEGORY

Python API: DocumentLogRegClassifierApproach Scala API: DocumentLogRegClassifierApproach
Show Example

from johnsnowlabs import nlp, medical
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = nlp.Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

stopwords_cleaner = nlp.StopWordsCleaner() \
  .setInputCols(["normalized"]) \
  .setOutputCol("cleanTokens") \
  .setCaseSensitive(False)

stemmer = nlp.Stemmer() \
  .setInputCols(["cleanTokens"]) \
  .setOutputCol("stem")

# Define the document classifier and fit training data to it
logreg = medical.DocumentLogRegClassifierApproach() \
  .setInputCols(["stem"]) \
  .setLabelCol("category") \
  .setOutputCol("prediction")

pipeline = nlp.Pipeline(stages=[
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
])

model = pipeline.fit(trainingData)
from johnsnowlabs import nlp, finance
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = nlp.Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

stopwords_cleaner = nlp.StopWordsCleaner() \
  .setInputCols(["normalized"]) \
  .setOutputCol("cleanTokens") \
  .setCaseSensitive(False)

stemmer = nlp.Stemmer() \
  .setInputCols(["cleanTokens"]) \
  .setOutputCol("stem")

# Define the document classifier and fit training data to it
logreg = finance.DocumentLogRegClassifierApproach() \
  .setInputCols(["stem"]) \
  .setLabelCol("category") \
  .setOutputCol("prediction")

pipeline = nlp.Pipeline(stages=[
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
])

model = pipeline.fit(trainingData)
from johnsnowlabs import nlp, legal
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = nlp.Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

stopwords_cleaner = nlp.StopWordsCleaner() \
  .setInputCols(["normalized"]) \
  .setOutputCol("cleanTokens") \
  .setCaseSensitive(False)

stemmer = nlp.Stemmer() \
  .setInputCols(["cleanTokens"]) \
  .setOutputCol("stem")

# Define the document classifier and fit training data to it
logreg = legal.DocumentLogRegClassifierApproach() \
  .setInputCols(["stem"]) \
  .setLabelCol("category") \
  .setOutputCol("prediction")

pipeline = nlp.Pipeline(stages=[
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
])

model = pipeline.fit(trainingData)

import spark.implicits._ 
// Define pipeline stages to prepare the data
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwords_cleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val stemmer = new Stemmer()
  .setInputCols("cleanTokens")
  .setOutputCol("stem")

// Define the document classifier and fit training data to it
val logreg = new DocumentLogRegClassifierApproach()
  .setInputCols("stem")
  .setLabelCol("category")
  .setOutputCol("prediction")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
))

val model = pipeline.fit(trainingData)
import spark.implicits._ 

// Define pipeline stages to prepare the data
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwords_cleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val stemmer = new Stemmer()
  .setInputCols("cleanTokens")
  .setOutputCol("stem")

// Define the document classifier and fit training data to it
val logreg = new DocumentLogRegClassifierApproach()
  .setInputCols("stem")
  .setLabelCol("category")
  .setOutputCol("prediction")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
))

val model = pipeline.fit(trainingData)
import spark.implicits._

// Define pipeline stages to prepare the data
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwords_cleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val stemmer = new Stemmer()
  .setInputCols("cleanTokens")
  .setOutputCol("stem")

// Define the document classifier and fit training data to it
val logreg = new DocumentLogRegClassifierApproach()
  .setInputCols("stem")
  .setLabelCol("category")
  .setOutputCol("prediction")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
))

val model = pipeline.fit(trainingData)

DocumentMLClassifier

DocumentMLClassifier classifies documents with a Logistic Regression algorithm.

Input Annotator Types: TOKEN

Output Annotator Type: CATEGORY

Python API: DocumentMLClassifierModel Scala API: DocumentMLClassifierModel Notebook: DocumentMLClassifierModelNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_ml = medical.DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")\
    .setInputCols("token")\
    .setOutputCol("prediction")

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    classifier_ml])

data = spark.createDataFrame([["""I feel great after taking tylenol."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text")

result = clf_Pipeline.fit(data).transform(data)


# Show results
result.select('text','prediction.result').show(truncate=False)

+----------------------------------------------------------------------------------------+-------+
|text                                                                                    |result |
+----------------------------------------------------------------------------------------+-------+
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[False]|
|I feel great after taking tylenol.                                                      |[False]|
+----------------------------------------------------------------------------------------+-------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")
    .setInputCols("token")
    .setOutputCol("prediction")

val clf_Pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    tokenizer, 
    classifier_ml))

val data = Seq(
  "I feel great after taking tylenol.",
  "Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDF("text")
  
val result = clf_Pipeline.fit(data).transform(data)

// Show results
result.select("text", "prediction.result").show(false)

+----------------------------------------------------------------------------------------+-------+
|text                                                                                    |result |
+----------------------------------------------------------------------------------------+-------+
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[False]|
|I feel great after taking tylenol.                                                      |[False]|
+----------------------------------------------------------------------------------------+-------+

Trains a model to classify documents with a Logistic Regression algorithm. Training data requires columns for the text and its label. The result is a trained DocumentMLClassifierModel.

Parameters:

  • labelCol: (str) Sets the column with the label value we are trying to predict.
  • maxIter: (Int) Sets the maximum number of iterations.
  • tol: (float) Sets the convergence tolerance after each iteration.
  • fitIntercept: (Boolean) Sets whether to fit an intercept term. Default is True.
  • vectorizationModelPath: (str) Sets a path to the vectorization model if it has already been trained.
  • classificationModelPath: (str) Sets a path to the classification model if it has already been trained.
  • classificationModelClass: (str) Sets the classification model class from Spark ML to use; possible values are: logreg, svm.
  • minTokenNgram: (int) Sets the minimum number of tokens for n-grams.
  • maxTokenNgram: (int) Sets the maximum number of tokens for n-grams.
  • mergeChunks: (Boolean) Sets whether to merge all chunks in a document or not (Default: False).

Input Annotator Types: TOKEN

Output Annotator Type: CATEGORY

Python API: DocumentMLClassifierApproach Scala API: DocumentMLClassifierApproach Notebook: DocumentMLClassifierApproachNotebook
Show Example

from johnsnowlabs import nlp, medical 

document = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

token = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

classifier_logreg = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("logreg")\
    .setFitIntercept(True)

pipeline = nlp.Pipeline(stages=[
    document, 
    token, 
    classifier_logreg])

result_logreg = pipeline.fit(train_data).transform(test_data).cache()
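The remaining parameters are set the same way. A minimal sketch of an SVM variant with token n-grams, reusing the document and token stages above; the setter names are assumed to follow the camel-case convention for the parameters listed above, and the values are illustrative rather than recommendations:

# Assumed setters: setClassificationModelClass, setMinTokenNgram, setMaxTokenNgram
classifier_svm = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("svm") \
    .setMinTokenNgram(1) \
    .setMaxTokenNgram(2) \
    .setFitIntercept(True)

pipeline_svm = nlp.Pipeline(stages=[
    document,
    token,
    classifier_svm])

result_svm = pipeline_svm.fit(train_data).transform(test_data).cache()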

import spark.implicits._

val document = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val token = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val classifier_logreg = new DocumentMLClassifierApproach()
    .setInputCols("token")
    .setLabelCol("category")
    .setOutputCol("prediction")
    .setClassificationModelClass("logreg")
    .setFitIntercept(true) 

val pipeline = new Pipeline().setStages(Array(
    document,
    token,
    classifier_logreg)) 

val result_logreg = pipeline.fit(train_data).transform(test_data).cache()

DrugNormalizer

Model

Annotator which normalizes raw text from clinical documents, e.g. scraped web pages or XML documents, from document-type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can apply unwanted character removal with a specific policy. Can apply lowercase normalization.

Parameters:

  • lowercase: (boolean) whether to convert strings to lowercase. Default is False.

  • policy: (str) rule to remove patterns from text. Valid policy values are:

    • “all”,
    • “abbreviations”,
    • “dosages”

See Spark NLP Workshop for more examples of usage.
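The other policy values are applied the same way as “all” in the example below. A minimal sketch; the output column names are illustrative:

from johnsnowlabs import medical

# Normalize abbreviations only.
abbreviation_normalizer = medical.DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized_abbreviations") \
    .setPolicy("abbreviations")

# Normalize dosage expressions only.
dosage_normalizer = medical.DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized_dosages") \
    .setPolicy("dosages")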

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: DrugNormalizer Scala API: DrugNormalizer Notebook: DrugNormalizerNotebook
Show Example

from johnsnowlabs import nlp, medical

# Sample data
data_to_normalize = spark.createDataFrame([
            ("A", "Sodium Chloride/Potassium Chloride 13bag", "Sodium Chloride / Potassium Chloride 13 bag"),
            ("B", "interferon alfa-2b 10 million unit ( 1 ml ) injec", "interferon alfa - 2b 10000000 unt ( 1 ml ) injection"),
            ("C", "aspirin 10 meq/ 5 ml oral sol", "aspirin 2 meq/ml oral solution")
        ]).toDF("cuid", "text", "target_normalized_text")

# Annotator that transforms a text column from dataframe into normalized text (with all policy)

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

drug_normalizer = medical.DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized") \
    .setPolicy("all")

drug_normalizer_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    drug_normalizer
    ])

ds = drug_normalizer_pipeline.fit(data_to_normalize).transform(data_to_normalize)

ds = ds.selectExpr("document", "target_normalized_text", "explode(document_normalized.result) as all_normalized_text")
ds.show(truncate = False)

+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|document                                                                                   |target_normalized_text                              |all_normalized_text                                 |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|[{document, 0, 39, Sodium Chloride/Potassium Chloride 13bag, {sentence -> 0}, []}]         |Sodium Chloride / Potassium Chloride 13 bag         |Sodium Chloride / Potassium Chloride 13 bag         |
|[{document, 0, 48, interferon alfa-2b 10 million unit ( 1 ml ) injec, {sentence -> 0}, []}]|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|[{document, 0, 28, aspirin 10 meq/ 5 ml oral sol, {sentence -> 0}, []}]                    |aspirin 2 meq/ml oral solution                      |aspirin 2 meq/ml oral solution                      |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+

import spark.implicits._

// Sample data 
val data_to_normalize = Seq(
  ("A", "Sodium Chloride/Potassium Chloride 13bag", "Sodium Chloride / Potassium Chloride 13 bag"),
  ("B", "interferon alfa-2b 10 million unit ( 1 ml ) injec", "interferon alfa - 2b 10000000 unt ( 1 ml ) injection"),
  ("C", "aspirin 10 meq/ 5 ml oral sol", "aspirin 2 meq/ml oral solution")
).toDF("cuid", "text", "target_normalized_text")

// Annotator that transforms a text column from dataframe into normalized text (with all policy) 

val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val drug_normalizer = new DrugNormalizer()
 .setInputCols("document") 
 .setOutputCol("document_normalized") 
 .setPolicy("all") 

val drug_normalizer_pipeline = new Pipeline().setStages(Array(
  document_assembler, 
  drug_normalizer)) 

val ds = drug_normalizer_pipeline.fit(data_to_normalize).transform(data_to_normalize) 

ds.selectExpr("document", "target_normalized_text", "explode(document_normalized.result) as all_normalized_text")
  .show(false)

+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|document                                                                                   |target_normalized_text                              |all_normalized_text                                 |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|[{document, 0, 39, Sodium Chloride/Potassium Chloride 13bag, {sentence -> 0}, []}]         |Sodium Chloride / Potassium Chloride 13 bag         |Sodium Chloride / Potassium Chloride 13 bag         |
|[{document, 0, 48, interferon alfa-2b 10 million unit ( 1 ml ) injec, {sentence -> 0}, []}]|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|[{document, 0, 28, aspirin 10 meq/ 5 ml oral sol, {sentence -> 0}, []}]                    |aspirin 2 meq/ml oral solution                      |aspirin 2 meq/ml oral solution                      |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+

EntityChunkEmbeddings

Model

Weighted average embeddings of multiple named entities chunk annotations.

Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector representation of related entity chunks. The input to the model consists of chunks of recognized named entities. One or more entities are selected as target entities, and for each of them a list of related entities is specified (if empty, all other entities are assumed to be related).

The model looks for chunks of the target entities and then tries to pair each target entity (e.g. DRUG) with other related entities (e.g. DOSAGE, STRENGTH, FORM, etc). The criterion for pairing a target entity with another related entity is that they appear in the same sentence and the maximal syntactic distance is below a predefined threshold.

The relationship between target and related entities is one-to-many, meaning that if there are multiple instances of the same target entity (e.g. DRUG) within a sentence, the model will map a related entity (e.g. DOSAGE) to at most one of the instances of the target entity. For example, in the sentence “The patient was given 125 mg of paracetamol and metformin”, the model will pair “125 mg” with “paracetamol”, but not with “metformin”.

The output of the model is the average embedding of the chunks of each target entity and its related entities. It is possible to specify a particular weight for each entity type.

An entity can be defined both as a target entity and as a related entity of some other target entity. For example, we may want to compute the embeddings of SYMPTOMs and their related entities, as well as the embeddings of DRUGs and their related entities, one of which is also SYMPTOM. In such cases, it is possible to use the TARGET_ENTITY:RELATED_ENTITY notation to specify the weight of a related entity (e.g. “DRUG:SYMPTOM” to set the weight of SYMPTOM when it appears as a related entity of the target entity DRUG). The relative weights of entities for particular entity chunk embeddings are available in the annotations metadata.

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model.

Parameters:

  • targetEntities: (dict) The target entities mapped to lists of their related entities. A target entity with an empty list of related entities means all other entities are assumed to be related to it. Entity names are case insensitive. Mandatory to set at least one entity

  • entityWeights: (dict) The relative weights of drug related entities. If not set, all entities have equal weights. If the list is non-empty and some entity is not in it, then its weight is set to 0. The notation TARGET_ENTITY:RELATED_ENTITY can be used to specify the weight of an entity which is related to a specific target entity (e.g. “DRUG:SYMPTOM” -> 0.3f). Entity names are case insensitive. See the sketch after this parameter list.

  • maxSyntacticDistance: (Int) Maximal syntactic distance between the drug entity and the other drug related entities. Default value is 2.

The default model is "sbiobert_base_cased_mli" from clinical/models. Other available models can be found at Models Hub.
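A minimal sketch of how the target entities, entity weights (including the TARGET_ENTITY:RELATED_ENTITY notation), and the syntactic distance threshold could be configured on the pretrained model; the setter names are assumed to follow the camel-case convention for the parameters above, and the labels and weight values are purely illustrative:

from johnsnowlabs import medical

# Assumed setters: setEntityWeights, setMaxSyntacticDistance (camel-case of the parameters above)
drug_chunk_embeddings = medical.EntityChunkEmbeddings().pretrained("sbiobert_base_cased_mli", "en", "clinical/models") \
    .setInputCols(["ner_chunk", "dependencies"]) \
    .setOutputCol("drug_chunk_embeddings") \
    .setTargetEntities({"DRUG": ["STRENGTH", "FORM", "SYMPTOM"]}) \
    .setEntityWeights({"DRUG": 0.8, "DRUG:STRENGTH": 0.2, "DRUG:SYMPTOM": 0.2}) \
    .setMaxSyntacticDistance(2)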

Input Annotator Types: DEPENDENCY, CHUNK

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: EntityChunkEmbeddingsModel Scala API: EntityChunkEmbeddingsModel Notebook: EntityChunkEmbeddingsModelNotebook
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentence_detector =  nlp.SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")\

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

posology_ner_model = medical.NerModel().pretrained("ner_posology_large", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("ner_chunk")

pos_tager = nlp.PerceptronModel().pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("pos_tag")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tag", "token"])\
    .setOutputCol("dependencies")

entity_chunk_embeddings = medical.EntityChunkEmbeddings().pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("drug_chunk_embeddings")

entity_chunk_embeddings.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})

rxnorm_re = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models")\
    .setInputCols(["drug_chunk_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")

rxnorm_pipeline_re = nlp.Pipeline(
    stages=[
        documenter,
        sentence_detector,
        tokenizer,
        embeddings,
        posology_ner_model,
        ner_converter,
        pos_tager,
        dependency_parser,
        entity_chunk_embeddings,
        rxnorm_re,
    ]
)

rxnorm_model = rxnorm_pipeline_re.fit(spark.createDataFrame([[""]]).toDF("text"))

data_df = spark.createDataFrame(
    [
        [
            "The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen."
        ],
        [
            "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet"
        ],
    ]
).toDF("text")

results = rxnorm_model.transform(data_df)
results.select("drug_chunk_embeddings.result", "drug_chunk_embeddings.embeddings").show(truncate=200)

+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                              result|                                                                                                                                                                                              embeddings|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|               [metformin 500 mg tablet, 2.5 mg coumadin, ibuprofen]|[[0.13060866, 0.26946265, -0.50702775, 0.7724293, 0.7356907, 0.0962475, -0.5546377, 0.0534295, -0.55345106, 0.48484787, -0.35735086, 0.49109104, 0.84404886, 0.30384326, -0.9923568, -0.24454081, 0.3...|
|[metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet]|[[-0.177948, 0.25489503, -0.5724586, 0.8031439, 0.9211674, 0.3558219, -0.37258363, -0.194855, -0.7407244, 0.48175216, 0.040639203, 0.6822441, 0.5768623, -0.19830275, -1.1513872, -0.32279214, 0.6181...|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documenter = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
    .setInputCols("document") 
    .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
    .setInputCols("sentence") 
    .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") 
    .setInputCols(Array("sentence","token")) 
    .setOutputCol("embeddings") 

val posology_ner_model = MedicalNerModel.pretrained("ner_posology_large","en","clinical/models") 
    .setInputCols(Array("sentence","token","embeddings")) 
    .setOutputCol("ner") 

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner")) 
    .setOutputCol("ner_chunk") 

val pos_tager = PerceptronModel.pretrained("pos_clinical","en","clinical/models") 
    .setInputCols(Array("sentence","token")) 
    .setOutputCol("pos_tag") 

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu","en") 
    .setInputCols(Array("sentence","pos_tag","token")) 
    .setOutputCol("dependencies") 

val entity_chunk_embeddings = EntityChunkEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") 
    .setInputCols(Array("ner_chunk","dependencies")) 
    .setOutputCol("drug_chunk_embeddings") 

entity_chunk_embeddings.setTargetEntities(Map("DRUG" -> Array("STRENGTH","ROUTE","FORM")))

val rxnorm_re = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_re","en","clinical/models")
    .setInputCols("drug_chunk_embeddings")
    .setOutputCol("rxnorm_code") 
    .setDistanceFunction("EUCLIDEAN") 

val rxnorm_pipeline_re = new Pipeline().setStages(Array( 
    documenter, 
    sentence_detector, 
    tokenizer, 
    embeddings, 
    posology_ner_model, 
    ner_converter, 
    pos_tager, 
    dependency_parser, 
    entity_chunk_embeddings,
    rxnorm_re)) 

val data_df = Seq(
  "The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen.",
  "The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet"
).toDF("text")

val results = rxnorm_pipeline_re.fit(data_df).transform(data_df)
results.select("drug_chunk_embeddings.result", "drug_chunk_embeddings.embeddings").show(20, 200)


+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                              result|                                                                                                                                                                                              embeddings|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|               [metformin 500 mg tablet, 2.5 mg coumadin, ibuprofen]|[[0.13060866, 0.26946265, -0.50702775, 0.7724293, 0.7356907, 0.0962475, -0.5546377, 0.0534295, -0.55345106, 0.48484787, -0.35735086, 0.49109104, 0.84404886, 0.30384326, -0.9923568, -0.24454081, 0.3...|
|[metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet]|[[-0.177948, 0.25489503, -0.5724586, 0.8031439, 0.9211674, 0.3558219, -0.37258363, -0.194855, -0.7407244, 0.48175216, 0.040639203, 0.6822441, 0.5768623, -0.19830275, -1.1513872, -0.32279214, 0.6181...|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

EntityRulerInternal

This annotator matches exact strings or regex patterns provided in a file against a Document and assigns them a named entity. The definitions can contain any number of named entities.

Parameters:

  • setPatternsResource (str): Sets the resource in JSON or CSV format to map entities to patterns. Arguments: path (str): path to the resource; read_as (str, optional): how to interpret the resource, by default ReadAs.TEXT; options (dict, optional): options for parsing the resource, by default {“format”: “JSON”}.

  • setSentenceMatch (Boolean): Sets whether to find matches at sentence level (True) or token level (False).

  • setAlphabetResource (str): Sets the alphabet resource (a simple plain-text file with all language characters).

  • setUseStorage (Boolean): Sets whether to use RocksDB storage to serialize patterns.

See Spark NLP Workshop for more examples of usage.
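The example below reads its patterns from a file named entities.json, which is not shown on this page. A minimal sketch of what such a file could contain, assuming the JSON patterns format used by the Spark NLP EntityRuler (a list of entries with a label, an optional id, and the patterns to match); the labels and patterns are illustrative:

import json

# Illustrative patterns; the labels match the output shown in the example below.
entities = [
    {"label": "Drug", "id": "drug-entities", "patterns": ["aspirin", "paracetamol", "ibuprofen", "lansoprazole"]},
    {"label": "Disease", "id": "disease-entities", "patterns": ["heart condition", "tonsilitis", "GORD"]},
    {"label": "Symptom", "id": "symptom-entities", "patterns": ["fever", "headache", "inflammation"]},
]

with open("entities.json", "w") as f:
    json.dump(entities, f)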

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: CHUNK

Show Example

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

entityRuler = medical.EntityRulerInternalApproach()\
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("entities.json")\
    .setCaseSensitive(False)

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    tokenizer,
    entityRuler
])

data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

# Result
+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val entityRuler = new EntityRulerInternalApproach()
  .setInputCols(Array("document", "token"))
  .setOutputCol("entities")
  .setPatternsResource("entities.json")
  .setCaseSensitive(false)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  entityRuler
))

val data = Seq(
  ("""John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsillitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.""")
).toDF("text")

val model = pipeline.fit(data)
val result = model.transform(data)

// Result
+---------------+-----+---+-------+
|          chunk|begin|end|  label|
+---------------+-----+---+-------+
|        aspirin|   25| 31|   Drug|
|heart condition|   41| 55|Disease|
|    paracetamol|   69| 79|   Drug|
|          fever|   89| 93|Symptom|
|       headache|   99|106|Symptom|
|     tonsilitis|  129|138|Disease|
|      ibuprofen|  141|149|   Drug|
|    lansoprazol|  177|187|   Drug|
|           GORD|  198|201|Disease|
+---------------+-----+---+-------+

EntityRulerInternal will handle the chunk output based on the patterns defined, as shown in the examples above. We can also define an id field to identify entities, as in the patterns sketch above.

Parameters:

  • setPatternsResource (str): Sets the resource in JSON or CSV format to map entities to patterns. Arguments: path (str): path to the resource; read_as (str, optional): how to interpret the resource, by default ReadAs.TEXT; options (dict, optional): options for parsing the resource, by default {“format”: “JSON”}.

  • setSentenceMatch (Boolean): Sets whether to find matches at sentence level (True) or token level (False).

  • setAlphabetResource (str): Sets the alphabet resource (a simple plain-text file with all language characters).

  • setUseStorage (Boolean): Sets whether to use RocksDB storage to serialize patterns.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

FeaturesAssembler

Model

The FeaturesAssembler is used to collect features from different columns. It can collect features from single-value columns (anything which can be cast to a float; if the cast fails, the value is set to 0), array columns, or Spark NLP annotations (if the annotation is an embedding, it takes the embedding, otherwise it tries to cast the result field). The output of the transformer is a FEATURE_VECTOR annotation (the numeric vector is in the embeddings field).

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read either a String column name or an Array of strings (column names).

  • outputCol: The name of the output column that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
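Since plain numeric columns are also accepted, the assembler can be pointed directly at DataFrame columns. A minimal sketch with made-up column names and values:

from johnsnowlabs import nlp, medical

# Illustrative numeric columns; anything castable to float can be collected.
vitals_df = spark.createDataFrame(
    [(63.0, 27.1, 140.0), (54.0, 22.4, 118.0)],
    ["age", "bmi", "systolic_bp"])

features_asm = medical.FeaturesAssembler() \
    .setInputCols(["age", "bmi", "systolic_bp"]) \
    .setOutputCol("features")

pipeline = nlp.Pipeline(stages=[features_asm])
features_df = pipeline.fit(vitals_df).transform(vitals_df)
features_df.select("features").show(truncate=False)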

Input Annotator Types: NONE

Output Annotator Type: FEATURE_VECTOR

Python API: FeaturesAssembler Scala API: FeaturesAssembler Notebook: FeatureAssemblerNotebook
Show Example

from johnsnowlabs import medical, nlp 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

embeddings_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
    ])

data_df = spark.createDataFrame(
    [
        [
            "PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active flare of Inflammatory Bowel Disease, not responsive to conventional therapy including sulfasalazine, cortisone, local therapy. PROCEDURE: Informed consent was obtained prior to the procedure with special attention to benefits, risks, alternatives. Risks explained as bleeding, infection, bowel perforation, aspiration pneumonia, or reaction to the medications. Vital signs were monitored by blood pressure, heart rate, and oxygen saturation. Supplemental O2 given. Specifics discussed. Preprocedure physical exam performed. Stable vital signs. Lungs clear. Cardiac exam showed regular rhythm. Abdomen soft. Her past history, her past workup, her past visitation with me for Inflammatory Bowel Disease, well responsive to sulfasalazine reviewed. She currently has a flare and is not responding, therefore, likely may require steroid taper. At the same token, her symptoms are mild. She has rectal bleeding, essentially only some rusty stools. There is not significant diarrhea, just some lower stools. No significant pain. Therefore, it is possible that we are just dealing with a hemorrhoidal bleed, therefore, colonoscopy now needed. Past history reviewed. Specifics of workup, need for followup, and similar discussed. All questions answered. A normal digital rectal examination was performed. The PCF-160 AL was inserted into the anus and advanced to the cecum without difficulty, as identified by the ileocecal valve, cecal stump, and appendical orifice. All mucosal aspects thoroughly inspected, including a retroflexed examination. Withdrawal time was greater than six minutes. Unfortunately, the terminal ileum could not be intubated despite multiple attempts. Findings were those of a normal cecum, right colon, transverse colon, descending colon. A small cecal polyp was noted, this was biopsy-removed, placed in bottle #1. Random biopsies from the cecum obtained, bottle #2; random biopsies from the transverse colon obtained, as well as descending colon obtained, bottle #3. There was an area of inflammation in the proximal sigmoid colon, which was biopsied, placed in bottle #4. There was an area of relative sparing, with normal sigmoid lining, placed in bottle #5, randomly biopsied, and then inflammation again in the distal sigmoid colon and rectum biopsied, bottle #6, suggesting that we may be dealing with Crohn disease, given the relative sparing of the sigmoid colon and junk lesion. Retroflexed showed hemorrhoidal disease. Scope was then withdrawn, patient left in good condition. IMPRESSION: Active flare of Inflammatory Bowel Disease, question of Crohn disease. PLAN: I will have the patient follow up with me, will follow up on histology, follow up on the polyps. She will be put on a steroid taper and make an appointment and hopefully steroids alone will do the job. If not, she may be started on immune suppressive medication, such as azathioprine, or similar. All of this has been reviewed with the patient. All questions answered."
        ],
    ]
).toDF("text")

result = embeddings_pipeline.fit(data_df).transform(data_df)
result.select("features").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.00896873, 0.011731416, 0.12154201, 0.1149235, -0.14689414, 0.0103584975, 0.053073216, -0.056412186, -0.05143186, -0.0118978135, -0.12175384, -0.035894137, 0.11812756, 0.094671555, 0.15838866, 0.15260744, -0.004094441, -0.13675772, -0.07472433, -0.035856977, -0.026730005, -0.21840473, 0.029632289, -0.011515695, -0.20407394, -0.07848257, 0.040990185, 0.23028605, 0.077140555, 0.066990435, 0.015219222, -0.10295644, 0.038072545, 0.10786369, 0.121525764, -0.09569349, -0.06309264, 0.2778952, 0.06462455, -0.10851931, -0.14370486, -0.1466352, 0.08354363, -0.078758985, -0.08377953, 0.12384644, -0.23281692, -0.25607574, 0.16399069, -0.07780675, -0.18302177, -0.18325584, -0.12128636, -0.0010129504, 0.0070792097, 0.20506753, 0.034964647, 0.058425985, 0.19572404, -0.103953235, -0.20159312, -0.099047214, -0.07337802, -0.03713124, -0.055443633, 0.11107734, 0.048563413, -0.038048305, -0.020617828, 0.17082842, 0.069010496, 0.08457101, -0.038229663, 0.073144384, -0.092326105, -0.10054428, -4.3286112E-4, -0.046703782, -0.080231875, 0.02524295, 0.01368699, -0.19783853, -0.03501917, 0.13324805, 0.09053264, -0.0958231, -0.0032442473, 0.19218525, -0.027179888, 0.030672349, 0.12848215, -0.014700146, -0.089054875, 0.13839856, -0.15778734, 0.07103226, -0.060303356, 0.20854644, -0.008389737, -0.1473986]}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained()\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

features_asm = finance.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

embeddings_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
    ])

data_df = spark.createDataFrame(
    [
        [
            "Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."
        ],
    ]
).toDF("text")

result = embeddings_pipeline.fit(data_df).transform(data_df)
result.select("features").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.05989722, 0.10907035, 0.25595385, -0.21656203, 0.20777024, -0.17276664, -0.045803867, -0.14506632, -0.16928527, -0.10008922, 0.18800992, -0.36529806, 0.22592439, -0.118487455, 0.006129823, -0.2674002, 0.37149927, 0.12375746, -0.30488327, 0.2507765, -0.060471725, -0.22705032, 0.39436466, 0.40368417, 0.15569581, 0.083455965, 0.11193783, -0.2783573, -0.23566169, -0.12444999, 0.22503565, 0.43343276, -0.3165808, -0.057086047, 0.050554093, 0.3512633, 0.17572127, 0.19258633, -0.09170296, -0.25344467, 0.018219033, -0.117947415, -0.03234701, -0.1549039, -0.0147800855, 0.076972865, 0.08612865, -0.14120182, -0.18348631, -0.4500436, 0.038739346, 0.12991442, -0.032128494, 0.7483725, -0.09843177, -1.6700389, 0.0060545397, -0.1044135, 1.2469376, 0.32064447, -0.17263599, 0.31999183, 0.0077194544, 0.15370668, 0.59472036, -0.16953614, 0.3042488, 0.25355336, 0.60402286, 0.07441569, -0.12468894, 0.03140718, -0.2630037, -0.37703836, 0.034783553, -0.058904923, 0.022686867, 0.07962498, -0.7945683, -0.21051218, 0.6615892, -0.18747853, -0.25412843, 0.26003888, -1.0803214, -0.026889319, -0.11805089, -0.14200646, -0.019682527, -0.2372327, 0.0090960255, -0.071929, -0.115089305, 0.21781716, -0.3569975, 0.07799677, -0.096894525, -0.34368798, 0.66465, 0.14913023]}]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained()\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

features_asm = legal.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

embeddings_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
    ])

data_df = spark.createDataFrame(
    [
        [
            "This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."
        ],
    ]
).toDF("text")

result = embeddings_pipeline.fit(data_df).transform(data_df)
result.select("features").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [0.02474357, -0.08310143, 0.4801927, -0.070223466, 0.33147717, -0.18737249, -0.048361354, -0.052325998, 0.053252153, -0.0067390013, 0.2836935, -0.25569317, 0.3415577, -0.19251995, 0.051623292, -0.25131556, 0.3472208, -0.036604006, -0.35653928, 0.13225944, 0.18795085, -0.09561886, 0.4695179, 0.22093144, 0.32058474, 0.057281215, 0.082858086, -0.3714214, -0.19219379, -0.26751986, -0.148075, 0.6410107, -0.07821157, -0.06398429, 6.32831E-5, 0.21222909, 0.33145514, 0.2575328, 0.009346781, -0.21482512, -0.22197871, -0.14005142, 0.04592571, -0.2919176, 0.011854073, -0.14047821, 0.22201888, -0.13500921, -0.101019345, -0.31175214, -0.0031539474, 0.07841865, 0.23760447, 0.8622971, -0.21095662, -1.9944092, -0.090888076, -0.45743433, 1.5815442, 0.4848822, -0.12528154, 0.33802572, -0.16203907, -0.09874586, 0.63106954, -0.21860953, 0.39005432, 0.25023165, 0.66769457, -0.13867687, 0.02832079, -0.17432508, -0.05764636, -0.44529453, 0.032839067, -0.2266792, -0.002856281, 0.007823931, -1.0165309, 0.08553613, 0.38090998, 0.011592574, -0.18031952, 0.37968582, -0.77948713, -0.068393, -0.029594865, -0.2165647, 0.1665183, -0.23963346, -0.017649503, -0.24768801, -0.2725593, 0.14533372, -0.36786577, 0.23388086, -0.20129707, -0.33582142, 0.5970527, 0.12596472]}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer() 
    .setInputCols("document")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("word_embeddings")

val sentence_embeddings = new SentenceEmbeddings() 
    .setInputCols(Array("document", "word_embeddings")) 
    .setOutputCol("sentence_embeddings") 
    .setPoolingStrategy("AVERAGE")

val features_asm = new FeaturesAssembler()
    .setInputCols("sentence_embeddings")
    .setOutputCol("features")

val nlpPipeline = new Pipeline().setStages(Array(
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
))

val data = Seq(
  "PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active flare of Inflammatory Bowel Disease, not responsive to conventional therapy including sulfasalazine, cortisone, local therapy. PROCEDURE: Informed consent was obtained prior to the procedure with special attention to benefits, risks, alternatives. Risks explained as bleeding, infection, bowel perforation, aspiration pneumonia, or reaction to the medications. Vital signs were monitored by blood pressure, heart rate, and oxygen saturation. Supplemental O2 given. Specifics discussed. Preprocedure physical exam performed. Stable vital signs. Lungs clear. Cardiac exam showed regular rhythm. Abdomen soft. Her past history, her past workup, her past visitation with me for Inflammatory Bowel Disease, well responsive to sulfasalazine reviewed. She currently has a flare and is not responding, therefore, likely may require steroid taper. At the same token, her symptoms are mild. She has rectal bleeding, essentially only some rusty stools. There is not significant diarrhea, just some lower stools. No significant pain. Therefore, it is possible that we are just dealing with a hemorrhoidal bleed, therefore, colonoscopy now needed. Past history reviewed. Specifics of workup, need for followup, and similar discussed. All questions answered. A normal digital rectal examination was performed. The PCF-160 AL was inserted into the anus and advanced to the cecum without difficulty, as identified by the ileocecal valve, cecal stump, and appendical orifice. All mucosal aspects thoroughly inspected, including a retroflexed examination. Withdrawal time was greater than six minutes. Unfortunately, the terminal ileum could not be intubated despite multiple attempts. Findings were those of a normal cecum, right colon, transverse colon, descending colon. A small cecal polyp was noted, this was biopsy-removed, placed in bottle #1. Random biopsies from the cecum obtained, bottle #2; random biopsies from the transverse colon obtained, as well as descending colon obtained, bottle #3. There was an area of inflammation in the proximal sigmoid colon, which was biopsied, placed in bottle #4. There was an area of relative sparing, with normal sigmoid lining, placed in bottle #5, randomly biopsied, and then inflammation again in the distal sigmoid colon and rectum biopsied, bottle #6, suggesting that we may be dealing with Crohn disease, given the relative sparing of the sigmoid colon and junk lesion. Retroflexed showed hemorrhoidal disease. Scope was then withdrawn, patient left in good condition. IMPRESSION: Active flare of Inflammatory Bowel Disease, question of Crohn disease. PLAN: I will have the patient follow up with me, will follow up on histology, follow up on the polyps. She will be put on a steroid taper and make an appointment and hopefully steroids alone will do the job. If not, she may be started on immune suppressive medication, such as azathioprine, or similar. All of this has been reviewed with the patient. All questions answered."
).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.00896873, 0.011731416, 0.12154201, 0.1149235, -0.14689414, 0.0103584975, 0.053073216, -0.056412186, -0.05143186, -0.0118978135, -0.12175384, -0.035894137, 0.11812756, 0.094671555, 0.15838866, 0.15260744, -0.004094441, -0.13675772, -0.07472433, -0.035856977, -0.026730005, -0.21840473, 0.029632289, -0.011515695, -0.20407394, -0.07848257, 0.040990185, 0.23028605, 0.077140555, 0.066990435, 0.015219222, -0.10295644, 0.038072545, 0.10786369, 0.121525764, -0.09569349, -0.06309264, 0.2778952, 0.06462455, -0.10851931, -0.14370486, -0.1466352, 0.08354363, -0.078758985, -0.08377953, 0.12384644, -0.23281692, -0.25607574, 0.16399069, -0.07780675, -0.18302177, -0.18325584, -0.12128636, -0.0010129504, 0.0070792097, 0.20506753, 0.034964647, 0.058425985, 0.19572404, -0.103953235, -0.20159312, -0.099047214, -0.07337802, -0.03713124, -0.055443633, 0.11107734, 0.048563413, -0.038048305, -0.020617828, 0.17082842, 0.069010496, 0.08457101, -0.038229663, 0.073144384, -0.092326105, -0.10054428, -4.3286112E-4, -0.046703782, -0.080231875, 0.02524295, 0.01368699, -0.19783853, -0.03501917, 0.13324805, 0.09053264, -0.0958231, -0.0032442473, 0.19218525, -0.027179888, 0.030672349, 0.12848215, -0.014700146, -0.089054875, 0.13839856, -0.15778734, 0.07103226, -0.060303356, 0.20854644, -0.008389737, -0.1473986]}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer() 
    .setInputCols("document")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained()
    .setInputCols(Array("document","token"))
    .setOutputCol("word_embeddings")

val sentence_embeddings = new SentenceEmbeddings() 
    .setInputCols(Array("document", "word_embeddings")) 
    .setOutputCol("sentence_embeddings") 
    .setPoolingStrategy("AVERAGE")

val features_asm = new FeaturesAssembler()
    .setInputCols("sentence_embeddings")
    .setOutputCol("features")

val nlpPipeline = new Pipeline().setStages(Array(
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
))

val data = Seq(
  "Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."
).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.05989722, 0.10907035, 0.25595385, -0.21656203, 0.20777024, -0.17276664, -0.045803867, -0.14506632, -0.16928527, -0.10008922, 0.18800992, -0.36529806, 0.22592439, -0.118487455, 0.006129823, -0.2674002, 0.37149927, 0.12375746, -0.30488327, 0.2507765, -0.060471725, -0.22705032, 0.39436466, 0.40368417, 0.15569581, 0.083455965, 0.11193783, -0.2783573, -0.23566169, -0.12444999, 0.22503565, 0.43343276, -0.3165808, -0.057086047, 0.050554093, 0.3512633, 0.17572127, 0.19258633, -0.09170296, -0.25344467, 0.018219033, -0.117947415, -0.03234701, -0.1549039, -0.0147800855, 0.076972865, 0.08612865, -0.14120182, -0.18348631, -0.4500436, 0.038739346, 0.12991442, -0.032128494, 0.7483725, -0.09843177, -1.6700389, 0.0060545397, -0.1044135, 1.2469376, 0.32064447, -0.17263599, 0.31999183, 0.0077194544, 0.15370668, 0.59472036, -0.16953614, 0.3042488, 0.25355336, 0.60402286, 0.07441569, -0.12468894, 0.03140718, -0.2630037, -0.37703836, 0.034783553, -0.058904923, 0.022686867, 0.07962498, -0.7945683, -0.21051218, 0.6615892, -0.18747853, -0.25412843, 0.26003888, -1.0803214, -0.026889319, -0.11805089, -0.14200646, -0.019682527, -0.2372327, 0.0090960255, -0.071929, -0.115089305, 0.21781716, -0.3569975, 0.07799677, -0.096894525, -0.34368798, 0.66465, 0.14913023]}]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer() 
    .setInputCols("document")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained()
    .setInputCols(Array("document","token"))
    .setOutputCol("word_embeddings")

val sentence_embeddings = new SentenceEmbeddings() 
    .setInputCols(Array("document", "word_embeddings")) 
    .setOutputCol("sentence_embeddings") 
    .setPoolingStrategy("AVERAGE")

val features_asm = new FeaturesAssembler()
    .setInputCols("sentence_embeddings")
    .setOutputCol("features")

val nlpPipeline = new Pipeline().setStages(Array(
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        features_asm
))

val data = Seq(
  "This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."
).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [0.02474357, -0.08310143, 0.4801927, -0.070223466, 0.33147717, -0.18737249, -0.048361354, -0.052325998, 0.053252153, -0.0067390013, 0.2836935, -0.25569317, 0.3415577, -0.19251995, 0.051623292, -0.25131556, 0.3472208, -0.036604006, -0.35653928, 0.13225944, 0.18795085, -0.09561886, 0.4695179, 0.22093144, 0.32058474, 0.057281215, 0.082858086, -0.3714214, -0.19219379, -0.26751986, -0.148075, 0.6410107, -0.07821157, -0.06398429, 6.32831E-5, 0.21222909, 0.33145514, 0.2575328, 0.009346781, -0.21482512, -0.22197871, -0.14005142, 0.04592571, -0.2919176, 0.011854073, -0.14047821, 0.22201888, -0.13500921, -0.101019345, -0.31175214, -0.0031539474, 0.07841865, 0.23760447, 0.8622971, -0.21095662, -1.9944092, -0.090888076, -0.45743433, 1.5815442, 0.4848822, -0.12528154, 0.33802572, -0.16203907, -0.09874586, 0.63106954, -0.21860953, 0.39005432, 0.25023165, 0.66769457, -0.13867687, 0.02832079, -0.17432508, -0.05764636, -0.44529453, 0.032839067, -0.2266792, -0.002856281, 0.007823931, -1.0165309, 0.08553613, 0.38090998, 0.011592574, -0.18031952, 0.37968582, -0.77948713, -0.068393, -0.029594865, -0.2165647, 0.1665183, -0.23963346, -0.017649503, -0.24768801, -0.2725593, 0.14533372, -0.36786577, 0.23388086, -0.20129707, -0.33582142, 0.5970527, 0.12596472]}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
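
The assembled feature_vector annotations can be unpacked into plain float arrays when a downstream stage expects raw numbers rather than annotation structs. The following sketch assumes the result DataFrame produced above and that the numeric values live in the standard embeddings field of the annotation schema:

from pyspark.sql import functions as F

# Explode the feature_vector annotations and keep only the numeric arrays
feature_arrays = result.selectExpr("explode(features) as feature") \
    .select(F.col("feature.embeddings").alias("feature_array"))

feature_arrays.show(truncate=False)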

FewShotClassifier

FewShotClassifier annotators specifically target few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data.

These annotators provide a valuable capability for handling scenarios where labeled data is scarce or expensive to obtain. By effectively utilizing limited labeled examples, the few-shot classification approach enables the creation of models that can generalize and classify new instances accurately, even with minimal training data.

The FewShotClassifier is designed to process sentence embeddings as input. It generates category annotations, providing labels along with confidence scores that range from 0 to 1.

Input Annotator Types: SENTENCE EMBEDDINGS

Output Annotator Type: CATEGORY

Python API: FewShotClassifierModel Scala API: FewShotClassifierModel
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

bert_sent = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

few_shot_classifier = medical.FewShotClassifierModel.pretrained("few_shot_classifier_age_group_sbiobert_cased_mli", "en", "clinical/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    bert_sent,
    few_shot_classifier
])

data = spark.createDataFrame([
    ["""A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30 years, and was diagnosed with hypertension two years ago. After a detailed physical examination, the doctor found a noticeable wheeze on lung auscultation and prescribed a spirometry test, which showed irreversible airway obstruction. The patient was diagnosed with Chronic obstructive pulmonary disease (COPD) caused by smoking."""],
 ["""Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, low iron.  5 months and still no answers from drs. """],
 ["""Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am on reberprozole and librax.My question is whether chronic gastritis is curable or is it a lifetime condition?I am loosing hope because this dull ache is not going away.Please please reply"""]
    ]).toDF("text")

result = clf_Pipeline.fit(data).transform(data)


# Show results
result.select('prediction.result','text').show(truncate=150)

+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|   result|                                                                                                                                                  text|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|  [Adult]|A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30...|
|  [Child]|Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, l...|
|[Unknown]|Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am o...|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
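
Besides the predicted label, each prediction annotation carries metadata that typically includes the per-class confidence scores mentioned above (the exact metadata keys depend on the pretrained model). A minimal sketch for inspecting them on the result DataFrame:

# Show the predicted label together with its metadata, where the confidence scores are stored
result.selectExpr("explode(prediction) as pred") \
    .selectExpr("pred.result as label", "pred.metadata as metadata") \
    .show(truncate=False)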

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val bert_sent = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence_embeddings")

val few_shot_classifier = FewShotClassifierModel.pretrained("few_shot_classifier_age_group_sbiobert_cased_mli", "en", "clinical/models")
    .setInputCols("sentence_embeddings")
    .setOutputCol("prediction") 
    
val clf_Pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    bert_sent, 
    few_shot_classifier))

val data = Seq(
    ("""A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30 years, and was diagnosed with hypertension two years ago. After a detailed physical examination, the doctor found a noticeable wheeze on lung auscultation and prescribed a spirometry test, which showed irreversible airway obstruction. The patient was diagnosed with Chronic obstructive pulmonary disease (COPD) caused by smoking."""),
 ("""Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, low iron.  5 months and still no answers from drs. """),
 ("""Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am on reberprozole and librax.My question is whether chronic gastritis is curable or is it a lifetime condition?I am loosing hope because this dull ache is not going away.Please please reply""")).toDF("text")
  
val result = clf_Pipeline.fit(data).transform(data)

// Show results
result.select("prediction.result", "text").show(20, 150)

+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|   result|                                                                                                                                                  text|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|  [Adult]|A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30...|
|  [Child]|Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, l...|
|[Unknown]|Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am o...|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+

FewShotClassifier annotators specifically target few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data.

These annotators provide a valuable capability for handling scenarios where labeled data is scarce or expensive to obtain. By effectively utilizing limited labeled examples, the few-shot classification approach enables the creation of models that can generalize and classify new instances accurately, even with minimal training data.

The FewShotClassifier is designed to process sentence embeddings as input. It generates category annotations, providing labels along with confidence scores that range from 0 to 1.

Input Annotator Types: SENTENCE EMBEDDINGS

Output Annotator Type: CATEGORY

Python API: FewShotClassifierApproach Scala API: FewShotClassifierApproach
Show Example

from johnsnowlabs import nlp, medical 

document_asm = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("sentence")

sentence_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["sentence"])\
    .setOutputCol("sentence_embeddings")

graph_builder = medical.TFGraphBuilder()\
    .setModelName("fewshot_classifier")\
    .setInputCols(["sentence_embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFolder("/tmp")\
    .setGraphFile("log_reg_graph.pb")

few_shot_approach = medical.FewShotClassifierApproach()\
    .setLabelColumn("label")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")\
    .setModelFile(f"/tmp/log_reg_graph.pb")\
    .setEpochsNumber(10)\
    .setBatchSize(1)\
    .setLearningRate(0.001)

pipeline = nlp.Pipeline(
    stages=[
        document_asm,
        sentence_embeddings,
        graph_builder,
        few_shot_approach
    ])

model = pipeline.fit(train_data)

import spark.implicits._

val document_asm = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("sentence")

val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols("sentence")
    .setOutputCol("sentence_embeddings")

val few_shot_approach = new FewShotClassifierApproach()
    .setLabelColumn("label")
    .setInputCols("sentence_embeddings")
    .setOutputCol("prediction")
    .setModelFile("tmp/log_reg_graph.pb")
    .setEpochsNumber(10)
    .setBatchSize(1)
    .setLearningRate(0.001) 

val pipeline = new Pipeline().setStages(Array(
    document_asm, 
    sentence_embeddings, 
    few_shot_approach ))

val result = pipeline.fit(train_data).transform(test_data).cache()
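
Both training examples assume a small labeled DataFrame named train_data containing a text column and the column referenced by setLabelColumn. A hypothetical sketch, in Python, of how such a set might be assembled (the texts and labels below are purely illustrative):

train_data = spark.createDataFrame(
    [
        ("The patient is a 2-year-old boy with a persistent cough.", "Child"),
        ("A 45-year-old man presented with chest pain on exertion.", "Adult"),
    ],
    ["text", "label"]  # "label" matches the column set with setLabelColumn
)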

Flattener

Model

The Flattener converts annotation results into a format that is easier to use. This annotator produces a DataFrame with flattened and exploded columns containing the annotation results, making them easier to interpret and analyze. It is particularly useful for extracting and organizing the results obtained from Spark NLP pipelines.

Parameters:

  • inputCols: Input annotations.
  • cleanAnnotations: Whether to remove the annotation columns (default: True).
  • explodeSelectedFields: Dict of input columns to their corresponding selected fields.
  • flattenExplodedColumns: Whether to flatten the exploded columns (default: True).
  • orderByColumn: Specifies the column by which the DataFrame should be ordered.
  • orderDescending: Whether to order the DataFrame in descending order (default: True).

See Spark NLP Workshop for more examples of usage.
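
As a quick illustration of the parameters listed above, the sketch below configures a Flattener with explicit ordering. The setter names are assumed to follow the usual setX convention for these parameters, and the column passed to setOrderByColumn is illustrative:

from johnsnowlabs import medical

# Illustrative configuration only; the setter names are assumed from the parameter names above
flattener = medical.Flattener() \
    .setInputCols(["ner_chunk"]) \
    .setCleanAnnotations(True) \
    .setOrderByColumn("ner_chunk_begin") \
    .setOrderDescending(False)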

Input Annotator Types: ANY

Output Annotator Type: NONE

Show Example

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") \
    .setLabelCasing("upper")

ner_converter = medical.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])

flattener = medical.Flattener()\
    .setInputCols("ner_chunk") \
    .setExplodeSelectedFields({"ner_chunk": ["result as ner_chunks",
                                             "begin as begins",
                                             "end as ends",
                                             "metadata.entity as entities"]})

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    flattener
])

text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
"""

data = spark.createDataFrame([[text]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
result.show(truncate=False)

# result
+----------------------------------+------+----+-------------------------+
|ner_chunks                        |begins|ends|entities                 |
+----------------------------------+------+----+-------------------------+
|distress                          |49    |56  |SYMPTOM                  |
|arcus senilis                     |196   |208 |DISEASE_SYNDROME_DISORDER|
|jugular venous pressure distention|380   |413 |SYMPTOM                  |
|adenopathy                        |428   |437 |SYMPTOM                  |
|tender                            |514   |519 |SYMPTOM                  |
|fullness                          |540   |547 |SYMPTOM                  |
|edema                             |665   |669 |SYMPTOM                  |
|cyanosis                          |679   |686 |VS_FINDING               |
|clubbing                          |692   |699 |SYMPTOM                  |
+----------------------------------+------+----+-------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val clinicalNer = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
  .setLabelCasing("upper")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setWhiteList(Array("SYMPTOM", "VS_FINDING", "DISEASE_SYNDROME_DISORDER", "ADMISSION_DISCHARGE", "PROCEDURE"))

val flattener = new Flattener()
  .setInputCols(Array("ner_chunk"))
  .setExplodeSelectedFields(Map("ner_chunk" -> Array("result", "begin", "end", "metadata.entity")))

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  wordEmbeddings,
  clinicalNer,
  nerConverter,
  flattener
))

val text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
"""

val data = Seq(text).toDF("text")

val result = nlpPipeline.fit(data).transform(data)

// result
+----------------------------------+------+----+-------------------------+
|ner_chunks                        |begins|ends|entities                 |
+----------------------------------+------+----+-------------------------+
|distress                          |49    |56  |SYMPTOM                  |
|arcus senilis                     |196   |208 |DISEASE_SYNDROME_DISORDER|
|jugular venous pressure distention|380   |413 |SYMPTOM                  |
|adenopathy                        |428   |437 |SYMPTOM                  |
|tender                            |514   |519 |SYMPTOM                  |
|fullness                          |540   |547 |SYMPTOM                  |
|edema                             |665   |669 |SYMPTOM                  |
|cyanosis                          |679   |686 |VS_FINDING               |
|clubbing                          |692   |699 |SYMPTOM                  |
+----------------------------------+------+----+-------------------------+

GenericClassifier

Model

Creates a generic single-label classifier which uses pre-generated TensorFlow graphs. The model operates on FEATURE_VECTOR annotations, which can be produced using the FeaturesAssembler.

Parameter:

  • multiClass (Boolean): Whether to return all classes or only the one with the highest score (default: False). See the sketch below.
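
For instance, all labels with their scores can be requested as in the hedged sketch below; the setMultiClass setter name is inferred from the parameter above.

from johnsnowlabs import medical

# Return every label with its confidence score instead of only the top prediction.
generic_classifier = medical.GenericClassifierModel \
    .pretrained("genericclassifier_sdoh_economics_binary_sbiobert_cased_mli", "en", "clinical/models") \
    .setInputCols(["features"]) \
    .setOutputCol("classes") \
    .setMultiClass(True)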

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericClassifierModel Scala API: GenericClassifierModel Notebook: GenericClassifierModelNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

generic_classifier = medical.GenericClassifierModel.pretrained("genericclassifier_sdoh_economics_binary_sbiobert_cased_mli", 'en', 'clinical/models')\
    .setInputCols(["features"])\
    .setOutputCol("classes")

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_embeddings,
        features_asm,
        generic_classifier
])

text = """Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker, 1PPD for 25years. He drinks to beers/night, but has not had any alcohol in past 4 days. No IVDU."""

df = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(df).transform(df)
result.select("text", "classes.result").show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|text                                                                                                                                                                                                  |result|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker, 1PPD for 25years. He drinks to beers/night, but has not had any alcohol in past 4 days. No IVDU.|[True]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document")

val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
  .setInputCols("document") 
  .setOutputCol("sentence_embeddings") 

val features_asm = new FeaturesAssembler()
  .setInputCols("sentence_embeddings")
  .setOutputCol("features") 

val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_economics_binary_sbiobert_cased_mli","en","clinical/models")
  .setInputCols(Array("features")) 
  .setOutputCol("classes") 

val pipeline = new Pipeline().setStages(Array( 
                                            document_assembler, 
                                            sentence_embeddings, 
                                            features_asm, 
                                            generic_classifier )) 

val text = "Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker,1PPD for 25years. He drinks to beers/night,but has not had any alcohol in past 4 days. No IVDU." 

val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)   

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|text                                                                                                                                                                                                  |result|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker, 1PPD for 25years. He drinks to beers/night, but has not had any alcohol in past 4 days. No IVDU.|[True]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+

Approach

Trains a TensorFlow model for generic classification of feature vectors. It takes FEATURE_VECTOR annotations from the FeaturesAssembler as input, classifies them and outputs CATEGORY annotations. Please see the Parameters section for the required training parameters.

Parameters:

  • batchSize: (int) Batch size

  • dropout: (float) Dropout coefficient

  • epochsN: (int) Maximum number of epochs to train

  • featureScaling: (str) Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling)

  • fixImbalance: (boolean) Fix the imbalance in the training set by replicating examples of under represented categories

  • labelColumn: (str) Column with label per each document

  • learningRate: (float) Learning Rate

  • modelFile: (str) Location of file of the model used for classification

  • multiClass: (boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.

  • outputLogsPath: (str) Folder path to save training logs. If no path is specified, the logs won’t be stored in disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).

  • validationSplit: (float) The proportion of the training dataset to be used as a validation set. The model will be validated against this dataset on each epoch and it will not be used for training. The value should be between 0.0 and 1.0.

For a more extensive example please see the Spark NLP Workshop.
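
The modelFile parameter expects a pre-generated TensorFlow graph. Below is a hedged sketch of producing such a file with the TFGraphBuilder, mirroring the GenericLogRegClassifier and GenericSVMClassifier sections further down; the "generic_classifier" model name is an assumption.

from johnsnowlabs import medical

graph_folder = "gc_graph"

# Assumed model name; the TFGraphBuilder usage mirrors the logistic-regression
# and SVM sections further down on this page.
graph_builder = medical.TFGraphBuilder() \
    .setModelName("generic_classifier") \
    .setInputCols(["features"]) \
    .setLabelColumn("target") \
    .setGraphFolder(graph_folder) \
    .setGraphFile("gc_graph.pb")

# The resulting file can then be passed to the approach, e.g.
# .setModelFile(f"{graph_folder}/gc_graph.pb")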

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericClassifierApproach Scala API: GenericClassifierApproach Notebook: GenericClassifierApproachNotebook
Show Example

from johnsnowlabs import nlp, medical

features_asm = medical.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
    .setOutputCol("features")

gen_clf = medical.GenericClassifierApproach() \
    .setLabelColumn("target") \
    .setInputCols(["features"]) \
    .setOutputCol("prediction") \
    .setModelFile("/path/to/graph_file.pb") \
    .setEpochsNumber(50) \
    .setBatchSize(100) \
    .setFeatureScaling("zscore") \
    .setlearningRate(0.001) \
    .setFixImbalance(True) \
    .setOutputLogsPath("logs") \
    .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = nlp.Pipeline().setStages([
    features_asm,
    gen_clf
])

clf_model = pipeline.fit(data)
from johnsnowlabs import nlp, finance

features_asm = finance.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
    .setOutputCol("features")

gen_clf = finance.GenericClassifierApproach() \
    .setLabelColumn("target") \
    .setInputCols(["features"]) \
    .setOutputCol("prediction") \
    .setModelFile("/path/to/graph_file.pb") \
    .setEpochsNumber(50) \
    .setBatchSize(100) \
    .setFeatureScaling("zscore") \
    .setlearningRate(0.001) \
    .setFixImbalance(True) \
    .setOutputLogsPath("logs") \
    .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = nlp.Pipeline().setStages([
    features_asm,
    gen_clf
])

clf_model = pipeline.fit(data)
from johnsnowlabs import nlp, legal

features_asm = legal.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
    .setOutputCol("features")

gen_clf = legal.GenericClassifierApproach() \
    .setLabelColumn("target") \
    .setInputCols(["features"]) \
    .setOutputCol("prediction") \
    .setModelFile("/path/to/graph_file.pb") \
    .setEpochsNumber(50) \
    .setBatchSize(100) \
    .setFeatureScaling("zscore") \
    .setlearningRate(0.001) \
    .setFixImbalance(True) \
    .setOutputLogsPath("logs") \
    .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = nlp.Pipeline().setStages([
    features_asm,
    gen_clf
])

clf_model = pipeline.fit(data)

import spark.implicits._

val features_asm = new FeaturesAssembler()
  .setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
  .setOutputCol("features")

val gen_clf = new GenericClassifierApproach()
  .setLabelColumn("target")
  .setInputCols("features")
  .setOutputCol("prediction")
  .setModelFile("/path/to/graph_file.pb")
  .setEpochsNumber(50)
  .setBatchSize(100)
  .setFeatureScaling("zscore")
  .setlearningRate(0.001f)
  .setFixImbalance(true)
  .setOutputLogsPath("logs")
  .setValidationSplit(0.2f) // keep 20% of the data for validation purposes

val pipeline = new Pipeline().setStages(Array(
  features_asm,
  gen_clf
))

val clf_model = pipeline.fit(data)
import spark.implicits._

val features_asm = new FeaturesAssembler()
  .setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
  .setOutputCol("features")

val gen_clf = new GenericClassifierApproach()
  .setLabelColumn("target")
  .setInputCols("features")
  .setOutputCol("prediction")
  .setModelFile("/path/to/graph_file.pb")
  .setEpochsNumber(50)
  .setBatchSize(100)
  .setFeatureScaling("zscore")
  .setlearningRate(0.001f)
  .setFixImbalance(true)
  .setOutputLogsPath("logs")
  .setValidationSplit(0.2f) // keep 20% of the data for validation purposes

val pipeline = new Pipeline().setStages(Array(
  features_asm,
  gen_clf
))

val clf_model = pipeline.fit(data)
import spark.implicits._

val features_asm = new FeaturesAssembler()
  .setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
  .setOutputCol("features")

val gen_clf = new GenericClassifierApproach()
  .setLabelColumn("target")
  .setInputCols("features")
  .setOutputCol("prediction")
  .setModelFile("/path/to/graph_file.pb")
  .setEpochsNumber(50)
  .setBatchSize(100)
  .setFeatureScaling("zscore")
  .setlearningRate(0.001f)
  .setFixImbalance(true)
  .setOutputLogsPath("logs")
  .setValidationSplit(0.2f) // keep 20% of the data for validation purposes

val pipeline = new Pipeline().setStages(Array(
  features_asm,
  gen_clf
))

val clf_model = pipeline.fit(data)

GenericLogRegClassifier

Approach

GenericLogRegClassifier is a derivative of GenericClassifier which implements a multinomial logistic regression. This is a single layer neural network with the logistic function at the output. The input to the model is FeatureVector and the output is category annotations with labels and corresponding confidence scores varying between 0 and 1.

Parameters:

  • LabelColumn: This parameter sets the name of the column in your input data that contains the labels (categories) for the classification task. The classifier will use this column to learn from the data and make predictions.

  • ModelFile: This parameter specifies the path to the pre-trained model file for the logistic regression classifier. It should be a protobuf file containing the model graph and trained weights.

  • EpochsNumber: This parameter sets the number of epochs (iterations) the classifier will go through during the training process. An epoch represents one complete pass through the entire training dataset.

  • BatchSize: This parameter sets the batch size used during training. The training data is divided into batches, and the model’s weights are updated after processing each batch. A larger batch size may speed up training, but it requires more memory.

  • LearningRate: This parameter sets the learning rate for the optimization algorithm used during training. The learning rate determines how much the model’s weights are updated based on the computed gradients. A higher learning rate may lead to faster convergence but risks overshooting the optimal solution.

  • OutputLogsPath: This parameter specifies the path where the logs related to the training process will be stored. These logs can include information such as training loss, accuracy, and other metrics.

  • Dropout: Dropout is a regularization technique used to prevent overfitting in neural networks. This parameter sets the dropout rate, which determines the probability that each neuron’s output will be temporarily ignored during training.

  • FixImbalance: Imbalance refers to the situation when some classes have significantly more training examples than others. Setting this parameter to True indicates that the classifier will handle class imbalance during training to help ensure that the model doesn’t become biased towards the majority class.

  • ValidationSplit: The fraction of the training data to be used for validation during training; the remaining data is used for the actual training. If set to a value between 0 and 1, the model is validated against this split on each epoch. It is commented out in the example below.

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericLogRegClassifierApproach Scala API: GenericLogRegClassifierApproach Notebook: GenericLogRegClassifierApproachNotebook
Show Example

from johnsnowlabs import nlp, medical

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

graph_folder = "gc_graph"

gc_logreg_graph_builder = medical.TFGraphBuilder()\
    .setModelName("logreg_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("log_reg_graph.pb")

gen_clf = medical.GenericLogRegClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/log_reg_graph.pb")\
    .setEpochsNumber(20)\
    .setBatchSize(128)\
    .setLearningRate(0.01)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_logreg_graph_builder,
    gen_clf])

import spark.implicits._
  
val features_asm = new FeaturesAssembler()
  .setInputCols("sentence_embeddings")
  .setOutputCol("feature_vector")

val gc_logreg_graph_builder = new TFGraphBuilder()
  .setModelName("logreg_classifier")
  .setInputCols("feature_vector")
  .setLabelColumn("category")
  .setGraphFolder("gc_graph")
  .setGraphFile("log_reg_graph.pb")

val gen_clf = new GenericLogRegClassifierApproach()
  .setLabelColumn("category")
  .setInputCols("feature_vector")
  .setOutputCol("prediction")
  .setModelFile("gc_graph/log_reg_graph.pb")
  .setEpochsNumber(20)
  .setBatchSize(128)
  .setLearningRate(0.01)
  .setOutputLogsPath("logs")
  .setDropout(0.1)
  .setFixImbalance(true) // .setValidationSplit(0.1)

val clf_Pipeline = new Pipeline().setStages(Array(features_asm, gc_logreg_graph_builder, gen_clf))

GenericSVMClassifier

Model

Creates a generic single-label classifier which uses pre-generated TensorFlow graphs. The model operates on FEATURE_VECTOR annotations, which can be produced using the FeaturesAssembler.

Parameters:

  • featureScaling: Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling) (default: ‘’)

  • multiClass: Whether to return only the label with the highest confidence score or all labels (default: False)

  • inputCols: previous annotation columns, if renamed (default: [‘features’])

  • outputCol: output annotation column; can be left as default (default: class)
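
A hedged sketch of overriding the multiClass option on the pretrained model; the setMultiClass setter name is inferred from the parameter list above.

from johnsnowlabs import medical

# Return all labels with their scores rather than only the most confident one.
svm_model = medical.GenericSVMClassifierModel \
    .pretrained("generic_svm_classifier_ade", "en", "clinical/models") \
    .setInputCols(["features"]) \
    .setOutputCol("class") \
    .setMultiClass(True)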

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericSVMClassifier Scala API: GenericSVMClassifier Notebook: GenericSVMClassifierNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

generic_classifier = medical.GenericSVMClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("class")

clf_Pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    features_asm,
    generic_classifier])

data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""], 
 ["""I feel a bit drowsy & have a little blurred vision after taking an insulin"""]]).toDF("text")

result = clf_Pipeline.fit(data).transform(data)  # the table below shows results on a larger sample of ADE sentences

+----------------------------------------------------------------------------------------------------+-------+
|                                                                                                text| result|
+----------------------------------------------------------------------------------------------------+-------+
|                       Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False]|
|                             Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...|[False]|
|However, given the clinically significant result to the interaction between tolazoline and cimeti...| [True]|
|                                              How much do novel antipsychotics benefit the patients?|[False]|
| We hypothesize that during interferon therapy, melanocytes may produce more melanin pigment in t...|[False]|
|They seemed to involve multiple aetiological factors, such as autoimmune thyroid disease, the tox...|[False]|
|               Two days after completing this regimen, the patient developed a rash with blistering.| [True]|
|A diagnosis of masked theophylline poisoning should be considered in similar situations involving...|[False]|
| The overall response rate of these 24 refractory lymphomas to gemcitabine-containing regimens wa...|[False]|
|Development of sarcoidosis during interferon alpha 2b and ribavirin combination therapy for chron...| [True]|
|A patient with coccidioidal meningitis was treated with intrathecally administered amphotericin B...|[False]|
|                                                Renal failure associated with the use of dextran-40.|[False]|
| However, with increased experience in applying BCG, the side effects now appear to be less promi...|[False]|
|                        Hepatotoxicity after high-dose methylprednisolone for demyelinating disease.| [True]|
| Histopathological findings included signs of orthokeratotic hyperkeratosis, moderate follicular ...| [True]|
| Acute spontaneous TLS is rare, and it has been described in leukemia and lymphoma and in some pa...|[False]|
|We present a fatal case of subacute methanol toxicity with associated diffuse brain involvement, ...| [True]|
| The reaction was thought to be triggered by the combination of radiation and epidermal growth fa...|[False]|
+----------------------------------------------------------------------------------------------------+-------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")

val sentenceEmbeddings = new SentenceEmbeddings()
  .setInputCols(Array("document", "word_embeddings"))
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

val featuresAssembler = new FeaturesAssembler()
  .setInputCols(Array("sentence_embeddings"))
  .setOutputCol("features")

val genericClassifier = GenericSVMClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")
  .setInputCols("features")
  .setOutputCol("class")

val pipeline = new Pipeline()
  .setStages(Array(
  documentAssembler,
  tokenizer,
  wordEmbeddings,
  sentenceEmbeddings,
  featuresAssembler,
  genericClassifier))

val data = Seq(
  ("""None of the patients required treatment for the overdose."""),
  ("""I feel a bit drowsy & have a little blurred vision after taking an insulin""")
)

val df = data.toDF("text")

val result = pipeline.fit(df).transform(df)


+----------------------------------------------------------------------------------------------------+-------+
|                                                                                                text| result|
+----------------------------------------------------------------------------------------------------+-------+
|                       Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False]|
|                             Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...|[False]|
|However, given the clinically significant result to the interaction between tolazoline and cimeti...| [True]|
|                                              How much do novel antipsychotics benefit the patients?|[False]|
| We hypothesize that during interferon therapy, melanocytes may produce more melanin pigment in t...|[False]|
|They seemed to involve multiple aetiological factors, such as autoimmune thyroid disease, the tox...|[False]|
|               Two days after completing this regimen, the patient developed a rash with blistering.| [True]|
|A diagnosis of masked theophylline poisoning should be considered in similar situations involving...|[False]|
| The overall response rate of these 24 refractory lymphomas to gemcitabine-containing regimens wa...|[False]|
|Development of sarcoidosis during interferon alpha 2b and ribavirin combination therapy for chron...| [True]|
|A patient with coccidioidal meningitis was treated with intrathecally administered amphotericin B...|[False]|
|                                                Renal failure associated with the use of dextran-40.|[False]|
| However, with increased experience in applying BCG, the side effects now appear to be less promi...|[False]|
|                        Hepatotoxicity after high-dose methylprednisolone for demyelinating disease.| [True]|
| Histopathological findings included signs of orthokeratotic hyperkeratosis, moderate follicular ...| [True]|
| Acute spontaneous TLS is rare, and it has been described in leukemia and lymphoma and in some pa...|[False]|
|We present a fatal case of subacute methanol toxicity with associated diffuse brain involvement, ...| [True]|
| The reaction was thought to be triggered by the combination of radiation and epidermal growth fa...|[False]|
+----------------------------------------------------------------------------------------------------+-------+

Approach

GenericSVMClassifier is a derivative of GenericClassifier which implements SVM (Support Vector Machine) classification. The input to the model is FeatureVector and the output is category annotations with labels and corresponding confidence scores. The scores are standardized using the logistic function so that they vary between 0 and 1.

Parameters:

  • batchSize: (int) Batch size

  • dropout: (float) Dropout coefficient

  • epochsNumber: (int) Maximum number of epochs to train

  • featureScaling: (str) Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling)

  • fixImbalance: (boolean) Fix the imbalance in the training set by replicating examples of under represented categories

  • labelColumn: (str) Column with label per each document

  • learningRate: (float) Learning Rate

  • modelFile: (str) Location of file of the model used for classification

  • multiClass: (boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.

  • outputLogsPath: (str) Folder path to save training logs. If no path is specified, the logs won’t be stored in disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage (S3).

  • validationSplit: (float) The proportion of the training dataset to be used as a validation set. The model will be validated against this dataset on each epoch and it will not be used for training. The value should be between 0.0 and 1.0.
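
For example, feature scaling and multi-label output could be enabled on the approach as in the hedged sketch below; the setters are inferred from the parameters above (setMultiClass also appears in the Scala example further down), and the graph path and column names are illustrative.

from johnsnowlabs import medical

# Standardize the feature vectors with z-scores and return all labels with scores.
svm_clf = medical.GenericSVMClassifierApproach() \
    .setLabelColumn("category") \
    .setInputCols(["feature_vector"]) \
    .setOutputCol("prediction") \
    .setModelFile("svm_graph.pb") \
    .setFeatureScaling("zscore") \
    .setMultiClass(True) \
    .setEpochsNumber(10) \
    .setBatchSize(32)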

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericSVMClassifier Scala API: GenericSVMClassifier Notebook: GenericSVMClassifierNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["document", "word_embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddings_pipeline = nlp.Pipeline(stages = [
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
])

trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")

graph_folder = "graph_folder"

gc_svm_graph_builder = medical.TFGraphBuilder()\
    .setModelName("svm_classifier")\
    .setInputCols(["feature_vector"]) \
    .setLabelColumn("category")\
    .setGraphFolder(graph_folder)\
    .setGraphFile("svm_graph.pb")

features_asm = medical.FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("feature_vector")

gen_clf = medical.GenericSVMClassifierApproach()\
    .setLabelColumn("category")\
    .setInputCols("feature_vector")\
    .setOutputCol("prediction")\
    .setModelFile(f"{graph_folder}/svm_graph.pb")\
    .setEpochsNumber(2)\
    .setBatchSize(128)\
    .setLearningRate(0.015)\
    .setOutputLogsPath(log_folder)\
    .setDropout(0.1)\
    .setFixImbalance(True)\
    # .setValidationSplit(0.1)

clf_Pipeline = nlp.Pipeline(stages=[
    features_asm,
    gc_svm_graph_builder,
    gen_clf])

model = clf_Pipeline.fit(trainingData_with_embeddings)
model.stages[-1].write().overwrite().save('/model_path/model_name')

# sample training data
   text                                                category
0  Clioquinol intoxication occurring in the trea...    neg
1  "Retinoic acid syndrome" was prevented with s...    neg
2  BACKGROUND: External beam radiation therapy o...    neg
3  Although the enuresis ceased, she developed t...    neg
4  A 42-year-old woman had uneventful bilateral ...    neg

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  
val sentenceEmbeddings = BertSentenceEmbeddings
  .pretrained()
  .setInputCols(Array("document"))
  .setOutputCol("sentence_embedding")

val featuresAssembler = new FeaturesAssembler()
  .setInputCols(Array("sentence_embedding"))
  .setOutputCol("feature_vector")

val svmClassifier = new GenericSVMClassifierApproach()
  .setInputCols("feature_vector")
  .setOutputCol("prediction")
  .setLabelColumn("label")
  .setModelFile("src/test/resources/classification/svm_graph.pb") 
  .setEpochsNumber(10)
  .setBatchSize(1)
  .setMultiClass(false)
  .setlearningRate(0.01f)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceEmbeddings,
  featuresAssembler,
  svmClassifier
))

val model = pipeline.fit(trainingData)

IOBTagger

Model

The IOBTagger merges chunk-based outputs (for example, those produced by NerConverter or ChunkMerge) with tokens to generate token-level Named Entity Recognition (NER) tags in IOB format. These tags identify and categorize specific entities within a given text, so that entity information and context can be read directly at the token level. For example, the output columns of NerConverter and Tokenizer can be used as inputs to merge.

Input Annotator Types: TOKEN, CHUNK

Output Annotator Type: NAMED_ENTITY

Python API: IOBTagger Scala API: IOBTagger Notebook: IOBTaggerNotebook
Show Example

from johnsnowlabs import nlp, medical 
# Pipeline stages are defined where NER is done. NER is converted to chunks.

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embs")

nerModel = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
  .setInputCols(["sentence", "token", "embs"])\
  .setOutputCol("ner")

nerConverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")

# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = medical.IOBTagger()\
  .setInputCols(["token", "ner_chunk"])\
  .setOutputCol("ner_label")

pipeline = nlp.Pipeline(stages=[docAssembler,
                            sentenceDetector,
                            tokenizer,
                            embeddings,
                            nerModel,
                            nerConverter,
                            iobTagger])

text = "The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO."
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# chunk level result
result.selectExpr("explode(ner_chunk) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.result as ner_chunk",
              "a.metadata.entity as ner_label").show(50, False)

+-----+---+-------------------------------------------+---------+
|begin|end|ner_chunk                                  |ner_label|
+-----+---+-------------------------------------------+---------+
|27   |50 |1 capsule of Advil 10 mg                   |DRUG     |
|52   |61 |for 5 days                                 |DURATION |
|67   |109|magnesium hydroxide 100mg/1ml suspension PO|DRUG     |
+-----+---+-------------------------------------------+---------+

# token level result
result.selectExpr("explode(ner_label) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.metadata.word as word",
              "a.result as chunk").show(50, False)

+-----+---+----------+----------+
|begin|end|word      |chunk     |
+-----+---+----------+----------+
|0    |2  |The       |0         |
|4    |10 |patient   |0         |
|12   |14 |was       |0         |
|16   |25 |prescribed|0         |
|27   |27 |1         |B-DRUG    |
|29   |35 |capsule   |I-DRUG    |
|37   |38 |of        |I-DRUG    |
|40   |44 |Advil     |I-DRUG    |
|46   |47 |10        |I-DRUG    |
|49   |50 |mg        |I-DRUG    |
|52   |54 |for       |B-DURATION|
|56   |56 |5         |I-DURATION|
|58   |61 |days      |I-DURATION|
|63   |65 |and       |0         |
|67   |75 |magnesium |B-DRUG    |
|77   |85 |hydroxide |I-DRUG    |
|87   |95 |100mg/1ml |I-DRUG    |
|97   |106|suspension|I-DRUG    |
|108  |109|PO        |I-DRUG    |
|110  |110|.         |0         |
+-----+---+----------+----------+
from johnsnowlabs import nlp, finance
# Pipeline stages are defined where NER is done. NER is converted to chunks.

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embs")

nerModel = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
  .setInputCols(["sentence", "token", "embs"])\
  .setOutputCol("ner")

nerConverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")

# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = finance.IOBTagger()\
  .setInputCols(["token", "ner_chunk"])\
  .setOutputCol("ner_label")

pipeline = nlp.Pipeline(stages=[docAssembler,
                            sentenceDetector,
                            tokenizer,
                            embeddings,
                            nerModel,
                            nerConverter,
                            iobTagger])

text = """In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)")."""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)  

# chunk level result
result.selectExpr("explode(ner_chunk) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.result as ner_chunk",
              "a.metadata.entity as ner_label").show(50, False)

+-----+---+------------------------------+---------+
|begin|end|ner_chunk                     |ner_label|
+-----+---+------------------------------+---------+
|39   |68 |Spell Security Private Limited|ORG      |
|86   |99 |Spell Security                |ALIAS    |
|129  |138|Compliance                    |PRODUCT  |
|150  |166|Policy Compliance             |PRODUCT  |
|169  |170|PC                            |ALIAS    |
+-----+---+------------------------------+---------+

# token level result
result.selectExpr("explode(ner_label) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.metadata.word as word",
              "a.result as chunk").show(50, False)

+-----+---+------------+---------+
|begin|end|word        |chunk    |
+-----+---+------------+---------+
|0    |1  |In          |0        |
|3    |6  |2020        |0        |
|7    |7  |,           |0        |
|9    |10 |we          |0        |
|12   |19 |acquired    |0        |
|21   |27 |certain     |0        |
|29   |34 |assets      |0        |
|36   |37 |of          |0        |
|39   |43 |Spell       |B-ORG    |
|45   |52 |Security    |I-ORG    |
|54   |60 |Private     |I-ORG    |
|62   |68 |Limited     |I-ORG    |
|70   |70 |(           |0        |
|71   |74 |also        |0        |
|76   |80 |known       |0        |
|82   |83 |as          |0        |
|85   |85 |"           |0        |
|86   |90 |Spell       |B-ALIAS  |
|92   |99 |Security    |I-ALIAS  |
|100  |102|").         |0        |
|104  |107|More        |0        |
|109  |120|specifically|0        |
|121  |121|,           |0        |
|123  |127|their       |0        |
|129  |138|Compliance  |B-PRODUCT|
|140  |146|product     |0        |
|148  |148|-           |0        |
|150  |155|Policy      |B-PRODUCT|
|157  |166|Compliance  |I-PRODUCT|
|168  |168|(           |0        |
|169  |170|PC          |B-ALIAS  |
|171  |174|)").        |0        |
+-----+---+------------+---------+
from johnsnowlabs import nlp, legal 
# Pipeline stages are defined where NER is done. NER is converted to chunks.

docAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols(["sentence"])\
  .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embs")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
  .setInputCols(["sentence", "token", "embs"])\
  .setOutputCol("ner")

nerConverter = nlp.NerConverter()\
  .setInputCols(["sentence", "token", "ner"])\
  .setOutputCol("ner_chunk")

# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = legal.IOBTagger()\
  .setInputCols(["token", "ner_chunk"])\
  .setOutputCol("ner_label")

pipeline = nlp.Pipeline(stages=[docAssembler, 
                            sentenceDetector, 
                            tokenizer, 
                            embeddings, 
                            ner_model, 
                            nerConverter, 
                            iobTagger])

text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
"""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df) 

# chunk level result
result.selectExpr("explode(ner_chunk) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.result as ner_chunk",
              "a.metadata.entity as ner_label").show(50, False)

+-----+---+-----------------------------------+---------+
|begin|end|ner_chunk                          |ner_label|
+-----+---+-----------------------------------+---------+
|141  |165|Armstrong Flooring, Inc.,          |ORG      |
|192  |197|Seller                             |ALIAS    |
|205  |221|AFI Licensing LLC                  |ORG      |
|263  |271|Licensing                          |ALIAS    |
|292  |297|Seller                             |ALIAS    |
|301  |307|Arizona                            |ALIAS    |
|315  |330|AHF Holding, Inc                   |ORG      |
|399  |403|Buyer                              |ALIAS    |
|411  |445|Armstrong Hardwood Flooring Company|ORG      |
|478  |484|Company                            |ALIAS    |
|505  |509|Buyer                              |ALIAS    |
|516  |529|Buyer Entities                     |ALIAS    |
|542  |548|Arizona                            |ALIAS    |
|574  |587|Buyer Entities                     |ALIAS    |
|611  |615|Party                              |ALIAS    |
|641  |647|Parties                            |ALIAS    |
+-----+---+-----------------------------------+---------+

# token level result
result.selectExpr("explode(ner_label) as a") \
  .selectExpr("a.begin",
              "a.end",
              "a.metadata.word as word",
              "a.result as chunk").show(50, False)

+-----+---+------------+-------+
|begin|end|word        |chunk  |
+-----+---+------------+-------+
|0    |3  |This        |0      |
|5    |16 |INTELLECTUAL|0      |
|18   |25 |PROPERTY    |0      |
|27   |35 |AGREEMENT   |0      |
|37   |37 |(           |0      |
|38   |41 |this        |0      |
|43   |43 |"           |0      |
|44   |52 |Agreement   |0      |
|53   |55 |"),         |0      |
|57   |61 |dated       |0      |
|63   |64 |as          |0      |
|66   |67 |of          |0      |
|69   |76 |December    |0      |
|78   |79 |31          |0      |
|80   |80 |,           |0      |
|82   |85 |2018        |0      |
|87   |87 |(           |0      |
|88   |90 |the         |0      |
|92   |92 |"           |0      |
|93   |101|Effective   |0      |
|103  |106|Date        |0      |
|107  |108|")          |0      |
|110  |111|is          |0      |
|113  |119|entered     |0      |
|121  |124|into        |0      |
|126  |127|by          |0      |
|129  |131|and         |0      |
|133  |139|between     |0      |
|141  |149|Armstrong   |B-ORG  |
|151  |158|Flooring    |I-ORG  |
|159  |159|,           |I-ORG  |
|161  |163|Inc         |I-ORG  |
|164  |165|.,          |I-ORG  |
|167  |167|a           |0      |
|169  |176|Delaware    |0      |
|178  |188|corporation |0      |
|190  |191|("          |0      |
|192  |197|Seller      |B-ALIAS|
|198  |199|")          |0      |
|201  |203|and         |0      |
|205  |207|AFI         |B-ORG  |
|209  |217|Licensing   |I-ORG  |
|219  |221|LLC         |I-ORG  |
|222  |222|,           |0      |
|224  |224|a           |0      |
|226  |233|Delaware    |0      |
|235  |241|limited     |0      |
|243  |251|liability   |0      |
|253  |259|company     |0      |
|261  |262|("          |0      |
+-----+---+------------+-------+
only showing top 50 rows

import spark.implicits._

// Pipeline stages are defined where NER is done. NER is converted to chunks. 
val docAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
  .setInputCols("document") 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols("sentence") 
  .setOutputCol("token") 

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embs") 

val nerModel = MedicalNerModel.pretrained("ner_posology_greedy","en","clinical/models")
  .setInputCols(Array("sentence","token","embs")) 
  .setOutputCol("ner") 

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 
  
// Define the IOB tagger,which needs tokens and chunks as input. Show results. 
val iobTagger = new IOBTagger()
  .setInputCols(Array("token","ner_chunk")) 
  .setOutputCol("ner_label") 

val pipeline = new Pipeline().setStages(Array(
                                              docAssembler,
                                              sentenceDetector,
                                              tokenizer,
                                              embeddings, 
                                              nerModel,
                                              nerConverter,
                                              iobTagger)) 

val text = "The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO." 
val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)

// chunk level result
+-----+---+-------------------------------------------+---------+
|begin|end|ner_chunk                                  |ner_label|
+-----+---+-------------------------------------------+---------+
|27   |50 |1 capsule of Advil 10 mg                   |DRUG     |
|52   |61 |for 5 days                                 |DURATION |
|67   |109|magnesium hydroxide 100mg/1ml suspension PO|DRUG     |
+-----+---+-------------------------------------------+---------+

// token level result
+-----+---+----------+----------+
|begin|end|word      |chunk     |
+-----+---+----------+----------+
|0    |2  |The       |0         |
|4    |10 |patient   |0         |
|12   |14 |was       |0         |
|16   |25 |prescribed|0         |
|27   |27 |1         |B-DRUG    |
|29   |35 |capsule   |I-DRUG    |
|37   |38 |of        |I-DRUG    |
|40   |44 |Advil     |I-DRUG    |
|46   |47 |10        |I-DRUG    |
|49   |50 |mg        |I-DRUG    |
|52   |54 |for       |B-DURATION|
|56   |56 |5         |I-DURATION|
|58   |61 |days      |I-DURATION|
|63   |65 |and       |0         |
|67   |75 |magnesium |B-DRUG    |
|77   |85 |hydroxide |I-DRUG    |
|87   |95 |100mg/1ml |I-DRUG    |
|97   |106|suspension|I-DRUG    |
|108  |109|PO        |I-DRUG    |
|110  |110|.         |0         |
+-----+---+----------+----------+
import spark.implicits._

// Pipeline stages are defined where NER is done. NER is converted to chunks. 
val docAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
  .setInputCols("document") 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embs") 

val nerModel = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
  .setInputCols(Array("sentence","token","embs")) 
  .setOutputCol("ner") 

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence","token","ner"))
  .setOutputCol("ner_chunk") 

// Define the IOB tagger,which needs tokens and chunks as input. Show results. 
val iobTagger = new IOBTagger()
  .setInputCols(Array("token","ner_chunk")) 
  .setOutputCol("ner_label") 

val pipeline = new Pipeline().setStages(Array(
                                              docAssembler,
                                              sentenceDetector,
                                              tokenizer,
                                              embeddings, 
                                              nerModel, 
                                              nerConverter,
                                              iobTagger)) 

val text = """In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security") . More specifically,their Compliance product - Policy Compliance (PC)).""" 
val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)

// chunk level result
+-----+---+------------------------------+---------+
|begin|end|ner_chunk                     |ner_label|
+-----+---+------------------------------+---------+
|39   |68 |Spell Security Private Limited|ORG      |
|86   |99 |Spell Security                |ALIAS    |
|129  |138|Compliance                    |PRODUCT  |
|150  |166|Policy Compliance             |PRODUCT  |
|169  |170|PC                            |ALIAS    |
+-----+---+------------------------------+---------+

// token level result
+-----+---+------------+---------+
|begin|end|word        |chunk    |
+-----+---+------------+---------+
|0    |1  |In          |0        |
|3    |6  |2020        |0        |
|7    |7  |,           |0        |
|9    |10 |we          |0        |
|12   |19 |acquired    |0        |
|21   |27 |certain     |0        |
|29   |34 |assets      |0        |
|36   |37 |of          |0        |
|39   |43 |Spell       |B-ORG    |
|45   |52 |Security    |I-ORG    |
|54   |60 |Private     |I-ORG    |
|62   |68 |Limited     |I-ORG    |
|70   |70 |(           |0        |
|71   |74 |also        |0        |
|76   |80 |known       |0        |
|82   |83 |as          |0        |
|85   |85 |"           |0        |
|86   |90 |Spell       |B-ALIAS  |
|92   |99 |Security    |I-ALIAS  |
|100  |102|").         |0        |
|104  |107|More        |0        |
|109  |120|specifically|0        |
|121  |121|,           |0        |
|123  |127|their       |0        |
|129  |138|Compliance  |B-PRODUCT|
|140  |146|product     |0        |
|148  |148|-           |0        |
|150  |155|Policy      |B-PRODUCT|
|157  |166|Compliance  |I-PRODUCT|
|168  |168|(           |0        |
|169  |170|PC          |B-ALIAS  |
|171  |174|)").        |0        |
+-----+---+------------+---------+
import spark.implicits._

// Pipeline stages are defined where NER is done. NER is converted to chunks. 
val docAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
  .setInputCols("document") 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols("sentence") 
  .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embs") 

val ner_model = LegalNerModel.pretrained("legner_orgs_prods_alias","en","legal/models")
  .setInputCols(Array("sentence","token","embs")) 
  .setOutputCol("ner") 

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 
  
// Define the IOB tagger,which needs tokens and chunks as input. Show results. 
val iobTagger = new IOBTagger()
  .setInputCols(Array("token","ner_chunk")) 
  .setOutputCol("ner_label") 

val pipeline = new Pipeline().setStages(Array(
                                              docAssembler,
                                              sentenceDetector,
                                              tokenizer,
                                              embeddings, 
                                              ner_model,
                                              nerConverter,
                                              iobTagger)) 

val text = """This
 INTELLECTUAL PROPERTY AGREEMENT (this "Agreement") ,dated as of December 31,2018 (the "Effective Date") is entered into by and between Armstrong Flooring,Inc.,a Delaware corporation ("Seller") and AFI Licensing LLC,a Delaware limited liability company ("Licensing" and together with Seller,"Arizona") and AHF Holding,Inc. (formerly known as Tarzan HoldCo,Inc.) ,a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company,a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand,a "Party" and collectively,the "Parties") .""" 
val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)

// chunk level result
+-----+---+-----------------------------------+---------+
|begin|end|ner_chunk                          |ner_label|
+-----+---+-----------------------------------+---------+
|141  |165|Armstrong Flooring, Inc.,          |ORG      |
|192  |197|Seller                             |ALIAS    |
|205  |221|AFI Licensing LLC                  |ORG      |
|263  |271|Licensing                          |ALIAS    |
|292  |297|Seller                             |ALIAS    |
|301  |307|Arizona                            |ALIAS    |
|315  |330|AHF Holding, Inc                   |ORG      |
|399  |403|Buyer                              |ALIAS    |
|411  |445|Armstrong Hardwood Flooring Company|ORG      |
|478  |484|Company                            |ALIAS    |
|505  |509|Buyer                              |ALIAS    |
|516  |529|Buyer Entities                     |ALIAS    |
|542  |548|Arizona                            |ALIAS    |
|574  |587|Buyer Entities                     |ALIAS    |
|611  |615|Party                              |ALIAS    |
|641  |647|Parties                            |ALIAS    |
+-----+---+-----------------------------------+---------+

// token level result
+-----+---+------------+-------+
|begin|end|word        |chunk  |
+-----+---+------------+-------+
|0    |3  |This        |0      |
|5    |16 |INTELLECTUAL|0      |
|18   |25 |PROPERTY    |0      |
|27   |35 |AGREEMENT   |0      |
|37   |37 |(           |0      |
|38   |41 |this        |0      |
|43   |43 |"           |0      |
|44   |52 |Agreement   |0      |
|53   |55 |"),         |0      |
|57   |61 |dated       |0      |
|63   |64 |as          |0      |
|66   |67 |of          |0      |
|69   |76 |December    |0      |
|78   |79 |31          |0      |
|80   |80 |,           |0      |
|82   |85 |2018        |0      |
|87   |87 |(           |0      |
|88   |90 |the         |0      |
|92   |92 |"           |0      |
|93   |101|Effective   |0      |
|103  |106|Date        |0      |
|107  |108|")          |0      |
|110  |111|is          |0      |
|113  |119|entered     |0      |
|121  |124|into        |0      |
|126  |127|by          |0      |
|129  |131|and         |0      |
|133  |139|between     |0      |
|141  |149|Armstrong   |B-ORG  |
|151  |158|Flooring    |I-ORG  |
|159  |159|,           |I-ORG  |
|161  |163|Inc         |I-ORG  |
|164  |165|.,          |I-ORG  |
|167  |167|a           |0      |
|169  |176|Delaware    |0      |
|178  |188|corporation |0      |
|190  |191|("          |0      |
|192  |197|Seller      |B-ALIAS|
|198  |199|")          |0      |
|201  |203|and         |0      |
|205  |207|AFI         |B-ORG  |
|209  |217|Licensing   |I-ORG  |
|219  |221|LLC         |I-ORG  |
|222  |222|,           |0      |
|224  |224|a           |0      |
|226  |233|Delaware    |0      |
|235  |241|limited     |0      |
|243  |251|liability   |0      |
|253  |259|company     |0      |
|261  |262|("          |0      |
+-----+---+------------+-------+
only showing top 50 rows

InternalDocumentSplitter

Model

InternalDocumentSplitter splits large documents into smaller documents. Its setSplitMode method decides how documents are split.

If splitMode is recursive, it applies the separators in order and splits subtexts that exceed the chunk length, with optional overlap between the resulting chunks.

Additionally, you can set

  • custom patterns with setSplitPatterns
  • whether patterns should be interpreted as regex with setPatternsAreRegex
  • whether to keep the separators with setKeepSeparators
  • whether to trim whitespaces with setTrimWhitespace
  • whether to explode the splits to individual rows with setExplodeSplits

Parameters:

  • chunkSize: Size of each chunk of text. This param is applicable only for “recursive” splitMode.
  • chunkOverlap: Length of the overlap between text chunks, by default 0. This param is applicable only for “recursive” splitMode.
  • splitPatterns: Patterns used to split the document.
  • patternsAreRegex: Whether to interpret the split patterns as regular expressions, by default True.
  • keepSeparators: Whether to keep the separators in the final result, by default True. This param is applicable only for “recursive” splitMode.
  • explodeSplits: Whether to explode split chunks to separate rows, by default False.
  • trimWhitespace: Whether to trim whitespace from the extracted chunks, by default True.
  • splitMode: The split mode that determines how text should be segmented. Default: ‘regex’. It should be one of the following values:
    • “char”: Split text based on individual characters.
    • “token”: Split text based on tokens. You should supply tokens from inputCols.
    • “sentence”: Split text based on sentences. You should supply sentences from inputCols.
    • “recursive”: Split text recursively using a specific algorithm.
    • “regex”: Split text based on a regular expression pattern (see the sketch after this list).
  • sentenceAwareness: Whether to split the document with sentence awareness if possible.
    • If true, the split process can stop before maxLength.
    • If true, you should supply sentences from inputCols. Default: False.
    • This param is not applicable for regex and recursive splitMode.
  • maxLength: The maximum length allowed for splitting. How the length is measured depends on the mode:
    • “char”: Maximum length is measured in characters. Default: 512
    • “token”: Maximum length is measured in tokens. Default: 128
    • “sentence”: Maximum length is measured in sentences. Default: 8
  • customBoundsStrategy: The custom bounds strategy for text splitting using regular expressions. This param is applicable only for regex splitMode.
  • caseSensitive: Whether to use case-sensitive matching when applying regex patterns, by default False. This param is applicable only for regex splitMode.
  • metaDataFields: Columns whose values should be added to the metadata of the split documents. Set the names of the columns to read.
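
For instance, a minimal sketch of a regex-based configuration, assuming the same johnsnowlabs imports as the example below (the section-header pattern is purely illustrative):

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Illustrative regex-mode configuration: split whenever an all-caps section
# header such as "HISTORY:" is found. The pattern itself is only an example.
section_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setSplitPatterns(["[A-Z][A-Z ]+:"])\
    .setPatternsAreRegex(True)\
    .setTrimWhitespace(True)\
    .setExplodeSplits(True)

pipeline = nlp.Pipeline().setStages([document_assembler, section_splitter])

The chunkSize and chunkOverlap parameters are omitted here because they only apply to the recursive splitMode.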

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Notebook: InternalDocumentSplitterNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

document_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("recursive")\
    .setChunkSize(100)\
    .setChunkOverlap(3)\
    .setExplodeSplits(True)\
    .setPatternsAreRegex(False)\
    .setSplitPatterns(["\n\n", "\n", " "])\
    .setKeepSeparators(False)\
    .setTrimWhitespace(True)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    document_splitter
])

df = spark.createDataFrame([[(
    "The patient is a 28-year-old, who is status post gastric bypass surgery"
    " nearly one year ago. \nHe has lost about 200 pounds and was otherwise doing well"
    " until yesterday evening around 7:00-8:00 when he developed nausea and right upper quadrant pain,"
    " which apparently wrapped around toward his right side and back. He feels like he was on it"
    " but has not done so. He has overall malaise and a low-grade temperature of 100.3."
    " \n\nHe denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday."
    " He denies any outright chills or blood per rectum."
)]]).toDF("text")


pipeline_df = pipeline.fit(df).transform(df).select("splits").show(truncate=False)

## Result

+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 92, The patient is a 28-year-old, who is status post gastric bypass surgery nearly one year ago., {sentence -> 0, document -> 0}, []}]          |
|[{document, 94, 192, He has lost about 200 pounds and was otherwise doing well until yesterday evening around 7:00-8:00, {sentence -> 0, document -> 1}, []}]  |
|[{document, 193, 291, when he developed nausea and right upper quadrant pain, which apparently wrapped around toward his, {sentence -> 0, document -> 2}, []}] |
|[{document, 288, 387, his right side and back. He feels like he was on it but has not done so. He has overall malaise and, {sentence -> 0, document -> 3}, []}]|
|[{document, 384, 421, and a low-grade temperature of 100.3., {sentence -> 0, document -> 4}, []}]                                                              |
|[{document, 424, 520, He denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday. He, {sentence -> 0, document -> 5}, []}]   |
|[{document, 518, 568, He denies any outright chills or blood per rectum., {sentence -> 0, document -> 6}, []}]                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val document_splitter = new InternalDocumentSplitter()
    .setInputCols("document")
    .setOutputCol("splits")
    .setSplitMode("recursive")
    .setChunkSize(100)
    .setChunkOverlap(3)
    .setExplodeSplits(true)
    .setPatternsAreRegex(false)
    .setSplitPatterns(Array("\n\n", "\n", " "))
    .setKeepSeparators(false)
    .setTrimWhitespace(true)

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    document_splitter ))


val test_data = Seq( "The patient is a 28-year-old, who is status post gastric bypass surgery"
    " nearly one year ago. \nHe has lost about 200 pounds and was otherwise doing well"
    " until yesterday evening around 7:00-8:00 when he developed nausea and right upper quadrant pain,"
    " which apparently wrapped around toward his right side and back. He feels like he was on it"
    " but has not done so. He has overall malaise and a low-grade temperature of 100.3."
    " \n\nHe denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday."
    " He denies any outright chills or blood per rectum.").toDF("text")

val res = mapperPipeline.fit(test_data).transform(test_data)

// Show results

+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 92, The patient is a 28-year-old, who is status post gastric bypass surgery nearly one year ago., {sentence -> 0, document -> 0}, []}]          |
|[{document, 94, 192, He has lost about 200 pounds and was otherwise doing well until yesterday evening around 7:00-8:00, {sentence -> 0, document -> 1}, []}]  |
|[{document, 193, 291, when he developed nausea and right upper quadrant pain, which apparently wrapped around toward his, {sentence -> 0, document -> 2}, []}] |
|[{document, 288, 387, his right side and back. He feels like he was on it but has not done so. He has overall malaise and, {sentence -> 0, document -> 3}, []}]|
|[{document, 384, 421, and a low-grade temperature of 100.3., {sentence -> 0, document -> 4}, []}]                                                              |
|[{document, 424, 520, He denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday. He, {sentence -> 0, document -> 5}, []}]   |
|[{document, 518, 568, He denies any outright chills or blood per rectum., {sentence -> 0, document -> 6}, []}]                                                 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

NameChunkObfuscator

Model

The NameChunkObfuscator annotator transforms a dataset with an input annotation of type CHUNK into an obfuscated version by obfuscating the given chunks. It can replace name entities with consistent fake names while leaving the rest of the text unchanged.

Obfuscation refers to the process of de-identifying or removing sensitive patient information from clinical notes or other healthcare documents. The purpose of PHI obfuscation is to protect patient privacy and comply with regulations such as the Health Insurance Portability and Accountability Act (HIPAA).

It is important to note that the obfuscation should be done carefully to ensure that the de-identified data cannot be re-identified. Organizations must follow best practices and adhere to applicable regulations to protect patient privacy and maintain data security.

Parameters:

  • seed: The seed used to select names during obfuscation. With a fixed seed, you can replay an execution several times and get the same output (see the sketch after this list).

  • obfuscateRefSource: The source of the obfuscation names, one of ‘both’, ‘faker’, or ‘file’. Default: ‘both’.

  • language: The language used to select faker names. Supported values: ‘en’ (English), ‘de’ (German), ‘es’ (Spanish), ‘fr’ (French), or ‘ro’ (Romanian). Default: ‘en’.

  • sameLength: Whether to select replacement names of the same length as the original ones during obfuscation. Example: ‘John’ –> ‘Mike’. Default: true.

  • nameEntities: The entity labels to obfuscate. The supported name entities are NAME, PATIENT, and DOCTOR. Default: ‘NAME’.

  • genderAwareness: Whether to use gender-aware names during obfuscation. This param affects only names. Default: False.
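
As a minimal sketch of a reproducible, faker-based configuration (the values are illustrative, and the setSeed, setLanguage, and setSameLength setter names are assumed from the parameter names above, following the usual set<Param> convention):

from johnsnowlabs import medical

# Illustrative configuration: a fixed seed so repeated runs yield the same
# replacements, English faker names, and same-length replacement names.
# setSeed, setLanguage, and setSameLength are assumed setter names.
nameChunkObfuscator = medical.NameChunkObfuscator()\
    .setInputCols("ner_chunk")\
    .setOutputCol("replacement")\
    .setObfuscateRefSource("faker")\
    .setSeed(42)\
    .setLanguage("en")\
    .setSameLength(True)\
    .setNameEntities(["NAME"])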

Input Annotator Types: CHUNK

Output Annotator Type: CHUNK

Python API: NameChunkObfuscator Scala API: NameChunkObfuscator Notebook: NameChunkObfuscatorNotebook
Show Example

from johnsnowlabs import medical, nlp

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nameChunkObfuscator = medical.NameChunkObfuscator()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefSource("faker")\
  .setNameEntities(["DOCTOR", "PATIENT"])\
  .setGenderAwareness(True)

replacer_name = medical.Replacer()\
  .setInputCols("replacement","sentence")\
  .setOutputCol("obfuscated_sentence_name")\
  .setUseReplacement(True)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

## sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"),
              F.expr("cols['1']").alias("obfuscated_sentence_name")).show(truncate=50)


| sentence                                          | obfuscated_sentence_name                                  |
| ------------------------------------------------- | --------------------------------------------------------- |
| Record date : 2093-01-13 , David Hale , M.D .     | Record date : 2093-01-13 , Richardson , M.D .             |
| , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Thaxter , MR # 7194334 Date ...          |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Adelaida , 25 years-old , Record date : ...         |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str...         |
| Analyzed by Dr. Jennifer .                        | Analyzed by Dr. Morganne .                                |

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val nameChunkObfuscator = new NameChunkObfuscator()
    .setInputCols("ner_chunk")
    .setOutputCol("replacement")
    .setObfuscateRefSource("faker")
    .setNameEntities(Array("DOCTOR", "PATIENT"))
    .setGenderAwareness(true)

val replacer_name = new Replacer()
    .setInputCols(Array("replacement","sentence"))
    .setOutputCol("obfuscated_sentence_name")
    .setUseReplacement(true)

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name))

val data = Seq("Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .").toDF("text")

val result = nlpPipeline.fit(data).transform(data)


| sentence                                          | obfuscated_sentence_name                                  |
| ------------------------------------------------- | --------------------------------------------------------- |
| Record date : 2093-01-13 , David Hale , M.D .     | Record date : 2093-01-13 , Richardson , M.D .             |
| , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Thaxter , MR # 7194334 Date ...          |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Adelaida , 25 years-old , Record date : ...         |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str...         |
| Analyzed by Dr. Jennifer .                        | Analyzed by Dr. Morganne .                                |

Approach

NameChunkObfuscatorApproach can be used in de-identification tasks to replace doctor and patient names with fake names taken from a reference document.

Input Annotator Types: CHUNK

Output Annotator Type: CHUNK

Python API: NameChunkObfuscatorApproach Scala API: NameChunkObfuscatorApproach Notebook: NameChunkObfuscatorApproachNotebook
Show Example

from johnsnowlabs import medical, nlp

names = """Mitchell-NAME
Clifford-NAME
Jeremiah-NAME
Lawrence-NAME
Brittany-NAME
Patricia-NAME
Jennifer-NAME
Jackson-NAME
Leonard-NAME
Randall-NAME
Camacho-NAME
Ferrell-NAME
Mueller-NAME
Bowman-NAME
Hansen-NAME
"""

with open('names_test2.txt', 'w') as file:
    file.write(names)

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setObfuscateRefFile("names_test2.txt")\
  .setObfuscateRefSource("file")\
  .setRefFileFormat("csv")\
  .setRefSep("-")

replacer_name = medical.Replacer()\
  .setInputCols("replacement","sentence")\
  .setOutputCol("obfuscated_sentence_name")\
  .setUseReplacement(True)

nlpPipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

## Results
text ='''
M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .
'''

result = model.transform(spark.createDataFrame([[text]]).toDF("text"))

from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.sentence.result,
                                     result.obfuscated_sentence_name.result)).alias("cols")) \
      .select(F.expr("cols['0']").alias("sentence"),
              F.expr("cols['1']").alias("obfuscated_sentence_name")).show(truncate=50)

| sentence                                          | obfuscated_sentence_name                          | 
| ------------------------------------------------- | ------------------------------------------------- |
| M.D .                                             | M.D .                                             |
| , Patient name : Michael , MR # 7194334 Date ...  | , Patient name : Ferrell , MR # 7194334 Date ...  |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Clifford , 25 years-old , Record date : ... |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| Analyzed by Dr. Jennifer .                        | Analyzed by Dr. Jennifer .                        |

val names = """Mitchell-NAME
Clifford-NAME
Jeremiah-NAME
Lawrence-NAME
Brittany-NAME
Patricia-NAME
Jennifer-NAME
Jackson-NAME
Leonard-NAME
Randall-NAME
Camacho-NAME
Ferrell-NAME
Mueller-NAME
Bowman-NAME
Hansen-NAME
"""
/*
with open("names_test2.txt", 'w') as file:
    file.write(names)
*/

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val nameChunkObfuscator = new NameChunkObfuscatorApproach()
    .setInputCols("ner_chunk")
    .setOutputCol("replacement")
    .setObfuscateRefFile("names_test2.txt")
    .setObfuscateRefSource("file")
    .setRefFileFormat("csv")
    .setRefSep("-")

val replacer_name = new Replacer()
    .setInputCols(Array("replacement","sentence"))
    .setOutputCol("obfuscated_sentence_name")
    .setUseReplacement(true)

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      nameChunkObfuscator,
      replacer_name))

val data = Seq("M.D . , Patient name : Michael  , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer  .").toDF("text")

val res = nlpPipeline.fit(data).transform(data)

| sentence                                          | obfuscated_sentence_name                          | 
| ------------------------------------------------- | ------------------------------------------------- |
| M.D .                                             | M.D .                                             |
| , Patient name : Michael , MR # 7194334 Date ...  | , Patient name : Ferrell , MR # 7194334 Date ...  |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Clifford , 25 years-old , Record date : ... |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| Analyzed by Dr. Jennifer .                        | Analyzed by Dr. Jennifer .                        |

NerChunker

Model

Extracts phrases that fit a known pattern over the NER tags. Useful for entity groups with neighboring tokens when there is no pretrained NER model that covers them directly. A regex over the entity labels needs to be provided to extract the tokens between entities.

Parameter:

  • setRegexParsers: Array of grammar based chunk parsers.
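
Several patterns can be supplied at once. As a minimal sketch (the entity labels must match those emitted by the upstream NER model; the second pattern is illustrative):

from johnsnowlabs import medical

# Illustrative: two grammar-based patterns in one NerChunker. Each pattern is a
# regex over the entity labels produced by the NER model in the pipeline.
chunker = medical.NerChunker()\
    .setInputCols(["sentence", "ner"])\
    .setOutputCol("ner_chunk")\
    .setRegexParsers([
        "<ImagingFindings>.*<BodyPart>",  # a finding followed by a body part
        "<Measurements>.*<Units>"         # a measurement followed by its unit (illustrative)
    ])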

Input Annotator Types: DOCUMENT, NAMED_ENTITY

Output Annotator Type: CHUNK

Python API: NerChunker Scala API: NerChunker Notebook: NerChunkerNotebook
Show Example

from johnsnowlabs import nlp, medical
# Defining pipeline stages for NER

documentAssembler= nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector= nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence") \
  .setUseAbbreviations(False)

tokenizer= nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence","token"]) \
  .setOutputCol("embeddings") \
  .setCaseSensitive(False)

ner = medical.NerModel.pretrained("ner_radiology", "en", "clinical/models") \
  .setInputCols(["sentence","token","embeddings"]) \
  .setOutputCol("ner") \
  .setIncludeConfidence(True)

# Define the NerChunker to combine the entities into chunks
chunker = medical.NerChunker() \
  .setInputCols(["sentence","ner"]) \
  .setOutputCol("ner_chunk") \
  .setRegexParsers(["<ImagingFindings>.*<BodyPart>"])

pipeline= nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner,
  chunker
])

data= spark.createDataFrame([["She has cystic cyst on her kidney."]]).toDF("text")
result = pipeline.fit(data).transform(data)

# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")\
      .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)
+------+-----------------+
|word  |ner              |
+------+-----------------+
|She   |O                |
|has   |O                |
|cystic|B-ImagingFindings|
|cyst  |I-ImagingFindings|
|on    |O                |
|her   |O                |
|kidney|B-BodyPart       |
|.     |O                |
+------+-----------------+

result.select("ner_chunk.result").show(truncate=False)
+---------------------------+
|result                     |
+---------------------------+
|[cystic cyst on her kidney]|
+---------------------------+
from johnsnowlabs import nlp, finance 
# Defining pipeline stages for NER

documentAssembler= nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector= nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence") 

tokenizer= nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")\
  .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'", '%', '&'])

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
  .setInputCols("sentence", "token") \
  .setOutputCol("embeddings")\
  .setMaxSentenceLength(512)\
  .setCaseSensitive(True)

ner_model = finance.NerModel.pretrained("finner_responsibility_reports_md", "en", "finance/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

# Define the NerChunker to combine the entities into chunks
chunker = finance.NerChunker() \
  .setInputCols(["sentence","ner"]) \
  .setOutputCol("ner_chunk") \
  .setRegexParsers(["<ENVIRONMENTAL_KPI>.*<AMOUNT>"])

pipeline= nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
])

data= spark.createDataFrame([["""The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021."""]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")\
      .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)

+---------+--------------------+
|word     |ner                 |
+---------+--------------------+
|The      |O                   |
|company  |O                   |
|has      |O                   |
|reduced  |O                   |
|its      |O                   |
|direct   |B-ENVIRONMENTAL_KPI |
|GHG      |I-ENVIRONMENTAL_KPI |
|emissions|I-ENVIRONMENTAL_KPI |
|from     |O                   |
|12,135   |B-AMOUNT            |
|million  |I-AMOUNT            |
|tonnes   |B-ENVIRONMENTAL_UNIT|
|of       |I-ENVIRONMENTAL_UNIT|
|CO2e     |I-ENVIRONMENTAL_UNIT|
|in       |O                   |
|2017     |B-DATE_PERIOD       |
|to       |O                   |
|4        |B-AMOUNT            |
|million  |I-AMOUNT            |
|tonnes   |B-ENVIRONMENTAL_UNIT|
+---------+--------------------+

result.select("ner_chunk.result").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million, indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million, scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal 
# Defining pipeline stages for NER

documentAssembler= nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector= nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer= nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

# Define the NerChunker to combine the entities into chunks
chunker = legal.NerChunker() \
  .setInputCols(["sentence","ner"]) \
  .setOutputCol("ner_chunk") \
  .setRegexParsers(["<PERSON>.*<ROLE>"])

pipeline= nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
])

data= spark.createDataFrame([["""Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"""]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")\
      .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)

+------------+--------+
|word        |ner     |
+------------+--------+
|Jeffrey     |B-PERSON|
|Preston     |I-PERSON|
|Bezos       |I-PERSON|
|is          |O       |
|an          |O       |
|American    |O       |
|entrepreneur|O       |
|,           |O       |
|founder     |B-ROLE  |
|and         |O       |
|CEO         |B-ROLE  |
|of          |O       |
|Amazon      |B-ORG   |
+------------+--------+

result.select("ner_chunk.result").show(truncate=False)

+--------------------------------------------------------------------+
|result                                                              |
+--------------------------------------------------------------------+
|[Jeffrey Preston Bezos is an American entrepreneur, founder and CEO]|
+--------------------------------------------------------------------+

import spark.implicits._

// Defining pipeline stages for NER
val data= Seq("She has cystic cyst on her kidney.").toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setUseAbbreviations(false)

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence","token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models")
  .setInputCols(Array("sentence","token","embeddings"))
  .setOutputCol("ner")
  .setIncludeConfidence(true)

// Define the NerChunker to combine the entities into chunks
val chunker = new NerChunker()
  .setInputCols(Array("sentence","ner"))
  .setOutputCol("ner_chunk")
  .setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner,
  chunker
))

val data = Seq(
  """She has cystic cyst on her kidney."""
).toDF("text")
val result = pipeline.fit(data).transform(data)

// Show results:

+------+-----------------+
|word  |ner              |
+------+-----------------+
|She   |O                |
|has   |O                |
|cystic|B-ImagingFindings|
|cyst  |I-ImagingFindings|
|on    |O                |
|her   |O                |
|kidney|B-BodyPart       |
|.     |O                |
+------+-----------------+

+---------------------------+
|result                     |
+---------------------------+
|[cystic cyst on her kidney]|
+---------------------------+
import spark.implicits._

// Defining pipeline stages for NER
val documentAssembler= new DocumentAssembler() 
  .setInputCol("text") 
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer= new Tokenizer() 
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") 
  .setInputCols(Array("sentence", "token")) 
  .setOutputCol("embeddings")

val ner_model = FinanceNerModel.pretrained("finner_responsibility_reports_md", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

// Define the NerChunker to combine the entities into chunks
val chunker = new NerChunker() 
  .setInputCols(Array("sentence","ner")) 
  .setOutputCol("ner_chunk") 
  .setRegexParsers(Array("<ENVIRONMENTAL_KPI>.*<AMOUNT>"))

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
))

val data = Seq(
  """The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021."""
).toDF("text")
val result = pipeline.fit(data).transform(data)

// Show results:

+---------+--------------------+
|word     |ner                 |
+---------+--------------------+
|The      |O                   |
|company  |O                   |
|has      |O                   |
|reduced  |O                   |
|its      |O                   |
|direct   |B-ENVIRONMENTAL_KPI |
|GHG      |I-ENVIRONMENTAL_KPI |
|emissions|I-ENVIRONMENTAL_KPI |
|from     |O                   |
|12,135   |B-AMOUNT            |
|million  |I-AMOUNT            |
|tonnes   |B-ENVIRONMENTAL_UNIT|
|of       |I-ENVIRONMENTAL_UNIT|
|CO2e     |I-ENVIRONMENTAL_UNIT|
|in       |O                   |
|2017     |B-DATE_PERIOD       |
|to       |O                   |
|4        |B-AMOUNT            |
|million  |I-AMOUNT            |
|tonnes   |B-ENVIRONMENTAL_UNIT|
+---------+--------------------+


+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million, indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million, scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
// Defining pipeline stages for NER

val documentAssembler= new DocumentAssembler() 
  .setInputCol("text") 
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer= new Tokenizer() 
  .setInputCols("sentence") 
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") 
  .setInputCols(Array("sentence", "token")) 
  .setOutputCol("embeddings")

val ner_model = LegalNerModel.pretrained("legner_org_per_role_date", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

// Define the NerChunker to combine the entities into chunks
val chunker = new NerChunker() 
  .setInputCols(Array("sentence","ner")) 
  .setOutputCol("ner_chunk") 
  .setRegexParsers(Array("<PERSON>.*<ROLE>"))

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
))

val data = Seq(
  """Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"""
).toDF("text")
val result = pipeline.fit(data).transform(data)

// Show results:

+------------+--------+
|word        |ner     |
+------------+--------+
|Jeffrey     |B-PERSON|
|Preston     |I-PERSON|
|Bezos       |I-PERSON|
|is          |O       |
|an          |O       |
|American    |O       |
|entrepreneur|O       |
|,           |O       |
|founder     |B-ROLE  |
|and         |O       |
|CEO         |B-ROLE  |
|of          |O       |
|Amazon      |B-ORG   |
+------------+--------+


+--------------------------------------------------------------------+
|result                                                              |
+--------------------------------------------------------------------+
|[Jeffrey Preston Bezos is an American entrepreneur, founder and CEO]|
+--------------------------------------------------------------------+

NerConverterInternal

Model

Converts an IOB or IOB2 representation of NER to a user-friendly one by associating the tokens of recognized entities with their labels. Chunks with no associated entity (tagged “O”) are filtered out.

Parameters:

  • setThreshold: Confidence threshold.

  • setWhiteList: If defined, list of entities to process.

  • setBlackList: If defined, list of entities to ignore.

  • setReplaceLabels: If defined, contains a dictionary for entity replacement.

  • setPreservePosition: Whether to preserve the original position of the tokens in the original document or use the modified tokens.

  • setReplaceDictResource: If defined, path to the file containing a dictionary for entity replacement.

  • setIgnoreStopWords: If defined, list of stop words to ignore.

  • setGreedyMode: (Boolean) Whether to ignore B tags for contiguous tokens of the same entity.

This licensed annotator adds extra functionality to the open-source version through the parameters blackList, greedyMode, threshold, and ignoreStopWords, which are not available in the NerConverter annotator (a minimal sketch of some of the licensed options follows below).

See also Inside–outside–beginning (tagging) for more information.
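
As a minimal sketch combining a few of the parameters above (the entity labels, the replacement mapping, and the threshold value are illustrative; the dict form of setReplaceLabels is assumed):

from johnsnowlabs import medical

# Illustrative: keep only DRUG and DOSAGE chunks, rename DRUG to MEDICATION,
# and drop chunks whose confidence falls below 0.5.
nerConverter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DRUG", "DOSAGE"])\
    .setReplaceLabels({"DRUG": "MEDICATION"})\
    .setThreshold(0.5)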

Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotator Type: CHUNK

Python API: NerConverterInternal Scala API: NerConverterInternal Notebook: Notebook
Show Example

from johnsnowlabs import * 
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
embeddings  = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model
nerModel = medical.NerModel.pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

# NER Converter
nerConverter = medical.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sample_text = """The patient was prescribed 1 capsule of Advil for 5 days.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, metformin 1000 mg two times a day.
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = model.transform(data)
result.select('text', 'ner.result', 'ner_chunk.result').show(truncate = 50)

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                            result|                                            result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|[O, O, O, O, B-DOSAGE, B-FORM, O, B-DRUG, B-DUR...|[1, capsule, Advil, for 5 days, 40 units, insul...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
from johnsnowlabs import * 
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")\
    .setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'", '%', '&'])

embeddings  = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# NER model
nerModel = finance.NerModel.pretrained("finner_responsibility_reports_md", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

# NER Converter
nerConverter = finance.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sample_text = """The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021.
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = model.transform(data)
result.select('text', 'ner.result', 'ner_chunk.result').show(truncate = 50)

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                            result|                                            result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The company has reduced its direct GHG emission...|[O, O, O, O, O, B-ENVIRONMENTAL_KPI, I-ENVIRONM...|[direct GHG emissions, 12,135 million, tonnes o...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
from johnsnowlabs import * 
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings  = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# NER model
nerModel = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

# NER Converter
nerConverter = legal.NerConverterInternal() \
   .setInputCols(["sentence", "token", "ner"]) \
   .setOutputCol("ner_chunk")

nlpPipeline = nlp.Pipeline(stages = [document_assembler,
                              sentence_detector,
                              tokenizer,
                              embeddings,
                              nerModel,
                              nerConverter
                              ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sample_text = """Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon
"""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = model.transform(data)
result.select('text', 'ner.result', 'ner_chunk.result').show(truncate = 50)

+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
|                                              text|                                            result|                                       result|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
|Jeffrey Preston Bezos is an American entreprene...|[B-PERSON, I-PERSON, I-PERSON, O, O, O, O, O, B...|[Jeffrey Preston Bezos, founder, CEO, Amazon]|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+

import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP 
val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document")

val sentence_detector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 
 
// Tokenizer splits words in a relevant format for NLP 
val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 
 
// Clinical word embeddings trained on PubMED dataset 
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 
 
// NER model 
val nerModel = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 
 
// NER Converter 
val nerConverter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 

val nlpPipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector, 
    tokenizer, 
    embeddings, 
    nerModel, 
    nerConverter )) 

val empty_data = Seq("") .toDF("text") 
val model = nlpPipeline.fit(empty_data) 

val sample_text = "The patient was prescribed 1 capsule of Advil for 5 days.He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night,12 units of insulin lispro with meals,metformin 1000 mg two times a day." 

val data = Seq(sample_text) .toDF("text") 
val result = model.transform(data)
result.select("text", "ner.result", "ner_chunk.result").show(20, 50)

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                            result|                                            result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|[O, O, O, O, B-DOSAGE, B-FORM, O, B-DRUG, B-DUR...|[1, capsule, Advil, for 5 days, 40 units, insul...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP 
val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

// Tokenizer splits words in a relevant format for NLP 
val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 
 .setContextChars(Array(".", ",", ";", ":", "!", "?", "*", "-", "(", ")", "\"", "'", "%", "&"))

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 

// NER model 
val nerModel = FinanceNerModel.pretrained("finner_responsibility_reports_md","en","finance/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 
 
// NER Converter 
val nerConverter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 

val nlpPipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer, 
    embeddings,
    nerModel, 
    nerConverter )) 

val empty_data = Seq("") .toDF("text") 
val model = nlpPipeline.fit(empty_data) 

val sample_text = "The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy,including electricity,heat,steam,and cooling,and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products,and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021." 
val data = Seq(sample_text) .toDF("text") 

val result = model.transform(data)
result.select("text", "ner.result", "ner_chunk.result").show(20, 50)

+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|                                              text|                                            result|                                            result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The company has reduced its direct GHG emission...|[O, O, O, O, O, B-ENVIRONMENTAL_KPI, I-ENVIRONM...|[direct GHG emissions, 12,135 million, tonnes o...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
import spark.implicits._

// Annotator that transforms a text column from dataframe into an Annotation ready for NLP 
val document_assembler = new DocumentAssembler()
 .setInputCol("text") 
 .setOutputCol("document") 

val sentence_detector = new SentenceDetector()
 .setInputCols(Array("document")) 
 .setOutputCol("sentence") 

// Tokenizer splits words in a relevant format for NLP 
val tokenizer = new Tokenizer()
 .setInputCols(Array("sentence")) 
 .setOutputCol("token") 

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
 .setInputCols(Array("sentence","token")) 
 .setOutputCol("embeddings") 
 
// NER model 
val nerModel = LegalNerModel.pretrained("legner_org_per_role_date","en","legal/models")
 .setInputCols(Array("sentence","token","embeddings")) 
 .setOutputCol("ner") 
 
// NER Converter 
val nerConverter = new NerConverterInternal()
 .setInputCols(Array("sentence","token","ner")) 
 .setOutputCol("ner_chunk") 

val nlpPipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector, 
    tokenizer, 
    embeddings, 
    nerModel, 
    nerConverter )) 

val empty_data = Seq("").toDF("text")
val model = nlpPipeline.fit(empty_data) 

val sample_text = "Jeffrey Preston Bezos is an American entrepreneur,founder and CEO of Amazon" 

val data = Seq(sample_text).toDF("text")
val result = model.transform(data)
result.select("text", "ner.result", "ner_chunk.result").show()

+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
|                                              text|                                            result|                                       result|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
|Jeffrey Preston Bezos is an American entreprene...|[B-PERSON, I-PERSON, I-PERSON, O, O, O, O, O, B...|[Jeffrey Preston Bezos, founder, CEO, Amazon]|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+

NerDisambiguator

Instantiated / pretrained model of the NerDisambiguator. Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms.

Parameters:

  • embeddingTypeParam: (String) ‘bow’ for word embeddings or ‘sentence’ for sentences.

  • numFirstChars: (Int) number of characters to be considered for initial prefix search in the knowledge base.

  • tokenSearch: (BooleanParam) mechanism of search in the knowledge base - by token or by chunk (token search is recommended; default value: True).

  • narrowWithApproximateMatching: (BooleanParam) narrow down the prefix search results with Levenshtein distance based matching (True is recommended).

  • levenshteinDistanceThresholdParam: (Float) value of the Levenshtein distance threshold to narrow results from prefix search (default value: 0.1).

  • nearMatchingGapParam: (Int) defines a limit on the string-length difference (candidate chunks are trimmed) during Levenshtein distance-based narrowing: len(candidate) - len(entity chunk) > nearMatchingGap (default value: 4).

  • predictionsLimit: (BooleanParam) limits the number of predictions to the top N predictions.

  • s3KnowledgeBaseName: (String) the name of the Knowledge Base in S3.

Input Annotator Types: CHUNK, SENTENCE_EMBEDDINGS

Output Annotator Type: DISAMBIGUATION

Python API: NerDisambiguatorModel Scala API: NerDisambiguatorModel Notebook: NerDisambiguatorModelNotebook
Show Example

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

sentence_embeddings = nlp.SentenceEmbeddings() \
    .setInputCols(["sentence","embeddings"]) \
    .setOutputCol("sentence_embeddings")

ner_model = medical.NerModel.pretrained() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["PER"])

disambiguator = medical.NerDisambiguator() \
    .setS3KnowledgeBaseName("i-per") \
    .setInputCols(["ner_chunk", "sentence_embeddings"]) \
    .setOutputCol("disambiguation") \
    .setTokenSearch(False)

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    ner_model,
    ner_converter,
    disambiguator])

text = """The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)


# Result
result.selectExpr("explode(disambiguation)") \
    .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, truncate=False)

+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk             |result                                                                                                                  |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump      |http://en.wikipedia.org/?curid=55907961, http://en.wikipedia.org/?curid=31698421, http://en.wikipedia.org/?curid=4848272|
|Christina Aguilera|http://en.wikipedia.org/?curid=6636454, http://en.wikipedia.org/?curid=144171                                           |
+------------------+------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence") 
  .setOutputCol("token") 

val word_embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embeddings") 

val sentence_embeddings = new SentenceEmbeddings()
  .setInputCols(Array("sentence","embeddings")) 
  .setOutputCol("sentence_embeddings") 

val ner_model = MedicalNerModel.pretrained()
  .setInputCols(Array("sentence","token","embeddings")) 
  .setOutputCol("ner") 

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 
  .setWhiteList(Array("PER")) 

val disambiguator = new NerDisambiguator()
  .setS3KnowledgeBaseName("i-per") 
  .setInputCols(Array("ner_chunk","sentence_embeddings")) 
  .setOutputCol("disambiguation") 
  .setTokenSearch(false)

val pipeline = new Pipeline().setStages(Array( 
                                              documentAssembler, 
                                              sentenceDetector, 
                                              tokenizer, 
                                              word_embeddings, 
                                              sentence_embeddings, 
                                              ner_model, 
                                              ner_converter, 
                                              disambiguator))
 
val text = "The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..." 

val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)

// Result 

+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk             |result                                                                                                                  |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump      |http://en.wikipedia.org/?curid=55907961, http://en.wikipedia.org/?curid=31698421, http://en.wikipedia.org/?curid=4848272|
|Christina Aguilera|http://en.wikipedia.org/?curid=6636454, http://en.wikipedia.org/?curid=144171                                           |
+------------------+------------------------------------------------------------------------------------------------------------------------+

Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. The model needs extracted CHUNK and SENTENCE_EMBEDDINGS type inputs, coming from e.g. NerConverter and SentenceEmbeddings.

Input Annotator Types: CHUNK, SENTENCE_EMBEDDINGS

Output Annotator Type: DISAMBIGUATION

Python API: NerDisambiguator Scala API: NerDisambiguator

NerModel

NerModel is the Named Entity Recognition (NER) annotator that allows you to train a generic NER model based on neural networks. The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art results on most datasets.

Note that some pretrained models require specific types of embeddings, depending on which embeddings they were trained with.

Parameters:

  • setBatchSize: (int) number of samples used in one iteration of training (Default: 32).

  • setIncludeConfidence: (Boolean) whether to include confidence scores in annotation metadata (Default: False).

  • setConfigProtoBytes: (int) ConfigProto from tensorflow, serialized into byte array.

  • setIncludeAllConfidenceScores: (Boolean) whether to include confidence scores for all tags rather than just for the predicted one.

  • setMinProbability: (Float) defines the minimum probability value.

For available pretrained models, please see the Models Hub. Additionally, pretrained pipelines are available for this module; see the Pipelines page. For extended examples of usage, see the Spark NLP Workshop.
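
The snippet below is a minimal sketch of how these parameters are set on a pretrained model (the model and embeddings names are taken from the example below; an upstream DocumentAssembler, SentenceDetector and Tokenizer producing the sentence and token columns are assumed):

from johnsnowlabs import nlp, medical

# The embeddings must match the ones the NER model was trained with.
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")\
    .setBatchSize(32)\
    .setIncludeConfidence(True)  # store a confidence score in the annotation metadata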

Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotator Type: NAMED_ENTITY

Python API: MedicalNerModel Scala API: MedicalNerModel Notebook: MedicalNerModelNotebook
Show Example

from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_ner_converter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("ner_chunk")

jsl_ner_pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    jsl_ner_converter])

text = '''
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl,  creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27.
'''
data = spark.createDataFrame([[text]]).toDF("text")

result = jsl_ner_pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)

+-----------------------------+----------------------------+
|chunk                        |ner_label                   |
+-----------------------------+----------------------------+
|28-year-old                  |Age                         |
|female                       |Gender                      |
|gestational diabetes mellitus|Diabetes                    |
|eight years prior            |RelativeDate                |
|type two diabetes mellitus   |Diabetes                    |
|T2DM                         |Diabetes                    |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder   |
|three years prior            |RelativeDate                |
|acute                        |Modifier                    |
|hepatitis                    |Disease_Syndrome_Disorder   |
|one-week                     |Duration                    |
|polyuria                     |Symptom                     |
|poor appetite                |Symptom                     |
|vomiting                     |Symptom                     |
|She                          |Gender                      |
|metformin                    |Drug_Ingredient             |
|glipizide                    |Drug_Ingredient             |
|dapagliflozin                |Drug_Ingredient             |
|T2DM                         |Diabetes                    |
|atorvastatin                 |Drug_Ingredient             |
|gemfibrozil                  |Drug_Ingredient             |
|HTG                          |Hyperlipidemia              |
|She                          |Gender                      |
|dapagliflozin                |Drug_Ingredient             |
|for six months               |Duration                    |
|dry oral mucosa              |Symptom                     |
|her                          |Gender                      |
|abdominal                    |External_body_part_or_region|
|tenderness                   |Symptom                     |
|guarding                     |Symptom                     |
|rigidity                     |Symptom                     |
|admission                    |Admission_Discharge         |
|serum glucose                |Test                        |
|111 mg/dl                    |Test_Result                 |
|creatinine                   |Test                        |
|0.4 mg/dL                    |Test_Result                 |
|triglycerides                |Triglycerides               |
|508 mg/dL                    |Test_Result                 |
|total cholesterol 122 mg/dL  |Total_Cholesterol           |
|venous pH                    |Test                        |
|7.27                         |Test_Result                 |
+-----------------------------+----------------------------+
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_sec_conll", "en", "finance/models") \
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])

text = '''December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender" '''

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)
+--------------------------+---------+
|chunk                     |ner_label|
+--------------------------+---------+
|SILICIUM DE PROVENCE S.A.S|ORG      |
|France                    |LOC      |
|Usine de Saint Auban      |LOC      |
|France                    |LOC      |
|Mr.Frank Wouters          |PER      |
|Borrower                  |PER      |
|EVERGREEN SOLAR INC       |ORG      |
|Delaware                  |LOC      |
|U.S.A                     |LOC      |
|Bartlett Street           |LOC      |
|Marlboro                  |LOC      |
|Massachusetts             |LOC      |
|U.S.A                     |LOC      |
|Richard Chleboski         |PER      |
|Lender                    |PER      |
+--------------------------+---------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\

ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter])
  
text = """EXCLUSIVE DISTRIBUTOR AGREEMENT (" Agreement ") dated as April 15, 1994 by and between IMRS OPERATIONS INC., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as " Developer ") and Delteq Pte Ltd, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD ) with its principal place of business at 215 Henderson Road , #101-03 Henderson Industrial Park , Singapore 0315 ( hereinafter referred to as " Distributor ")."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)

+-------------------------------+---------+
|chunk                          |ner_label|
+-------------------------------+---------+
|EXCLUSIVE DISTRIBUTOR AGREEMENT|DOC      |
|April 15, 1994                 |EFFDATE  |
|IMRS OPERATIONS INC            |PARTY    |
|Developer                      |ALIAS    |
|Delteq Pte Ltd                 |PARTY    |
|Distributor                    |ALIAS    |
+-------------------------------+---------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetector.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val jslNer = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("jsl_ner")

val jslNerConverter = new NerConverter()
  .setInputCols(Array("sentence", "token", "jsl_ner"))
  .setOutputCol("ner_chunk")

val jslNerPipeline = new Pipeline()
  .setStages(Array(documentAssembler, 
                   sentenceDetector, 
                   tokenizer, 
                   wordEmbeddings, 
                   jslNer, 
                   jslNerConverter))

val text = "A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl,  creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27."

val data = Seq(text).toDF("text")

val result = jslNerPipeline.fit(data).transform(data)

+-----------------------------+----------------------------+
|chunk                        |ner_label                   |
+-----------------------------+----------------------------+
|28-year-old                  |Age                         |
|female                       |Gender                      |
|gestational diabetes mellitus|Diabetes                    |
|eight years prior            |RelativeDate                |
|type two diabetes mellitus   |Diabetes                    |
|T2DM                         |Diabetes                    |
|HTG-induced pancreatitis     |Disease_Syndrome_Disorder   |
|three years prior            |RelativeDate                |
|acute                        |Modifier                    |
|hepatitis                    |Disease_Syndrome_Disorder   |
|one-week                     |Duration                    |
|polyuria                     |Symptom                     |
|poor appetite                |Symptom                     |
|vomiting                     |Symptom                     |
|She                          |Gender                      |
|metformin                    |Drug_Ingredient             |
|glipizide                    |Drug_Ingredient             |
|dapagliflozin                |Drug_Ingredient             |
|T2DM                         |Diabetes                    |
|atorvastatin                 |Drug_Ingredient             |
|gemfibrozil                  |Drug_Ingredient             |
|HTG                          |Hyperlipidemia              |
|She                          |Gender                      |
|dapagliflozin                |Drug_Ingredient             |
|for six months               |Duration                    |
|dry oral mucosa              |Symptom                     |
|her                          |Gender                      |
|abdominal                    |External_body_part_or_region|
|tenderness                   |Symptom                     |
|guarding                     |Symptom                     |
|rigidity                     |Symptom                     |
|admission                    |Admission_Discharge         |
|serum glucose                |Test                        |
|111 mg/dl                    |Test_Result                 |
|creatinine                   |Test                        |
|0.4 mg/dL                    |Test_Result                 |
|triglycerides                |Triglycerides               |
|508 mg/dL                    |Test_Result                 |
|total cholesterol 122 mg/dL  |Total_Cholesterol           |
|venous pH                    |Test                        |
|7.27                         |Test_Result                 |
+-----------------------------+----------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val nerModel = FinanceNerModel.pretrained("finner_sec_conll", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter))

val text = """December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender" """

val data = Seq((text)).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------------------------+---------+
|chunk                     |ner_label|
+--------------------------+---------+
|SILICIUM DE PROVENCE S.A.S|ORG      |
|France                    |LOC      |
|Usine de Saint Auban      |LOC      |
|France                    |LOC      |
|Mr.Frank Wouters          |PER      |
|Borrower                  |PER      |
|EVERGREEN SOLAR INC       |ORG      |
|Delaware                  |LOC      |
|U.S.A                     |LOC      |
|Bartlett Street           |LOC      |
|Marlboro                  |LOC      |
|Massachusetts             |LOC      |
|U.S.A                     |LOC      |
|Richard Chleboski         |PER      |
|Lender                    |PER      |
+--------------------------+---------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val nerModel = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter))

val text = """EXCLUSIVE DISTRIBUTOR AGREEMENT ("Agreement") dated as April 15, 1994 by and between IMRS OPERATIONS INC., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as "Developer") and Delteq Pte Ltd, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD) with its principal place of business at 215 Henderson Road, #101-03 Henderson Industrial Park, Singapore 0315 (hereinafter referred to as "Distributor")."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+-------------------------------+---------+
|chunk                          |ner_label|
+-------------------------------+---------+
|EXCLUSIVE DISTRIBUTOR AGREEMENT|DOC      |
|April 15, 1994                 |EFFDATE  |
|IMRS OPERATIONS INC            |PARTY    |
|Developer                      |ALIAS    |
|Delteq Pte Ltd                 |PARTY    |
|Distributor                    |ALIAS    |
+-------------------------------+---------+

This Named Entity Recognition annotator allows you to train a generic NER model based on neural networks.

The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art results on most datasets.

For instantiated/pretrained models, see NerDLModel.

The training data should be a labeled Spark Dataset in the CoNLL 2003 IOB format with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY. Excluding the label, these columns can be produced with, for example, a DocumentAssembler, a SentenceDetector, a Tokenizer and a WordEmbeddingsModel.

For extended examples of usage, see the Spark NLP Workshop (sections starting with Training a Clinical NER)
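
As a rough sketch of preparing such a training DataFrame (assuming the CoNLL reader from the open-source Spark NLP training module and a placeholder file path):

from sparknlp.training import CoNLL

# Reads IOB-formatted CoNLL data into a DataFrame with document, sentence,
# token, pos and label (NAMED_ENTITY) columns; the path below is a placeholder.
conll = CoNLL()
training_data = conll.readDataset(spark, "path/to/eng.train")
training_data.selectExpr("token.result", "label.result").show(3, truncate=80)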

Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotator Type: NAMED_ENTITY

Python API: MedicalNerApproach Scala API: MedicalNerApproach Notebook: MedicalNerApproachNotebook
Show Example

from johnsnowlabs import nlp, medical 

# First extract the prerequisites for the NerDLApproach
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")

clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")

# Then the training can start
nerTagger = medical.NerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(2)\
.setBatchSize(64)\
.setRandomSeed(0)\
.setVerbose(1)\
.setValidationSplit(0.2)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setOutputLogsPath('ner_logs')\
.setGraphFolder('medical_ner_graphs')\
.setEnableMemoryOptimizer(True) # if you have limited memory and a large CoNLL file, set this to True to train batch by batch

pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
clinical_embeddings,
nerTagger
])

# We use the text and labels from the CoNLL dataset
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

pipelineModel = pipeline.fit(trainingData)
from johnsnowlabs import nlp, finance

# First extract the prerequisites for the NerDLApproach
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")

clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")

# Then the training can start
nerTagger = finance.NerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(2)\
.setBatchSize(64)\
.setRandomSeed(0)\
.setVerbose(1)\
.setValidationSplit(0.2)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setOutputLogsPath('ner_logs')\
.setGraphFolder('medical_ner_graphs')\
.setEnableMemoryOptimizer(True) # if you have limited memory and a large CoNLL file, set this to True to train batch by batch

pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
clinical_embeddings,
nerTagger
])
from johnsnowlabs import nlp, legal

# First extract the prerequisites for the NerDLApproach
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")

clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")

# Then the training can start
nerTagger = legal.NerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(2)\
.setBatchSize(64)\
.setRandomSeed(0)\
.setVerbose(1)\
.setValidationSplit(0.2)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setOutputLogsPath('ner_logs')\
.setGraphFolder('medical_ner_graphs')\
.setEnableMemoryOptimizer(True) # if you have limited memory and a large CoNLL file, set this to True to train batch by batch

pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
clinical_embeddings,
nerTagger
])

import spark.implicits._

// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

// Then the training can start
val nerTagger = new MedicalNerApproach()
.setInputCols(Array("sentence", "token", "embeddings"))
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(5)
.setLr(0.003f)
.setBatchSize(8)
.setRandomSeed(0)
.setVerbose(1)
.setEvaluationLogExtended(false)
.setEnableOutputLogs(false)
.setIncludeConfidence(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))

// We use the text and labels from the CoNLL dataset
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

val pipelineModel = pipeline.fit(trainingData)
import spark.implicits._

// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

// Then the training can start
val nerTagger = new FinanceNerApproach()
.setInputCols(Array("sentence", "token", "embeddings"))
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(5)
.setLr(0.003f)
.setBatchSize(8)
.setRandomSeed(0)
.setVerbose(1)
.setEvaluationLogExtended(false)
.setEnableOutputLogs(false)
.setIncludeConfidence(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))
import spark.implicits._

// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

// Then the training can start
val nerTagger = new LegalNerApproach()
.setInputCols(Array("sentence", "token", "embeddings"))
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(5)
.setLr(0.003f)
.setBatchSize(8)
.setRandomSeed(0)
.setVerbose(1)
.setEvaluationLogExtended(false)
.setEnableOutputLogs(false)
.setIncludeConfidence(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))

NerQuestionGenerator

Model

NerQuestionGenerator takes an NER chunk (obtained by, e.g., NerConverterInternal) and generates questions based on two entity types, a pronoun and a strategy.

The question is generated in the form of [QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]. The generated question can be used by QuestionAnswerer or ZeroShotNer annotators to answer the question or find NER entities.

Parameters:

  • questionPronoun: Pronoun to be used in the question. E.g., ‘When’, ‘Where’, ‘Why’, ‘How’, ‘Who’, ‘What’.

  • strategyType: Strategy for the process, either Paired (default) or Combined.

  • questionMark: Whether to add a question mark at the end of the question.

  • entities1: List with the entity types of entities that appear first in the question.

  • entities2: List with the entity types of entities that appear second in the question.

All the parameters can be set using the corresponding set method in camel case. For example, .setQuestionMark(True).

Input Annotator Types: CHUNK

Output Annotator Type: DOCUMENT

Python API: NerQuestionGenerator Scala API: NerQuestionGenerator Notebook: NerQuestionGeneratorNotebook
Show Example

from johnsnowlabs import nlp, medical
import json

entities = [
    {
    "label": "Person",
    "patterns": ["Jon", "John", "John's"]
    },
    {
    "label": "Organization",
    "patterns": ["St. Mary's Hospital", "St. Mary's"]
    },
    {
        "label": "Condition",
        "patterns": ["vital signs", "heartbeat", "oxygen saturation levels"]
    }
]

with open('./entities.json', 'w') as jsonfile:
    json.dump(entities, jsonfile)


document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

entity_ruler = nlp.EntityRulerApproach() \
    .setInputCols(["document"]) \
    .setOutputCol("entity") \
    .setPatternsResource("./entities.json")\
    .setCaseSensitive(False)

qagenerator = medical.NerQuestionGenerator()\
    .setInputCols(["entity"])\
    .setOutputCol("question")\
    .setQuestionPronoun("How is")\
    .setEntities1(["Person"])\
    .setEntities2(["Condition"])\
    .setStrategyType("Paired")\
    .setQuestionMark(True)

prep_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    entity_ruler,
    qagenerator
])

example_text = """At St. Mary's Hospital, the healthcare team closely monitored John's vital signs with unwavering attention. They recorded his heartbeat and oxygen saturation levels, promptly addressing any deviations from normal. Their dedication and expertise at St. Mary's played a vital role in ensuring John's stability and fostering a swift recovery."""

df = spark.createDataFrame([[example_text]]).toDF("text")

result = prep_pipeline.fit(df).transform(df)

result.select("question").show(truncate=False)

## Result

+--------------------------------------------------------------------------------------------------------------------------------------------+
|question                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 62, 79, How is John's vital signs ?, {sentence -> 0}, []}, {document, 291, 134, How is John's heartbeat ?, {sentence -> 0}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

/* entities.json file
entities = [
    {
    "label": "Person",
    "patterns": ["Jon", "John", "John's"]
    },
    {
    "label": "Organization",
    "patterns": ["St. Mary's Hospital", "St. Mary's"]
    },
    {
        "label": "Condition",
        "patterns": ["vital signs", "heartbeat", "oxygen saturation levels"]
    }
]
*/

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val entity_ruler = new EntityRulerApproach()
    .setInputCols("document")
    .setOutputCol("entity")
    .setPatternsResource("./entities.json")
    .setCaseSensitive(false)

val qagenerator = new NerQuestionGenerator()
    .setInputCols("entity")
    .setOutputCol("question")
    .setQuestionPronoun("How is")
    .setEntities1("Person")
    .setEntities2("Condition")
    .setStrategyType("Paired")
    .setQuestionMark(true)

val prep_pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    entity_ruler, 
    qagenerator )) 

val test_data = Seq("""At St. Mary's Hospital, the healthcare team closely monitored John's vital signs with unwavering attention. They recorded his heartbeat and oxygen saturation levels, promptly addressing any deviations from normal. Their dedication and expertise at St. Mary's played a vital role in ensuring John's stability and fostering a swift recovery.""").toDF("text")

val res = prep_pipeline.fit(test_data).transform(test_data)

// Show results

+--------------------------------------------------------------------------------------------------------------------------------------------+
|question                                                                                                                                    |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 62, 79, How is John's vital signs ?, {sentence -> 0}, []}, {document, 291, 134, How is John's heartbeat ?, {sentence -> 0}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------+

QuestionAnswering

Model

QuestionAnswering is a GPT-based model for answering questions given a context. Unlike span-based models, it generates the answers to the questions rather than selecting phrases from the given context. The model is capable of answering various types of questions, including yes-no and full-text ones. Two question types are supported: "short" (producing yes/no/maybe answers) and "long" (producing full answers).

Parameters:

  • questionType: Question type, e.g. “short” or “long”. The question types depend on the model.

  • maxNewTokens: Maximum number of new tokens to generate, by default 30.

  • maxContextLength: Maximum length of context text

  • configProtoBytes: ConfigProto from tensorflow, serialized into byte array.

  • doSample: Whether or not to use sampling; use greedy decoding otherwise, by default False

  • topK: The number of highest probability vocabulary tokens to consider, by default 1

  • noRepeatNgramSize: The number of tokens that can’t be repeated in the same order. Useful for preventing loops. The default is 0.

  • ignoreTokenIds: A list of token ids which are ignored in the decoder’s output, by default []

  • randomSeed: Set to positive integer to get reproducible results, by default None.

  • customPrompt: Custom prompt template. Available variables are {QUESTION} and {CONTEXT}.

Available models can be found at the Models Hub

For more extended examples on document pre-processing, see the Spark NLP Workshop.
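
As a minimal sketch of the "short" question type (reusing the model name and setters from the medical example below), a yes/no/maybe configuration might look like this:

from johnsnowlabs import nlp, medical

document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

# "short" mode is described above as producing yes/no/maybe style answers;
# the model name is taken from the medical example below.
short_qa = medical.MedicalQuestionAnswering.pretrained("medical_qa_biogpt", "en", "clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setQuestionType("short")\
    .setMaxNewTokens(5)

pipeline = nlp.Pipeline(stages=[document_assembler, short_qa])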

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: MedicalQuestionAnswering Scala API: MedicalQuestionAnswering
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

med_qa = medical.MedicalQuestionAnswering.pretrained("medical_qa_biogpt","en","clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setMaxNewTokens(30)\
    .setTopK(1)\
    .setQuestionType("long") # "short"

pipeline = nlp.Pipeline(stages=[document_assembler, med_qa])

paper_abstract = [
    "In patients with Los Angeles (LA) grade C or D oesophagitis, a positive relationship has been established between the duration of intragastric acid suppression and healing.AIM: To determine whether there is an apparent optimal time of intragastric acid suppression for maximal healing of reflux oesophagitis. Post hoc analysis of data from a proof-of-concept, double-blind, randomized study of 134 adult patients treated with esomeprazole (10 or 40 mg od for 4 weeks) for LA grade C or D oesophagitis. A curve was fitted to pooled 24-h intragastric pH (day 5) and endoscopically assessed healing (4 weeks) data using piecewise quadratic logistic regression. Maximal reflux oesophagitis healing rates were achieved when intragastric pH>4 was achieved for approximately 50-70% (12-17 h) of the 24-h period. Acid suppression above this threshold did not yield further increases in healing rates."
]

question = ["Is there an optimal time of acid suppression for maximal healing?"]

data = spark.createDataFrame([ [paper_abstract[0],  question[0]] ]).toDF("context","question")

data.show(truncate = 60)

+------------------------------------------------------------+------------------------------------------------------------+
|                                                     context|                                                    question|
+------------------------------------------------------------+------------------------------------------------------------+
|In patients with Los Angeles (LA) grade C or D oesophagit...|Is there an optimal time of acid suppression for maximal ...|
+------------------------------------------------------------+------------------------------------------------------------+

result = pipeline.fit(data).transform(data)

result.selectExpr("document_question.result as Question", "answer.result as Long_Answer").show(truncate=False)

+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|Question                                                           |Long_Answer                                                                                                                                          |
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|[Is there an optimal time of acid suppression for maximal healing?]|[in patients with reflux oesophagitis, maximal healing rates are obtained when intragastric pH is achieved for approximately 50 - 70 % ( 12 - 17 h )]|
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

fin_qa = finance.QuestionAnswering.pretrained("finqa_flant5_finetuned","en","finance/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt("question: {QUESTION} context: {CONTEXT}")\
    .setMaxNewTokens(100)\
    .setOutputCol("answer")

pipeline = nlp.Pipeline(stages=[document_assembler, fin_qa])

context = """EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""

questions = ["""Which company is referred to as 'Principal' in the Distributor Agreement?""",
             """What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?""",
             """What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"""]

data = spark.createDataFrame(
    [
        [context,  questions[0]],
        [context,  questions[1]],
        [context,  questions[2]],
    ]
).toDF("context","question")

data.show(truncate = 80)

+------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                          question|                                                                         context|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+
|   What are the key components of the business strategy described?|Our business strategy has been to develop data processing and product technol...|
|What is the immediate strategy for scaling the IntentKey platform?|Our business strategy has been to develop data processing and product technol...|
|How does the company aim to provide differentiation in the market?|Our business strategy has been to develop data processing and product technol...|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+

result = pipeline.fit(data).transform(data)

result.select('question', 'answer.result').show(truncate=False)

+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question                                                          |result                                                                                                                                                                                                                                                                                                            |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|What are the key components of the business strategy described?   |[The key components of the business strategy described are proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. . . ]|
|What is the immediate strategy for scaling the IntentKey platform?|[The immediate strategy for scaling the IntentKey platform is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. ]                                               |
|How does the company aim to provide differentiation in the market?|[The company aims to provide differentiation through the AI analytics and data products they own and protect through patents. ]                                                                                                                                                                                   |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

context = ["""EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""]

questions = ["""Which company is referred to as 'Principal' in the Distributor Agreement?""",
             """What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?""",
             """What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"""]

data = spark.createDataFrame(
    [
        [context[0],  questions[0]],
        [context[0],  questions[1]],
        [context[0],  questions[2]],
    ]
).toDF("context","question")

data.show(truncate = 80)

+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                                         context|                                                                        question|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|       Which company is referred to as 'Principal' in the Distributor Agreement?|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the date of the distributor agreement between Co-Diagnostics, Inc. an...|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the Territory in which the Distributor has non-exclusive rights to se...|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+

document_assembler = nlp.MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

leg_qa = legal.QuestionAnswering.pretrained("legqa_flant5_finetuned","en","legal/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt("question: {QUESTION} context: {CONTEXT}")\
    .setMaxNewTokens(40)\
    .setTopK(3)\
    .setOutputCol("answer")

pipeline = nlp.Pipeline(stages=[document_assembler, leg_qa])

result = pipeline.fit(data).transform(data)

result.selectExpr("document_question.result as Question", "answer.result as Answer").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|Question                                                                                                                          |Answer                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|[Which company is referred to as 'Principal' in the Distributor Agreement?]                                                       |[Co-Diagnostics, Inc. is referred to as 'Principal' in the Distributor Agreement. ]                                                     |
|[What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?]                  |[The date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc. is the 19th day of March, 2020. ]|
|[What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?]|[The Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement is Romania. ]  |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val medQA = MedicalQuestionAnswering.pretrained("medical_qa_biogpt", "en", "clinical/models")
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")
  .setMaxNewTokens(30)
  .setTopK(1)
  .setQuestionType("long") // "short"

val pipeline = new Pipeline().setStages(Array(
                                              documentAssembler, 
                                              medQA))

val paperAbstract = "In patients with Los Angeles (LA) grade C or D oesophagitis, a positive relationship has been established between the duration of intragastric acid suppression and healing.AIM: To determine whether there is an apparent optimal time of intragastric acid suppression for maximal healing of reflux oesophagitis. Post hoc analysis of data from a proof-of-concept, double-blind, randomized study of 134 adult patients treated with esomeprazole (10 or 40 mg od for 4 weeks) for LA grade C or D oesophagitis. A curve was fitted to pooled 24-h intragastric pH (day 5) and endoscopically assessed healing (4 weeks) data using piecewise quadratic logistic regression. Maximal reflux oesophagitis healing rates were achieved when intragastric pH>4 was achieved for approximately 50-70% (12-17 h) of the 24-h period. Acid suppression above this threshold did not yield further increases in healing rates."

val question = "Is there an optimal time of acid suppression for maximal healing?"

val data = Seq((paperAbstract, question)).toDF("context", "question")

+------------------------------------------------------------+------------------------------------------------------------+
|                                                     context|                                                    question|
+------------------------------------------------------------+------------------------------------------------------------+
|In patients with Los Angeles (LA) grade C or D oesophagit...|Is there an optimal time of acid suppression for maximal ...|
+------------------------------------------------------------+------------------------------------------------------------+

val result = pipeline.fit(data).transform(data)


+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|Question                                                           |Long_Answer                                                                                                                                          |
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|[Is there an optimal time of acid suppression for maximal healing?]|[in patients with reflux oesophagitis, maximal healing rates are obtained when intragastric pH is achieved for approximately 50 - 70 % ( 12 - 17 h )]|
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val finQa = FinanceQuestionAnswering.pretrained("finqa_flant5_finetuned", "en", "finance/models")
  .setInputCols(Array("document_question", "document_context"))
  .setCustomPrompt("question: {QUESTION} context: {CONTEXT}")
  .setMaxNewTokens(100)
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(documentAssembler, finQa))

val context = """EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""

val questions = Seq(
  "Which company is referred to as 'Principal' in the Distributor Agreement?",
  "What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?",
  "What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"
)

val data = questions.map(q => (context, q)).toDF("context", "question")

+------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                          question|                                                                         context|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+
|   What are the key components of the business strategy described?|Our business strategy has been to develop data processing and product technol...|
|What is the immediate strategy for scaling the IntentKey platform?|Our business strategy has been to develop data processing and product technol...|
|How does the company aim to provide differentiation in the market?|Our business strategy has been to develop data processing and product technol...|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+

val result = pipeline.fit(data).transform(data)

+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question                                                          |result                                                                                                                                                                                                                                                                                                            |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|What are the key components of the business strategy described?   |[The key components of the business strategy described are proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. . . ]|
|What is the immediate strategy for scaling the IntentKey platform?|[The immediate strategy for scaling the IntentKey platform is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. ]                                               |
|How does the company aim to provide differentiation in the market?|[The company aims to provide differentiation through the AI analytics and data products they own and protect through patents. ]                                                                                                                                                                                   |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
val context = Seq("""EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""
)

val questions = Seq(
  """Which company is referred to as 'Principal' in the Distributor Agreement?""",
  """What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?""",
  """What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"""
)

val data = context.flatMap(c => questions.map(q => (c, q))).toDF("context", "question")


+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|                                                                         context|                                                                        question|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|       Which company is referred to as 'Principal' in the Distributor Agreement?|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the date of the distributor agreement between Co-Diagnostics, Inc. an...|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the Territory in which the Distributor has non-exclusive rights to se...|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new MultiDocumentAssembler()
  .setInputCols(Array("question", "context"))
  .setOutputCols(Array("document_question", "document_context"))

val legQA = LegalQuestionAnswering.pretrained("legqa_flant5_finetuned", "en", "legal/models")
  .setInputCols(Array("document_question", "document_context"))
  .setCustomPrompt("question: {QUESTION} context: {CONTEXT}")
  .setMaxNewTokens(40)
  .setTopK(3)
  .setOutputCol("answer")

val pipeline = new Pipeline().setStages(Array(documentAssembler, legQA))

val result = pipeline.fit(data).transform(data)

+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|Question                                                                                                                          |Answer                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|[Which company is referred to as 'Principal' in the Distributor Agreement?]                                                       |[Co-Diagnostics, Inc. is referred to as 'Principal' in the Distributor Agreement. ]                                                     |
|[What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?]                  |[The date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc. is the 19th day of March, 2020. ]|
|[What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?]|[The Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement is Romania. ]  |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+

RENerChunksFilter

Model

The RENerChunksFilter annotator filters desired relation pairs (defined by the parameter relationPairs) and stores them in the output column. Filtering the possible relations can be useful to perform additional analysis for a specific use case (e.g., checking adverse drug reactions and drug relations), and the filtered chunks can then serve as input for further analysis with a pretrained RelationExtractionDLModel.

Parameters:

  • maxSyntacticDistance (Int): Maximum syntactic distance between a pair of named entities to consider them as a relation. Increasing this value will increase recall, but also increase the number of false positives.

  • relationPairs (List[String]): List of dash-separated pairs of named entities. For example, [“Biomarker-RelativeDay”] will process all relations between entities of type “Biomarker” and “RelativeDay”.

  • relationPairsCaseSensitive (Boolean): Determines whether relation pairs are case sensitive.

For example, the ner_clinical NER model can identify PROBLEM, TEST, and TREATMENT entities. Using the RENerChunksFilter, one can keep only the relations between PROBLEM and TREATMENT entities, removing relations between any other entity pairs, in order to further analyze the associations between clinical problems and treatments.
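
For instance, a filter restricted to problem/treatment pairs could be configured as in the minimal sketch below (the column names and the distance value are illustrative; complete pipelines are shown in the examples that follow):

from johnsnowlabs import medical

# Illustrative configuration: keep only PROBLEM-TREATMENT candidate pairs,
# assuming an upstream pipeline already produces "ner_chunks" (CHUNK) and
# "dependencies" (DEPENDENCY) columns.
problem_treatment_filter = medical.RENerChunksFilter()\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(4)\
    .setRelationPairs(["problem-treatment", "treatment-problem"])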

Input Annotator Types: CHUNK, DEPENDENCY

Output Annotator Type: CHUNK

Python API: RENerChunksFilter Scala API: RENerChunksFilter Notebook: RENerChunksFilter
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")

ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade, ade-drug"])

ade_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setPredictionThreshold(0.5)\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)

+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end|   chunk1|entity1|entity2_begin|entity2_end|                                                   chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|       0|           25|         32| naproxen|   DRUG|          137|        148|                                             tense bullae|    ADE|       1| 0.9989047|
|       0|           25|         32| naproxen|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1| 0.9989704|
|       0|           87|         95|oxaprozin|   DRUG|          137|        148|                                             tense bullae|    ADE|       1|0.99895453|
|       0|           87|         95|oxaprozin|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = finance.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_dates")

ner_converter_date = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_dates"])\
    .setOutputCol("ner_chunk_date")

ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_orgs")

ner_converter_org = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_orgs"])\
    .setOutputCol("ner_chunk_org")\

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols('ner_chunk_org', "ner_chunk_date")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE"])\
    .setMaxSyntacticDistance(10)

reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relation")\
    .setPredictionThreshold(0.1)

pipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model_date,
        ner_converter_date,
        ner_model_org,
        ner_converter_org,
        chunk_merger,
        pos,
        dependency_parser,
        re_filter,
        reDL])

text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relation.metadata, result.relation.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)

+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                 chunk1|entity1|entity2_begin|entity2_end|         chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|       0|           16|         22|                Cadence|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|0.99687237|
|       0|           66|         80|        AWR Corporation|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|  0.993112|
|       0|           94|        116|Integrand Software, Inc|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date| 0.9741451|
|       0|           66|         80|        AWR Corporation|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|  0.997124|
|       0|           94|        116|Integrand Software, Inc|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|0.99910504|
|       0|           94|        116|Integrand Software, Inc|    ORG|           66|         80|AWR Corporation|    ORG|     was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

text_splitter = legal.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

pos_tagger = nlp.PerceptronModel().pretrained() \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

dependency_parser = nlp.DependencyParserModel() \
    .pretrained("dependency_conllu", "en") \
    .setInputCols(["sentence", "pos_tags", "token"]) \
    .setOutputCol("dependencies")

re_filter = legal.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(['PARTY-ALIAS', 'DOC-PARTY', 'DOC-EFFDATE'])

re_model = legal.RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")\
    .setPredictionThreshold(0.1)\
    .setInputCols(["re_ner_chunks", "sentence"])\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        pos_tagger,
        dependency_parser,
        re_filter,
        re_model
        ])

text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)

+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
|sentence|entity1_begin|entity1_end|                         chunk1|entity1|entity2_begin|entity2_end|           chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
|       0|            5|         35|INTELLECTUAL PROPERTY AGREEMENT|    DOC|           69|         85|December 31, 2018|EFFDATE| dated_as| 0.9856822|
|       0|          141|        163|        Armstrong Flooring, Inc|  PARTY|          192|        197|           Seller|  ALIAS|has_alias|0.89620054|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+

import spark.implicits._

val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val wordsEmbedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val posTagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos_tags")

val nerTagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_tags")

val nerChunker = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_tags"))
  .setOutputCol("ner_chunks")

val dependencyParser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos_tags", "token"))
  .setOutputCol("dependencies")

val adeReNerChunkFilter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunks", "dependencies"))
  .setOutputCol("re_ner_chunks")
  .setMaxSyntacticDistance(10)
  .setRelationPairs(Array("drug-ade", "ade-drug"))

val adeReModel = RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models")
  .setInputCols(Array("re_ner_chunks", "sentences"))
  .setPredictionThreshold(0.5)
  .setOutputCol("relations")

val pipeline = new Pipeline()
  .setStages(Array(
    documenter,
    sentencer,
    tokenizer,
    wordsEmbedder,
    posTagger,
    nerTagger,
    nerChunker,
    dependencyParser,
    adeReNerChunkFilter,
    adeReModel
  ))

val text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end|   chunk1|entity1|entity2_begin|entity2_end|                                                   chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|       0|           25|         32| naproxen|   DRUG|          137|        148|                                             tense bullae|    ADE|       1| 0.9989047|
|       0|           25|         32| naproxen|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1| 0.9989704|
|       0|           87|         95|oxaprozin|   DRUG|          137|        148|                                             tense bullae|    ADE|       1|0.99895453|
|       0|           87|         95|oxaprozin|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter() 
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model_date = FinanceNerModel.pretrained("finner_sec_dates", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_dates")

val ner_converter_date = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_dates"))
  .setOutputCol("ner_chunk_date")

val ner_model_org = FinanceNerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_orgs")

val ner_converter_org = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_orgs"))
  .setOutputCol("ner_chunk_org")

val chunk_merger = new ChunkMergeApproach()
  .setInputCols(Array("ner_chunk_org", "ner_chunk_date"))
  .setOutputCol("ner_chunk")

val pos = PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos", "token"))
  .setOutputCol("dependencies")

val re_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunk")
  .setRelationPairs(Array("ORG-ORG", "ORG-DATE"))
  .setMaxSyntacticDistance(10)

val reDL = RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relation")
  .setPredictionThreshold(0.1)

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    text_splitter,
    tokenizer,
    embeddings,
    ner_model_date,
    ner_converter_date,
    ner_model_org,
    ner_converter_org,
    chunk_merger,
    pos,
    dependency_parser,
    re_filter,
    reDL
  ))

val text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                 chunk1|entity1|entity2_begin|entity2_end|         chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|       0|           16|         22|                Cadence|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|0.99687237|
|       0|           66|         80|        AWR Corporation|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|  0.993112|
|       0|           94|        116|Integrand Software, Inc|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date| 0.9741451|
|       0|           66|         80|        AWR Corporation|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|  0.997124|
|       0|           94|        116|Integrand Software, Inc|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|0.99910504|
|       0|           94|        116|Integrand Software, Inc|    ORG|           66|         80|AWR Corporation|    ORG|     was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
  .setMaxSentenceLength(512)

val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pos_tagger = PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos_tags")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos_tags", "token"))
  .setOutputCol("dependencies")

val re_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunks")
  .setMaxSyntacticDistance(10)
  .setRelationPairs(Array("PARTY-ALIAS", "DOC-PARTY", "DOC-EFFDATE"))

val re_model = RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")
  .setPredictionThreshold(0.1)
  .setInputCols(Array("re_ner_chunks", "sentence"))
  .setOutputCol("relations")

val pipeline = new Pipeline()
  .setStages(Array(
    document_assembler,
    text_splitter,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    pos_tagger,
    dependency_parser,
    re_filter,
    re_model
  ))

val text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
|sentence|entity1_begin|entity1_end|                         chunk1|entity1|entity2_begin|entity2_end|           chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
|       0|            5|         35|INTELLECTUAL PROPERTY AGREEMENT|    DOC|           69|         85|December 31, 2018|EFFDATE| dated_as| 0.9856822|
|       0|          141|        163|        Armstrong Flooring, Inc|  PARTY|          192|        197|           Seller|  ALIAS|has_alias|0.89620054|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+

ReIdentification

Model

This annotator re-identifies entities that were masked or obfuscated by DeIdentification. It requires the outputs of the de-identification step as input: the de-identified document and the deidentification mappings column set with DeIdentification.setMappingsColumn.
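
In practice, this means passing the mappings column (named "aux" by default) together with the de-identified document column, as in the minimal sketch below; complete pipelines are shown in the examples that follow.

from johnsnowlabs import medical

# Minimal sketch: "deid_result" stands for the output of a DeIdentification pipeline
# run with setReturnEntityMappings(True), so it already contains the default "aux"
# mappings column next to the "deidentified" document column.
reIdentification = medical.ReIdentification()\
    .setInputCols(["aux", "deidentified"])\
    .setOutputCol("original")

original = reIdentification.transform(deid_result)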

Input Annotator Types: DOCUMENT,CHUNK

Output Annotator Type: DOCUMENT

Python API: ReIdentification Scala API: ReIdentification Notebook: ReIdentificationNotebook
Show Example

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

deidentification = medical.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setReturnEntityMappings(True) # return a new column to save the mappings between the masked/obfuscated entities and the original entities.
    #.setMappingsColumn("MappingCol") # change the name of the column, 'aux' is default

pipeline = nlp.Pipeline(stages=[
      documentAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,
      clinical_ner,
      ner_converter,
      deidentification])

text = """
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora ,
MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
"""
data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.sentence.result, 
                                     result.deidentified.result)).alias("cols"))\
      .select(F.expr("cols['0']").alias("sentence"), 
              F.expr("cols['1']").alias("deidentified")).show(truncate = False)

+-----------------------------------------------------------------------+-------------------------------------------------------+
|sentence                                                               |deidentified                                           |
+-----------------------------------------------------------------------+-------------------------------------------------------+
|Record date : 2093-01-13 , David Hale , M.D .                          |Record date : <DATE> , <NAME> , M.D .                  |
|, Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .              |, Name : <NAME> , MR # <ID> Date : <DATE> .            |
|PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .             |PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .|<LOCATION> , <LOCATION> , Phone <CONTACT> .            |
+-----------------------------------------------------------------------+-------------------------------------------------------+

reIdentification = medical.ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

reid_result = reIdentification.transform(result)

reid_result.select('original.result').show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Record date : 2093-01-13 , David Hale , M.D ., , Name : Hendrickson Ora ,MR # 7194334 Date : 01/13/93 ., PCP : Oliveira , 25 years-old , Record date : 2079-11-09 ., Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained('finner_sec_10k_summary', 'en', 'finance/models')\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

deidentification = finance.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setReturnEntityMappings(True) # return a new column to save the mappings between the masked/obfuscated entities and the original entities. Required for "ReIdentification"
    #.setMappingsColumn("MappingCol") # change the name of the column, 'aux' is default

pipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      ner_model,
      ner_converter,
      deidentification])

text= """
Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("deidentified.result").show(truncate = False)

+-------------------------------------------------------------------------------------------------------------+
|result                                                                                                       |
+-------------------------------------------------------------------------------------------------------------+
|[Commission file number <CFN> 
_____________________________________
 
<ORG>., (Exact name of registrant as specified in its charter)
____________________________________ 
<STATE>
 
<IRS>
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S., EmployerIdentification No., )
<ADDRESS>
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
<PHONE> 
(Registrants Telephone Number, including Area Code) 
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
<TITLE_CLASS>, <TITLE_CLASS_VALUE> par value per share
<TICKER>
<STOCK_EXCHANGE>
Securities registered pursuant to Section 12, (g) of the Act:]|
+-------------------------------------------------------------------------------------------------------------+

reIdentification = finance.ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

reid_result = reIdentification.transform(result)

reid_result.select('original.result').show(truncate=False)

+---------------------------------------------------------------------------------------------------+
|result                                                                                             |
+---------------------------------------------------------------------------------------------------+
|[Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC., (Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S., EmployerIdentification No., )
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)<(408)
-943-1234
(Registrants Telephone Number, including Area Code) 
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12, (g) of the Act:]|
+---------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties_lg", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"}) # "ALIAS" are secondary names of companies, so let's extract them also as PARTY

deidentification = legal.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setReturnEntityMappings(True) # return a new column to save the mappings between the masked/obfuscated entities and the original entities. Required for "ReIdentification"
    #.setMappingsColumn("MappingCol") # change the name of the column, 'aux' is default

pipeline = nlp.Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      deidentification])

text = """THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware"""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("deidentified.result").show(truncate = False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS <DOC> ("Agreement") is made and entered into as of <EFFDATE> , by and between <PARTY>. a limited liability company organized and existing under the laws of the State of Delaware]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

reIdentification = legal.ReIdentification()\
    .setInputCols(["aux","deidentified"])\
    .setOutputCol("original")

reid_result = reIdentification.transform(result)

reid_result.select('original.result').show(truncate=False)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val clinicalNer = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val deidentification = new DeIdentification()
  .setInputCols(Array("sentence", "token", "ner_chunk"))
  .setOutputCol("deidentified")
  .setMode("mask")
  .setReturnEntityMappings(true)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    wordEmbeddings,
    clinicalNer,
    nerConverter,
    deidentification
  ))

val text = """
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora ,
MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
"""

val data = Seq((text)).toDF("text")

val result = pipeline.fit(data).transform(data)

+-----------------------------------------------------------------------+-------------------------------------------------------+
|sentence                                                               |deidentified                                           |
+-----------------------------------------------------------------------+-------------------------------------------------------+
|Record date : 2093-01-13 , David Hale , M.D .                          |Record date : <DATE> , <NAME> , M.D .                  |
|, Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .              |, Name : <NAME> , MR # <ID> Date : <DATE> .            |
|PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .             |PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .|<LOCATION> , <LOCATION> , Phone <CONTACT> .            |
+-----------------------------------------------------------------------+-------------------------------------------------------+

val reIdentification = new ReIdentification()
  .setInputCols(Array("aux", "deidentified"))
  .setOutputCol("original")

val reidResult = reIdentification.transform(result)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Record date : 2093-01-13 , David Hale , M.D ., , Name : Hendrickson Ora ,MR # 7194334 Date : 01/13/93 ., PCP : Oliveira , 25 years-old , Record date : 2079-11-09 ., Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val nerModel = FinanceNerModel.pretrained("finner_sec_10k_summary", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val deidentification = new DeIdentification()
  .setInputCols(Array("sentence", "token", "ner_chunk"))
  .setOutputCol("deidentified")
  .setMode("mask")
  .setReturnEntityMappings(true)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    nerModel,
    nerConverter,
    deidentification
  ))

val text = "Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC. 
(Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
(408)
-943-1234 
(Registrant’s Telephone Number, including Area Code) 
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+-------------------------------------------------------------------------------------------------------------+
|result                                                                                                       |
+-------------------------------------------------------------------------------------------------------------+
|[Commission file number <CFN> 
_____________________________________
 
<ORG>., (Exact name of registrant as specified in its charter)
____________________________________ 
<STATE>
 
<IRS>
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S., EmployerIdentification No., )
<ADDRESS>
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)
<PHONE> 
(Registrants Telephone Number, including Area Code) 
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
<TITLE_CLASS>, <TITLE_CLASS_VALUE> par value per share
<TICKER>
<STOCK_EXCHANGE>
Securities registered pursuant to Section 12, (g) of the Act:]|
+-------------------------------------------------------------------------------------------------------------+

val reIdentification = new ReIdentification()
  .setInputCols(Array("aux", "deidentified"))
  .setOutputCol("original")

val reidResult = reIdentification.transform(result)

+---------------------------------------------------------------------------------------------------+
|result                                                                                             |
+---------------------------------------------------------------------------------------------------+
|[Commission file number 000-15867 
_____________________________________
 
CADENCE DESIGN SYSTEMS, INC., (Exact name of registrant as specified in its charter)
____________________________________ 
Delaware
 
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
 
(I.R.S., EmployerIdentification No., )
2655 Seely Avenue, Building 5,
San Jose,
California
 
95134
(Address of Principal Executive Offices)
 
(Zip Code)<(408)
-943-1234
(Registrants Telephone Number, including Area Code) 
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12, (g) of the Act:]|
+---------------------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val legalNer = LegalNerModel.pretrained("legner_contract_doc_parties_lg", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setReplaceLabels(Map("ALIAS" -> "PARTY"))

val deidentification = new DeIdentification()
  .setInputCols(Array("sentence", "token", "ner_chunk"))
  .setOutputCol("deidentified")
  .setMode("mask")
  .setReturnEntityMappings(true)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    legalNer,
    nerConverter,
    deidentification
  ))

val text = "THIS STRATEGIC ALLIANCE AGREEMENT (\"Agreement\") is made and entered into as of December 14, 2016, by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware"

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS <DOC> ("Agreement") is made and entered into as of <EFFDATE> , by and between <PARTY>. a limited liability company organized and existing under the laws of the State of Delaware]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

val reIdentification = new ReIdentification()
  .setInputCols(Array("aux", "deidentified"))
  .setOutputCol("original")

val reidResult = reIdentification.transform(result)

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

RegexMatcherInternal

Model

The RegexMatcherInternal class matches a set of regular expressions against text and labels each match with a provided entity. It is used to associate specific patterns in text with predetermined entities, such as dates mentioned in the text.

The class lets users define rules as regular expressions paired with entities, offering flexibility in customization. These rules can either be set directly with the setRules method, together with a delimiter, or loaded from an external file with the setExternalRules method.

Additionally, users can specify parameters such as the matching strategy (MATCH_FIRST, MATCH_ALL, or MATCH_COMPLETE) to control how matches are handled. The input annotation type is DOCUMENT and the output annotation type is CHUNK. This makes the class a versatile tool for entity recognition based on user-defined patterns.

A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be "\\d{4}\\/\\d\\d\\/\\d\\d,date" which will match strings like "1970/01/01" to the identifier "date".

Parameters:

  • strategy: Can be MATCH_FIRST, MATCH_ALL, or MATCH_COMPLETE (Default: MATCH_ALL).
  • rules: Regex rules to match the identifier with.
  • delimiter: Delimiter for rules provided with setRules.
  • externalRules: External resource containing the rules; a delimiter must be specified in the options.
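
As a minimal sketch, rules can also be provided inline instead of from an external file. The setRules and setDelimiter calls below are assumed from the parameter list above; check them against your library version:

from johnsnowlabs import nlp, medical

# Sketch: inline rules in "pattern~ENTITY" form, with "~" declared as the delimiter.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

regex_matcher_inline = medical.RegexMatcherInternal()\
    .setInputCols(["document"])\
    .setOutputCol("regex_matches")\
    .setStrategy("MATCH_ALL")\
    .setRules(["\\d{4}-\\d{2}-\\d{2}~DATE", "(\\d{1,3}\\.){3}\\d{1,3}~IPADDR"])\
    .setDelimiter("~")

pipeline = nlp.Pipeline(stages=[document_assembler, regex_matcher_inline])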

See Spark NLP Workshop for more examples of usage.

Input Annotator Types: DOCUMENT

Output Annotator Type: CHUNK

Show Example

from johnsnowlabs import nlp, medical

text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

data = spark.createDataFrame([[text]]).toDF("text")

rules = '''
(\d{1,3}\.){3}\d{1,3}~IPADDR
\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}|\d{2}/\d{2}/\d{2}~DATE
'''

with open('./rules/regex_rules.txt', 'w') as f:
    f.write(rules)

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

regex_matcher_internal = medical.RegexMatcherInternal()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='./rules/regex_rules.txt', delimiter='~')

nlpPipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        regex_matcher_internal
])

result = nlpPipeline.fit(data).transform(data)

# result
+--------------+-----+---+---------+
|  regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
|    2093-01-13|   38| 47|     DATE|
|203.120.223.13|   97|110|   IPADDR|
|      01/13/93|  188|195|     DATE|
+--------------+-----+---+---------+

//rules = '''
//(\d{1,3}\.){3}\d{1,3}~IPADDR
//\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}|\d{2}/\d{2}/\d{2}~DATE
//'''
//
//with open('./rules/regex_rules.txt', 'w') as f:
//    f.write(rules)

val text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
  |Dr. John Green, ID: 1231511863, IP 203.120.223.13
  |He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
  |Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
  |Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""".stripMargin

import spark.implicits._

val data = Seq(text).toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val regexMatcher = new RegexMatcherInternal()
  .setInputCols("document")
  .setStrategy("MATCH_ALL")
  .setOutputCol("regex_matches")
  .setExternalRulesPath("./rules/regex_rules.txt")
  .setDelimiter("~")

val nlpPipeline = new Pipeline()
  .setStages(Array(documentAssembler, regexMatcher))

val result = nlpPipeline.fit(data).transform(data)

// result
+--------------+-----+---+---------+
|  regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
|    2093-01-13|   38| 47|     DATE|
|203.120.223.13|   97|110|   IPADDR|
|      01/13/93|  188|195|     DATE|
+--------------+-----+---+---------+

RelationExtraction

Model

Extracts and classifies instances of relations between named entities.

Parameters:

  • predictionThreshold (Float): Sets minimal activation of the target unit to encode a new relation instance.

  • relationPairs (List[Str]): List of dash-separated pairs of named entities. For example, [“Biomarker-RelativeDay”] will process all relations between entities of type “Biomarker” and “RelativeDay”.

  • relationPairsCaseSensitive (Bool): Determines whether relation pairs are case sensitive.

  • relationTypePerPair (dict[str, list[str]]): Entity pairs allowed per relation type, limiting which entities can form a given relation. For example, {“CAUSE”: [“PROBLEM”, “SYMPTOM”]} only lets a “CAUSE” relation hold between a problem (“PROBLEM”) and a symptom (“SYMPTOM”).

  • maxSyntacticDistance (Int): Maximal syntactic distance, used as a threshold (Default: 0). Determines how far the “from entity” can be from the “to entity” in the text. Increasing this value increases recall, but also increases the number of false positives.

  • customLabels (dict[str, str]): Custom relation labels.

  • multiClass (Bool): If multiClass is set, the model will return all the labels with corresponding scores (Default: False).

  • DoExceptionHandling: If set to True, the annotator tries to process as usual and, when exception-causing data (e.g. a corrupted record or document) is passed to it, emits an exception warning containing the exception message instead of failing.
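
The snippet below is a minimal sketch of setting these parameters on a pretrained model; the setter names are assumed to mirror the parameter names above, and the relation pairs are illustrative. A complete pipeline is shown in the example further down:

from johnsnowlabs import medical

# Sketch: configure a pretrained relation extraction model.
re_model = medical.RelationExtractionModel.pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setPredictionThreshold(0.5)\
    .setRelationPairs(["DRUG-DOSAGE", "DRUG-FREQUENCY"])\
    .setMaxSyntacticDistance(4)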

For pretrained models please see the Models Hub for available models.

Input Annotator Types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY

Output Annotator Type: CATEGORY

Python API: RelationExtractionModel Scala API: RelationExtractionModel Notebook: RelationExtractionModelNotebook
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

ner_tagger = medical.NerModel()\
    .pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

reModel = medical.RelationExtractionModel()\
    .pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    reModel
])

text = """
The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also
given 1 unit of Metformin daily.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night ,
12 units of insulin lispro with meals , and metformin 1000 mg two times a day.
"""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# Show results
from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(
                              result.relations.result,
                              result.relations.metadata)).alias("cols"))\
.select(
    F.expr("cols['1']['chunk1']").alias("chunk1"),
    F.expr("cols['1']['chunk2']").alias("chunk2"),
    F.expr("cols['1']['entity1']").alias("entity1"),
    F.expr("cols['1']['entity2']").alias("entity2"),
    F.expr("cols['0']").alias("relations"),
    F.expr("cols['1']['confidence']").alias("confidence")).show(5, truncate=False)

+---------+----------------+-------+---------+--------------+----------+
|chunk1   |chunk2          |entity1|entity2  |relations     |confidence|
+---------+----------------+-------+---------+--------------+----------+
|1 unit   |Advil           |DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
|Advil    |for 5 days      |DRUG   |DURATION |DRUG-DURATION |1.0       |
|1 unit   |Metformin       |DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
|Metformin|daily           |DRUG   |FREQUENCY|DRUG-FREQUENCY|1.0       |
|40 units |insulin glargine|DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
+---------+----------------+-------+---------+--------------+----------+

import spark.implicits._

val documenter = new DocumentAssembler()
    .setInputCol("text") 
    .setOutputCol("document") 

val sentencer = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentences") 

val tokenizer = new Tokenizer()
    .setInputCols("sentences") 
    .setOutputCol("tokens") 

val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models") 
    .setInputCols(Array("sentences","tokens")) 
    .setOutputCol("embeddings") 

val pos_tagger = PerceptronModel.pretrained("pos_clinical","en","clinical/models") 
    .setInputCols(Array("sentences","tokens")) 
    .setOutputCol("pos_tags") 

val ner_tagger = MedicalNerModel.pretrained("ner_posology","en","clinical/models") 
    .setInputCols("sentences","tokens","embeddings") 
    .setOutputCol("ner_tags") 

val ner_chunker = new NerConverterInternal()
    .setInputCols(Array("sentences","tokens","ner_tags")) 
    .setOutputCol("ner_chunks") 

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu","en") 
    .setInputCols(Array("sentences","pos_tags","tokens")) 
    .setOutputCol("dependencies") 

val reModel = RelationExtractionModel.pretrained("posology_re") 
    .setInputCols(Array("embeddings","pos_tags","ner_chunks","dependencies")) 
    .setOutputCol("relations") 
    .setMaxSyntacticDistance(4) 

val pipeline = new Pipeline().setStages(Array(
                                             documenter, 
                                             sentencer, 
                                             tokenizer,
                                             words_embedder, 
                                             pos_tagger, 
                                             ner_tagger, 
                                             ner_chunker, 
                                             dependency_parser, 
                                             reModel )) 

val text = " The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also given 1 unit of Metformin daily. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals ,and metformin 1000 mg two times a day. " 

val df = Seq(text).toDF("text")
val result = pipeline.fit(df).transform(df)

// Show results

+---------+----------------+-------+---------+--------------+----------+
|chunk1   |chunk2          |entity1|entity2  |relations     |confidence|
+---------+----------------+-------+---------+--------------+----------+
|1 unit   |Advil           |DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
|Advil    |for 5 days      |DRUG   |DURATION |DRUG-DURATION |1.0       |
|1 unit   |Metformin       |DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
|Metformin|daily           |DRUG   |FREQUENCY|DRUG-FREQUENCY|1.0       |
|40 units |insulin glargine|DOSAGE |DRUG     |DOSAGE-DRUG   |1.0       |
+---------+----------------+-------+---------+--------------+----------+

Approach

Trains a TensorFlow model for relation extraction.

To train a custom relation extraction model, you first need to create a TensorFlow graph using either the TfGraphBuilder annotator or the tf_graph module. Then, set the path to the TensorFlow graph using the method .setModelFile("path/to/tensorflow_graph.pb").
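
As a sketch, such a graph could be generated with the tf_graph module before training; the build_params values below (input_dim, output_dim, hidden_layers) are illustrative assumptions that depend on your embeddings and the number of relation labels in your data:

from sparknlp_jsl.training import tf_graph

# Sketch: build a relation-extraction graph file that can later be passed to .setModelFile().
tf_graph.build(
    "relation_extraction",
    build_params={
        "input_dim": 6000,            # size of the input feature vector (assumption)
        "output_dim": 3,              # number of relation labels in the training data (assumption)
        "hidden_layers": [300, 200],  # sizes of the fully connected hidden layers (assumption)
        "hidden_act": "relu",
    },
    model_location="/tmp/re_graphs",
    model_filename="re_graph.pb",
)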

If the parameter relationDirectionCol is set, the model will be trained using the direction information (see the parameter decription for details). Otherwise, the model won’t have direction between the relation of the entities.

After training a model (using the .fit() method), the resulting object is of class RelationExtractionModel.

Parameters:

  • FromEntity: (begin_col: str, end_col: str, label_col: str) Sets the “from” entity

  • begin_col: Column indicating where the chunk begins

  • end_col: Column indicating where the chunk ends

  • label_col: Column indicating the type of the chunk

  • ToEntity: (begin_col: str, end_col: str, label_col: str) Sets the “to” entity

  • begin_col: Column indicating where the chunk begins

  • end_col: Column indicating where the chunk ends

  • label_col: Column indicating the type of the chunk

  • CustomLabels: (labels: dict[str, str]) Sets custom relation labels

  • labels: Dictionary which maps old to new labels

  • RelationDirectionCol: (col: str) Relation direction column (possible values are “none”, “left”, or “right”). If this parameter is not set, the model will not learn a direction for the relation between the entities

  • col: Column containing the relation direction values

  • PretrainedModelPath (value: str) Path to an already trained model saved to disk, which is used as a starting point for training the new model

  • OverrideExistingLabels (bool) Whether to override already learned labels when using a pretrained model to initialize the new model. Default is ‘true’

  • batchSize: (Int) Size for each batch in the optimization process

  • EpochsNumber (Int) Maximum number of epochs to train

  • Dropout: (Float) Dropout at the output of each layer

  • LearningRate: (Float) Learning rate for the optimization process

  • OutputLogsPath: (Str) Folder path to save training logs. If no path is specified, the logs are not stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage path (S3).

  • ModelFile: (Str) The path to the Tensorflow graph

  • FixImbalance (Boolean) Whether to fix the imbalance in the training set by replicating examples of underrepresented categories

  • ValidationSplit (Float) The proportion of training dataset to be used as validation set

  • OverrideExistingLabels (Boolean) Controls whether to override already learned labels when using a pretrained model to initialize the new model. A value of true will override existing labels

  • MultiClass (Boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.

  • ModelFile (Str) Location of file of the model used for classification

  • MaxSyntacticDistance (Int) Maximal syntactic distance, as threshold (Default: 0)
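
A brief sketch of how a few of these training options might be set is shown below; the setter names are assumed to mirror the parameter names above, and the column names and paths are placeholders:

from johnsnowlabs import medical

# Sketch: optional training parameters on RelationExtractionApproach.
re_approach = medical.RelationExtractionApproach()\
    .setInputCols(["embeddings", "posTags", "train_ner_chunks", "dependencies"])\
    .setOutputCol("relations_t")\
    .setLabelColumn("target_rel")\
    .setRelationDirectionCol("relation_direction")\
    .setPretrainedModelPath("/models/previous_re_model")\
    .setOverrideExistingLabels(False)\
    .setOutputLogsPath("dbfs:/re_training_logs")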

Input Annotator Types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY

Output Annotator Type: NONE

Python API: RelationExtractionApproach Scala API: RelationExtractionApproach Notebook: RelationExtractionApproachNotebook
Show Example

from johnsnowlabs import nlp, medical

# Defining pipeline stages to extract entities first
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("tokens")

embedder = nlp.WordEmbeddingsModel \
  .pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["document", "tokens"]) \
  .setOutputCol("embeddings")

posTagger = nlp.PerceptronModel \
  .pretrained("pos_clinical", "en", "clinical/models") \
  .setInputCols(["document", "tokens"]) \
  .setOutputCol("posTags")

nerTagger = medical.NerModel \
  .pretrained("ner_events_clinical", "en", "clinical/models") \
  .setInputCols(["document", "tokens", "embeddings"]) \
  .setOutputCol("ner_tags")

nerConverter = nlp.NerConverter() \
  .setInputCols(["document", "tokens", "ner_tags"]) \
  .setOutputCol("nerChunks")

dependencyParser = nlp.DependencyParserModel \
  .pretrained("dependency_conllu", "en") \
  .setInputCols(["document", "posTags", "tokens"]) \
  .setOutputCol("dependencies")

# Then define `RelationExtractionApproach` and training parameters
re = medical.RelationExtractionApproach() \
  .setInputCols(["embeddings", "posTags", "train_ner_chunks", "dependencies"]) \
  .setOutputCol("relations_t") \
  .setLabelColumn("target_rel") \
  .setEpochsNumber(300) \
  .setBatchSize(200) \
  .setLearningRate(0.001) \
  .setModelFile("path/to/graph_file.pb") \
  .setFixImbalance(True) \
  .setValidationSplit(0.05) \
  .setFromEntity("from_begin", "from_end", "from_label") \
  .setToEntity("to_begin", "to_end", "to_label")

finisher = nlp.Finisher() \
  .setInputCols(["relations_t"]) \
  .setOutputCols(["relations"]) \
  .setCleanAnnotations(False) \
  .setValueSplitSymbol(",") \
  .setAnnotationSplitSymbol(",") \
  .setOutputAsArray(False)

# Define complete pipeline and start training
pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    tokenizer,
    embedder,
    posTagger,
    nerTagger,
    nerConverter,
    dependencyParser,
    re,
    finisher])

model = pipeline.fit(trainData)

import spark.implicits._

// Defining pipeline stages to extract entities first
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("tokens")

val embedder = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "tokens"))
  .setOutputCol("embeddings")

val posTagger = PerceptronModel
  .pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "tokens"))
  .setOutputCol("posTags")

val nerTagger = MedicalNerModel
  .pretrained("ner_events_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "tokens", "embeddings"))
  .setOutputCol("ner_tags")

val nerConverter = new NerConverter()
  .setInputCols(Array("document", "tokens", "ner_tags"))
  .setOutputCol("nerChunks")

val dependencyParser = DependencyParserModel
  .pretrained("dependency_conllu", "en")
  .setInputCols(Array("document", "posTags", "tokens"))
  .setOutputCol("dependencies")

// Then define `RelationExtractionApproach` and training parameters
val re = new RelationExtractionApproach()
  .setInputCols(Array("embeddings", "posTags", "train_ner_chunks", "dependencies"))
  .setOutputCol("relations_t")
  .setLabelColumn("target_rel")
  .setEpochsNumber(300)
  .setBatchSize(200)
  .setLearningRate(0.001f)
  .setModelFile("path/to/graph_file.pb")
  .setFixImbalance(true)
  .setValidationSplit(0.05f)
  .setFromEntity("from_begin", "from_end", "from_label")
  .setToEntity("to_begin", "to_end", "to_label")

val finisher = new Finisher()
  .setInputCols(Array("relations_t"))
  .setOutputCols(Array("relations"))
  .setCleanAnnotations(false)
  .setValueSplitSymbol(",")
  .setAnnotationSplitSymbol(",")
  .setOutputAsArray(false)

// Define complete pipeline and start training
val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embedder,
    posTagger,
    nerTagger,
    nerConverter,
    dependencyParser,
    re,
    finisher))

val model = pipeline.fit(trainData)

RelationExtractionDL

Model

This Relation Extraction annotator extracts and classifies instances of relations between named entities. In contrast with RelationExtractionModel, RelationExtractionDLModel is based on BERT.

Parameters:

  • predictionThreshold (Float): Sets minimal activation of the target unit to encode a new relation instance.

  • customLabels (dict[str, str]): Custom relation labels.

  • DoExceptionHandling: If set to True, the annotator tries to process as usual and, when exception-causing data (e.g. a corrupted record or document) is passed to it, emits an exception warning containing the exception message instead of failing.
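
A minimal sketch of setting these parameters on a pretrained model; the label mapping values below are illustrative assumptions, not the model's actual output labels:

from johnsnowlabs import medical

# Sketch: raise the prediction threshold and remap raw output labels to readable names.
re_dl_model = medical.RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models")\
    .setInputCols(["re_ner_chunks", "sentence"])\
    .setOutputCol("relations")\
    .setPredictionThreshold(0.5)\
    .setCustomLabels({"1": "is_ade_of", "0": "no_relation"})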

Available models can be found at the Models Hub.

For more extended examples on document pre-processing, see the Spark NLP Workshop.

Input Annotator Types: CHUNK, DOCUMENT

Output Annotator Type: CATEGORY

Python API: RelationExtractionDLModel Scala API: RelationExtractionDLModel Notebook: RelationExtractionDLModelNotebook
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

words_embedder = nlp.WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

pos_tagger = nlp.PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token", "embeddings")\
    .setOutputCol("ner_tags")

ner_chunker = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = nlp.DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")

ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade", "ade-drug"])

ade_re_model = medical.RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setPredictionThreshold(0.5)\
    .setOutputCol("relations")

pipeline = nlp.Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer,
    words_embedder,
    pos_tagger,
    ner_tagger,
    ner_chunker,
    dependency_parser,
    ade_re_ner_chunk_filter,
    ade_re_model
])

text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)

+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end|   chunk1|entity1|entity2_begin|entity2_end|                                                   chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|       0|           25|         32| naproxen|   DRUG|          137|        148|                                             tense bullae|    ADE|       1| 0.9989047|
|       0|           25|         32| naproxen|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1| 0.9989704|
|       0|           87|         95|oxaprozin|   DRUG|          137|        148|                                             tense bullae|    ADE|       1|0.99895453|
|       0|           87|         95|oxaprozin|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_splitter = finance.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_dates")

ner_converter_date = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_dates"])\
    .setOutputCol("ner_chunk_date")

ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner_orgs")

ner_converter_org = finance.NerConverterInternal()\
    .setInputCols(["sentence","token","ner_orgs"])\
    .setOutputCol("ner_chunk_org")\

chunk_merger = finance.ChunkMergeApproach()\
    .setInputCols('ner_chunk_org', "ner_chunk_date")\
    .setOutputCol('ner_chunk')

pos = nlp.PerceptronModel.pretrained()\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos", "token"])\
    .setOutputCol("dependencies")

re_filter = finance.RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunk")\
    .setRelationPairs(["ORG-ORG", "ORG-DATE"])\
    .setMaxSyntacticDistance(10)

reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
    .setInputCols(["re_ner_chunk", "sentence"])\
    .setOutputCol("relation")\
    .setPredictionThreshold(0.1)

pipeline = nlp.Pipeline(stages=[
    document_assembler,
    text_splitter,
    tokenizer,
    embeddings,
    ner_model_date,
    ner_converter_date,
    ner_model_org,
    ner_converter_org,
    chunk_merger,
    pos,
    dependency_parser,
    re_filter,
    reDL])

text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relation.metadata, result.relation.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)

+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                 chunk1|entity1|entity2_begin|entity2_end|         chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|       0|           16|         22|                Cadence|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|0.99687237|
|       0|           66|         80|        AWR Corporation|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|  0.993112|
|       0|           94|        116|Integrand Software, Inc|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date| 0.9741451|
|       0|           66|         80|        AWR Corporation|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|  0.997124|
|       0|           94|        116|Integrand Software, Inc|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|0.99910504|
|       0|           94|        116|Integrand Software, Inc|    ORG|           66|         80|AWR Corporation|    ORG|     was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

text_splitter = legal.TextSplitter()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = legal.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

re_model = legal.RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")\
    .setPredictionThreshold(0.1)\
    .setInputCols(["ner_chunk", "sentence"])\
    .setOutputCol("relation")

pipeline = nlp.Pipeline(stages=[
        document_assembler,
        text_splitter,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        re_model
        ])

text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relation.metadata, result.relation.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)

+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                             chunk1|entity1|entity2_begin|entity2_end|                 chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|           69|         85|      December 31, 2018|EFFDATE|            dated_as| 0.9856822|
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|          141|        163|Armstrong Flooring, Inc|  PARTY|           signed_by| 0.7816506|
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|          205|        221|      AFI Licensing LLC|  PARTY|           signed_by|0.53521496|
|       0|          141|        163|            Armstrong Flooring, Inc|  PARTY|          192|        197|                 Seller|  ALIAS|           has_alias| 0.8962001|
|       0|          205|        221|                  AFI Licensing LLC|  PARTY|          263|        271|              Licensing|  ALIAS|           has_alias|0.95189077|
|       0|          292|        297|                             Seller|  ALIAS|          301|        307|                Arizona|  ALIAS|has_collective_alias| 0.8934925|
|       1|          411|        445|Armstrong Hardwood Flooring Company|  PARTY|          478|        484|                Company|  ALIAS|           has_alias|0.98353034|
|       1|          505|        509|                              Buyer|  ALIAS|          516|        529|         Buyer Entities|  ALIAS|has_collective_alias| 0.7217146|
|       1|          611|        615|                              Party|  ALIAS|          641|        647|                Parties|  ALIAS|has_collective_alias| 0.5040909|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+

import spark.implicits._

val documenter = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentencer = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val wordsEmbedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val posTagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos_tags")

val nerTagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_tags")

val nerChunker = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_tags"))
  .setOutputCol("ner_chunks")

val dependencyParser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos_tags", "token"))
  .setOutputCol("dependencies")

val adeReNerChunkFilter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunks", "dependencies"))
  .setOutputCol("re_ner_chunks")
  .setMaxSyntacticDistance(10)
  .setRelationPairs(Array("drug-ade", "ade-drug"))

val adeReModel = RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models")
  .setInputCols(Array("re_ner_chunks", "sentences"))
  .setPredictionThreshold(0.5)
  .setOutputCol("relations")

val pipeline = new Pipeline()
  .setStages(Array(
    documenter,
    sentencer,
    tokenizer,
    wordsEmbedder,
    posTagger,
    nerTagger,
    nerChunker,
    dependencyParser,
    adeReNerChunkFilter,
    adeReModel
  ))

val text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end|   chunk1|entity1|entity2_begin|entity2_end|                                                   chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|       0|           25|         32| naproxen|   DRUG|          137|        148|                                             tense bullae|    ADE|       1| 0.9989047|
|       0|           25|         32| naproxen|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1| 0.9989704|
|       0|           87|         95|oxaprozin|   DRUG|          137|        148|                                             tense bullae|    ADE|       1|0.99895453|
|       0|           87|         95|oxaprozin|   DRUG|          154|        210|cutaneous fragility on the face and the back of the hands|    ADE|       1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter() 
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner_model_date = FinanceNerModel.pretrained("finner_sec_dates", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_dates")

val ner_converter_date = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_dates"))
  .setOutputCol("ner_chunk_date")

val ner_model_org = FinanceNerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner_orgs")

val ner_converter_org = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner_orgs"))
  .setOutputCol("ner_chunk_org")

val chunk_merger = new ChunkMergeApproach()
  .setInputCols(Array("ner_chunk_org", "ner_chunk_date"))
  .setOutputCol("ner_chunk")

val pos = PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
  .setInputCols(Array("sentence", "pos", "token"))
  .setOutputCol("dependencies")

val re_filter = new RENerChunksFilter()
  .setInputCols(Array("ner_chunk", "dependencies"))
  .setOutputCol("re_ner_chunk")
  .setRelationPairs(Array("ORG-ORG", "ORG-DATE"))
  .setMaxSyntacticDistance(10)

val reDL = RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")
  .setInputCols(Array("re_ner_chunk", "sentence"))
  .setOutputCol("relation")
  .setPredictionThreshold(0.1)

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    text_splitter,
    tokenizer,
    embeddings,
    ner_model_date,
    ner_converter_date,
    ner_model_org,
    ner_converter_org,
    chunk_merger,
    pos,
    dependency_parser,
    re_filter,
    reDL
  ))

text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                 chunk1|entity1|entity2_begin|entity2_end|         chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|       0|           16|         22|                Cadence|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|0.99687237|
|       0|           66|         80|        AWR Corporation|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date|  0.993112|
|       0|           94|        116|Integrand Software, Inc|    ORG|            3|         13|    fiscal 2020|   DATE|has_acquisition_date| 0.9741451|
|       0|           66|         80|        AWR Corporation|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|  0.997124|
|       0|           94|        116|Integrand Software, Inc|    ORG|           16|         22|        Cadence|    ORG|     was_acquired_by|0.99910504|
|       0|           94|        116|Integrand Software, Inc|    ORG|           66|         80|AWR Corporation|    ORG|     was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val text_splitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
  .setMaxSentenceLength(512)

val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val re_model = RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")
  .setPredictionThreshold(0.1)
  .setInputCols(Array("ner_chunk", "sentence"))
  .setOutputCol("relation")

val pipeline = new Pipeline()
  .setStages(Array(
    document_assembler,
    text_splitter,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    re_model
  ))

text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end|                             chunk1|entity1|entity2_begin|entity2_end|                 chunk2|entity2|            relation|confidence|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|           69|         85|      December 31, 2018|EFFDATE|            dated_as| 0.9856822|
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|          141|        163|Armstrong Flooring, Inc|  PARTY|           signed_by| 0.7816506|
|       0|            5|         35|    INTELLECTUAL PROPERTY AGREEMENT|    DOC|          205|        221|      AFI Licensing LLC|  PARTY|           signed_by|0.53521496|
|       0|          141|        163|            Armstrong Flooring, Inc|  PARTY|          192|        197|                 Seller|  ALIAS|           has_alias| 0.8962001|
|       0|          205|        221|                  AFI Licensing LLC|  PARTY|          263|        271|              Licensing|  ALIAS|           has_alias|0.95189077|
|       0|          292|        297|                             Seller|  ALIAS|          301|        307|                Arizona|  ALIAS|has_collective_alias| 0.8934925|
|       1|          411|        445|Armstrong Hardwood Flooring Company|  PARTY|          478|        484|                Company|  ALIAS|           has_alias|0.98353034|
|       1|          505|        509|                              Buyer|  ALIAS|          516|        529|         Buyer Entities|  ALIAS|has_collective_alias| 0.7217146|
|       1|          611|        615|                              Party|  ALIAS|          641|        647|                Parties|  ALIAS|has_collective_alias| 0.5040909|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+

Replacer

Model

Replacer allows you to replace entities in the original text with the ones produced by annotators such as NameChunkObfuscatorApproach or DateNormalizer.

Replacer is most often used in conjunction with the DateNormalizer annotator or in deidentification pipelines.

When working with dates, the Replacer annotator is used to replace specific tokens in a text with another token or string, while the DateNormalizer annotator is used to normalize dates and times to a standardized format.
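For example, a minimal sketch of this combination (the column names are illustrative and assume an upstream stage has already produced a CHUNK column named date_chunk with raw date mentions):

from johnsnowlabs import medical

# Normalize the raw date chunks to the YYYY/MM/DD standard format.
date_normalizer = medical.DateNormalizer() \
    .setInputCols(["date_chunk"]) \
    .setOutputCol("normalized_date")

# Write the normalized chunks back into the text of the "sentence" column.
date_replacer = medical.Replacer() \
    .setInputCols(["normalized_date", "sentence"]) \
    .setOutputCol("normalized_document")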

Obfuscation in healthcare is the act of making healthcare data difficult to understand or use without authorization. This can be done by replacing or removing identifying information, such as names, dates of birth, and Social Security numbers. Obfuscation can also be used to hide the contents of healthcare records, such as diagnoses, medications, and treatment plans.

In the deidentification process, the Replacer annotator is used to replace certain tokens or patterns in the text with specified values. For example, it can be used to replace all instances of a person’s name with a placeholder like “PERSON”.

The NameChunkObfuscatorApproach annotator is used to identify and obfuscate sensitive named entities in the text, such as people’s names, addresses, dates of birth, SSNs etc.

Parameter:

  • setUseReplacement: (Boolean) Whether to use the replacement values produced by the upstream annotator (e.g. NameChunkObfuscatorApproach) when rewriting the text.

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: DOCUMENT

Python API: Replacer Scala API: Replacer Notebook: ReplacerNotebook
Show Example

from johnsnowlabs import nlp, medical

names = """Mitchell#NAME
Clifford#NAME
Jeremiah#NAME
Lawrence#NAME
Brittany#NAME
Patricia#NAME
Samantha#NAME
Jennifer#NAME
Jackson#NAME
Leonard#NAME
Randall#NAME
Camacho#NAME
Ferrell#NAME
Mueller#NAME
Bowman#NAME
Hansen#NAME
Acosta#NAME
Gillespie#NAME
Zimmerman#NAME
Gillespie#NAME
Chandler#NAME
Bradshaw#NAME
Ferguson#NAME
Jacobson#NAME
Figueroa#NAME
Chandler#NAME
Schaefer#NAME
Matthews#NAME
Ferguson#NAME
Bradshaw#NAME
Figueroa#NAME
Delacruz#NAME
Gallegos#NAME
Villarreal#NAME
Williamson#NAME
Montgomery#NAME
Mclaughlin#NAME
Blankenship#NAME
Fitzpatrick#NAME
"""

with open('names_test.txt', 'w') as file:
    file.write(names)


# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("sentence")\

# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")\

# Clinical word embeddings trained on PubMed dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter_name = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","ner"])\
    .setOutputCol("ner_chunk")

nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setRefFileFormat("csv")\
  .setObfuscateRefFile("names_test.txt")\
  .setRefSep("#")\

replacer_name = medical.Replacer()\
  .setInputCols("replacement","sentence")\
  .setOutputCol("obfuscated_document_name")\
  .setUseReplacement(True)

nlpPipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter_name,
    nameChunkObfuscator,
    replacer_name
    ])

sample_text = "John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment."

data = spark.createDataFrame([[sample_text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)

## Result

Original text.  :  John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.

Obfuscated text :  Joseeduardo is a 62 y.o. patient admitted. Mr. Teigan was seen by attending physician Dr. Mayson and was scheduled for emergency assessment.

import spark.implicits._

/* names.txt file

names = """Mitchell#NAME
Clifford#NAME
Jeremiah#NAME
Lawrence#NAME
Brittany#NAME
Patricia#NAME
Samantha#NAME
Jennifer#NAME
Jackson#NAME
Leonard#NAME
Randall#NAME
Camacho#NAME
Ferrell#NAME
Mueller#NAME
Bowman#NAME
Hansen#NAME
Acosta#NAME
Gillespie#NAME
Zimmerman#NAME
Gillespie#NAME
Chandler#NAME
Bradshaw#NAME
Ferguson#NAME
Jacobson#NAME
Figueroa#NAME
Chandler#NAME
Schaefer#NAME
Matthews#NAME
Ferguson#NAME
Bradshaw#NAME
Figueroa#NAME
Delacruz#NAME
Gallegos#NAME
Villarreal#NAME
Williamson#NAME
Montgomery#NAME
Mclaughlin#NAME
Blankenship#NAME
Fitzpatrick#NAME
"""
*/

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented","en","clinical/models")
    .setInputCols(Array("sentence","token","embeddings"))
    .setOutputCol("ner")

val ner_converter_name = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("ner_chunk")

val nameChunkObfuscator = new NameChunkObfuscatorApproach()
    .setInputCols("ner_chunk")
    .setOutputCol("replacement")
    .setRefFileFormat("csv")
    .setObfuscateRefFile("names_test.txt")
    .setRefSep("//")

val replacer_name = new Replacer()
    .setInputCols("replacement","sentence")
    .setOutputCol("obfuscated_document_name")
    .setUseReplacement(true)

val nlpPipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    tokenizer, 
    word_embeddings, 
    clinical_ner, 
    ner_converter_name, 
    nameChunkObfuscator, 
    replacer_name))


val test_data = Seq("""John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.""").toDF("text")

val res = nlpPipeline.fit(test_data).transform(test_data)

// Show results

Original text.  :  John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.

Obfuscated text :  Joseeduardo is a 62 y.o. patient admitted. Mr. Teigan was seen by attending physician Dr. Mayson and was scheduled for emergency assessment.

Resolution2Chunk

Model

This annotator is responsible for converting the annotations generated by entity resolver models (typically labeled as ENTITY) into a format compatible with subsequent stages of the pipeline, such as the ChunkMapperModel. It transforms these annotations into CHUNK annotations, allowing for seamless integration and processing of clinical terminologies and entities in the pipeline.
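In practice the conversion is a single stage placed between the resolver and the mapper; a minimal sketch using the column names from the example below:

resolver2chunk = medical.Resolution2Chunk() \
    .setInputCols(["rxnorm_code"]) \
    .setOutputCol("resolver2chunk")

# The resulting CHUNK column can then feed a ChunkMapperModel stage, as in the full example.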

Input Annotator Types: Resolution

Output Annotator Type: CHUNK

Python API: Resolution2Chunk Scala API: Resolution2Chunk Notebook: Resolution2ChunkNotebook
Show Example

from johnsnowlabs import medical, nlp

document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

sbert_embedder = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models")\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setCaseSensitive(False)

rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCol("rxnorm_code")\
      .setDistanceFunction("EUCLIDEAN")

resolver2chunk = medical.Resolution2Chunk()\
      .setInputCols(["rxnorm_code"]) \
      .setOutputCol("resolver2chunk")

chunkerMapper_action = medical.ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\
      .setInputCols(["resolver2chunk"])\
      .setOutputCol("action_mapping")\
      .setRels(["action"]) #for treatment

pipeline = nlp.Pipeline().setStages([document_assembler,
                                 sbert_embedder,
                                 rxnorm_resolver,
                                 resolver2chunk,
                                 chunkerMapper_action
                                 ])

data = spark.createDataFrame([['Zonalon 50 mg']]).toDF('text')

res = pipeline.fit(data).transform(data)

# Example results

from pyspark.sql import functions as F

res.select(F.explode(F.arrays_zip(res.ner_chunk.result,
                                  res.rxnorm_code.result,
                                  res.action_mapping.result)).alias("col"))\
    .select(F.expr("col['0']").alias("document"),
            F.expr("col['1']").alias("rxnorm_code"),
            F.expr("col['2']").alias("Action Mapping")).show(truncate=False)

+-------------+-----------+--------------+
|document     |rxnorm_code|Action Mapping|
+-------------+-----------+--------------+
|Zonalon 50 mg|103971     |Analgesic     |
+-------------+-----------+--------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("ner_chunk")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
  .setInputCols("ner_chunk")
  .setOutputCol("sentence_embeddings")
  .setCaseSensitive(false)

val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en","clinical/models")
  .setInputCols("sentence_embeddings")
  .setOutputCol("rxnorm_code")
  .setDistanceFunction("EUCLIDEAN")

val resolver2chunk = new Resolution2Chunk()
  .setInputCols("rxnorm_code")
  .setOutputCol("resolver2chunk")

val chunkerMapper_action = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper","en","clinical/models")
  .setInputCols("resolver2chunk")
  .setOutputCol("action_mapping")
  .setRels("action")

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sbert_embedder, 
    rxnorm_resolver, 
    resolver2chunk, 
    chunkerMapper_action )) 

val data = Seq("Zonalon 50 mg").toDF("text") 

val res = pipeline.fit(data).transform(data)

// Example results

+-------------+-----------+--------------+
|document     |rxnorm_code|Action Mapping|
+-------------+-----------+--------------+
|Zonalon 50 mg|103971     |Analgesic     |
+-------------+-----------+--------------+

ResolverMerger

Model

ResolverMerger provides the ability to merge sentence entity resolver and chunk mapper model output columns.

To convert a sentence or document into a vector for tasks like semantic search or recommendation systems, a common approach is to utilize transformer models like BERT. These models provide embeddings for each token in the text. One option is to extract the embedding vector of the CLS token, which represents the overall meaning of the text. Another option is to average the embeddings of all tokens.

Alternatively, we can use fine-tuned Siamese network variants like SBERT, which are specifically designed to generate embeddings that bring similar sentences or documents closer together in the embedding space while separating dissimilar ones. These embeddings can be applied in “Sentence Entity Resolver Models” to perform entity mapping.

However, for a more straightforward approach, we can use a chunk mapper method to extract entities from the text. In addition, by combining resolver models and mapper models using the ResolverMerger annotator, we can further enhance the performance and accuracy of the resolver system.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read an Array of strings.

  • outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
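As a minimal sketch, the merge step itself only needs the two output columns to be combined (the column names below follow the full example that comes next):

resolverMerger = medical.ResolverMerger() \
    .setInputCols(["resolver_code", "RxNorm_Mapper"]) \
    .setOutputCol("RxNorm")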

Input Annotator Types: ENTITY, LABEL_DEPENDENCY

Output Annotator Type: ENTITY

Python API: ResolverMerger Scala API: ResolverMerger Notebook: ResolverMergerNotebook
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = medical.NerConverterInternal()\
    .setInputCols("sentence", "token", "ner")\
    .setOutputCol("chunk")

chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
    .setInputCols(["chunk"])\
    .setOutputCol("RxNorm_Mapper")\
    .setRel("rxnorm_code")

cfModel = medical.ChunkMapperFilterer() \
    .setInputCols(["chunk", "RxNorm_Mapper"]) \
    .setOutputCol("chunks_fail") \
    .setReturnCriteria("fail")

chunk2doc = nlp.Chunk2Doc() \
    .setInputCols("chunks_fail") \
    .setOutputCol("doc_chunk")

sbert_embedder = nlp.BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
    .setInputCols(["doc_chunk"])\
    .setOutputCol("sentence_embeddings")\
    .setCaseSensitive(False)

resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("resolver_code") \
    .setDistanceFunction("EUCLIDEAN")

resolverMerger = medical.ResolverMerger()\
    .setInputCols(["resolver_code","RxNorm_Mapper"])\
    .setOutputCol("RxNorm")

mapper_pipeline = nlp.Pipeline(
    stages = [
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_model,
        ner_converter,
        chunkerMapper,
        cfModel,
        chunk2doc,
        sbert_embedder,
        resolver,
        resolverMerger
    ])

sample_text = [
    ["The patient was given Adapin 10 MG, coumadn 5 mg"],
    ["The patient was given Avandia 4 mg, Tegretol, zitiga"],
]

data = spark.createDataFrame(sample_text).toDF("text")

result = mapper_pipeline.fit(data).transform(data)

result.selectExpr(
    "chunk.result as chunk",
    "RxNorm_Mapper.result as RxNorm_Mapper",
    "chunks_fail.result as chunks_fail",
    "resolver_code.result as resolver_code",
    "RxNorm.result as RxNorm",
).show(truncate=False)


## Result

+--------------------------------+----------------------+--------------+-------------+------------------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |resolver_code|RxNorm                  |
+--------------------------------+----------------------+--------------+-------------+------------------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|[200883]     |[1000049, 200883]       |
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |[220989]     |[261242, 203029, 220989]|
+--------------------------------+----------------------+--------------+-------------+------------------------+

import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_posology_greedy","en","clinical/models")
    .setInputCols(Array("sentence","token","embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence","token","ner"))
    .setOutputCol("chunk")

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper","en","clinical/models")
    .setInputCols("chunk")
    .setOutputCol("RxNorm_Mapper")
    .setRel("rxnorm_code")

val cfModel = new ChunkMapperFilterer()
    .setInputCols(Array("chunk","RxNorm_Mapper"))
    .setOutputCol("chunks_fail")
    .setReturnCriteria("fail")

val chunk2doc = new Chunk2Doc()
    .setInputCols("chunks_fail")
    .setOutputCol("doc_chunk")

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols("doc_chunk")
    .setOutputCol("sentence_embeddings")
    .setCaseSensitive(false)

val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en","clinical/models")
    .setInputCols("sentence_embeddings")
    .setOutputCol("resolver_code")
    .setDistanceFunction("EUCLIDEAN")

val resolverMerger = new ResolverMerger()
    .setInputCols(Array("resolver_code","RxNorm_Mapper"))
    .setOutputCol("RxNorm")

val mapper_pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector, 
    tokenizer, 
    word_embeddings, 
    ner_model, 
    ner_converter, 
    chunkerMapper, 
    cfModel, 
    chunk2doc, 
    sbert_embedder, 
    resolver, 
    resolverMerger))


val data = Seq(("""The patient was given Adapin 10 MG, coumadn 5 mg"""),("""The patient was given Avandia 4 mg, Tegretol, zitiga""")).toDF("text")

val res = mapper_pipeline.fit(data).transform(data)

// Show results

+--------------------------------+----------------------+--------------+-------------+------------------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |resolver_code|RxNorm                  |
+--------------------------------+----------------------+--------------+-------------+------------------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|[200883]     |[1000049, 200883]       |
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |[220989]     |[261242, 203029, 220989]|
+--------------------------------+----------------------+--------------+-------------+------------------------+

Router

Model

Router provides the ability to split the output of an annotator based on a selected metadata field and the values of that field.

When multiple sentence entity resolver models are needed in the same pipeline, the BertSentenceEmbeddings annotator previously had to be run once per resolver model. This meant that the heavy process of generating sentence embeddings using BERT was repeated multiple times.

To address this issue, Spark NLP Healthcare Library has introduced a solution using the Router annotator. With this new approach, we can provide all the named entity recognition (NER) chunks to the BertSentenceEmbeddings annotator at once. The annotator generates the sentence embeddings for all the chunks together. Then, the output of the sentence embeddings is routed to the specific resolver models that are required for further processing.

This solution eliminates the need to run BertSentenceEmbeddings multiple times, reducing the computational overhead and improving the efficiency of the pipeline.

Parameters:

  • inputCols: The name of the columns containing the input annotations. It can read an Array of strings.

  • outputCol: The name of the column in the Document type that is generated. We can specify only one column here.

  • inputType: The type of entity that you want to filter (by default sentence_embeddings). Possible values: document|token|wordpiece|word_embeddings|sentence_embeddings|category|date|sentiment|pos|chunk|named_entity|regex|dependency|labeled_dependency|language|keyword

  • metadataField: The key in the metadata dictionary that you want to filter (by default entity)

  • filterFieldsElements: The allowed values for the metadata field that is being filtered.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
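A minimal sketch with these parameters set explicitly (the inputType and metadataField values shown are the documented defaults; the column names are illustrative):

problem_router = medical.Router() \
    .setInputCols(["sbert_embeddings"]) \
    .setInputType("sentence_embeddings") \
    .setMetadataField("entity") \
    .setFilterFieldsElements(["PROBLEM"]) \
    .setOutputCol("problem_embeddings")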

Input Annotator Types: ENTITY, LABEL_DEPENDENCY

Output Annotator Type: ENTITY

Python API: Router Scala API: Router Notebook: RouterNotebook
Show Example

from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols("sentence")\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("word_embeddings")

# to get PROBLEM entities
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("clinical_ner")

clinical_ner_chunk = medical.NerConverterInternal()\
    .setInputCols("sentence","token","clinical_ner")\
    .setOutputCol("clinical_ner_chunk")\
    .setWhiteList(["PROBLEM"])

# to get DRUG entities
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("posology_ner")

posology_ner_chunk = medical.NerConverterInternal()\
    .setInputCols("sentence","token","posology_ner")\
    .setOutputCol("posology_ner_chunk")\
    .setWhiteList(["DRUG"])

# merge the chunks into a single ner_chunk
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("clinical_ner_chunk","posology_ner_chunk")\
    .setOutputCol("final_ner_chunk")\
    .setMergeOverlapping(False)

# convert chunks to doc to get sentence embeddings of them
chunk2doc = nlp.Chunk2Doc().setInputCols("final_ner_chunk").setOutputCol("doc_final_chunk")

sbiobert_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["doc_final_chunk"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)

# filter PROBLEM entity embeddings
router_sentence_icd10 = medical.Router() \
    .setInputCols("sbert_embeddings") \
    .setFilterFieldsElements(["PROBLEM"]) \
    .setOutputCol("problem_embeddings")

# filter DRUG entity embeddings
router_sentence_rxnorm = medical.Router() \
    .setInputCols("sbert_embeddings") \
    .setFilterFieldsElements(["DRUG"]) \
    .setOutputCol("drug_embeddings")

# use problem_embeddings only
icd_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") \
    .setInputCols(["problem_embeddings"]) \
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")

# use drug_embeddings only
rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
    .setInputCols(["drug_embeddings"]) \
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")


pipeline = nlp.Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    clinical_ner_chunk,
    posology_ner,
    posology_ner_chunk,
    chunk_merger,
    chunk2doc,
    sbiobert_embeddings,
    router_sentence_icd10,
    router_sentence_rxnorm,
    icd_resolver,
    rxnorm_resolver
])

clinical_note = """The patient is a 41-year-old Vietnamese female with a cough that started last week.
She has had right-sided chest pain radiating to her back with fever starting yesterday.
She has a history of pericarditis in May 2006 and developed cough with right-sided chest pain.
MEDICATIONS
1. Coumadin 1 mg daily. Last INR was on Tuesday, August 14, 2007, and her INR was 2.3.
2. Amiodarone 100 mg p.o. daily.
"""

data = spark.createDataFrame([[clinical_note]]).toDF("text")

result = pipeline.fit(data).transform(data)

## Result

result.selectExpr(
    "final_ner_chunk.result as chunk",
    "posology_ner_chunk.result as posology_chunk",
    "rxnorm_code.result as rxnorm_code",
    "clinical_ner_chunk.result as clinical_chunk",
    "icd10cm_code.result as icd10cm_code",
).show(truncate=False)

+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|chunk                                                                                                      |posology_chunk        |rxnorm_code  |clinical_chunk                                                                       |icd10cm_code                          |
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain, Coumadin, Amiodarone]|[Coumadin, Amiodarone]|[202421, 703]|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain]|[R05, R10.11, A68, I30.1, R05, R10.11]|
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence","token"))
    .setOutputCol("word_embeddings")
    
// to get PROBLEM entities 
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence","token","word_embeddings"))
    .setOutputCol("clinical_ner")

val clinical_ner_chunk = new NerConverterInternal()
    .setInputCols("sentence","token","clinical_ner")
    .setOutputCol("clinical_ner_chunk")
    .setWhiteList("PROBLEM")

// to get DRUG entities 
val posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
    .setInputCols(Array("sentence","token","word_embeddings"))
    .setOutputCol("posology_ner")

val posology_ner_chunk = new NerConverterInternal()
    .setInputCols("sentence","token","posology_ner")
    .setOutputCol("posology_ner_chunk")
    .setWhiteList("DRUG")

// merge the chunks into a single ner_chunk 
val chunk_merger = new ChunkMergeApproach()
    .setInputCols(Array("clinical_ner_chunk","posology_ner_chunk"))
    .setOutputCol("final_ner_chunk")
    .setMergeOverlapping(false)

// convert chunks to doc to get sentence embeddings of them 
val chunk2doc = new Chunk2Doc()
    .setInputCols("final_ner_chunk")
    .setOutputCol("doc_final_chunk")

val sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
    .setInputCols("doc_final_chunk")
    .setOutputCol("sbert_embeddings")
    .setCaseSensitive(false)

// filter PROBLEM entity embeddings 
val router_sentence_icd10 = new Router()
    .setInputCols("sbert_embeddings")
    .setFilterFieldsElements("PROBLEM")
    .setOutputCol("problem_embeddings")
    
// filter DRUG entity embeddings 
val router_sentence_rxnorm = new Router()
    .setInputCols("sbert_embeddings")
    .setFilterFieldsElements("DRUG")
    .setOutputCol("drug_embeddings")
    
// use problem_embeddings only 
val icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc", "en", "clinical/models")
    .setInputCols("problem_embeddings")
    .setOutputCol("icd10cm_code")
    .setDistanceFunction("EUCLIDEAN")
    
// use drug_embeddings only 
val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")
    .setInputCols("drug_embeddings")
    .setOutputCol("rxnorm_code")
    .setDistanceFunction("EUCLIDEAN")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    word_embeddings, 
    clinical_ner, 
    clinical_ner_chunk, 
    posology_ner, 
    posology_ner_chunk, 
    chunk_merger, 
    chunk2doc, 
    sbiobert_embeddings, 
    router_sentence_icd10, 
    router_sentence_rxnorm, 
    icd_resolver, 
    rxnorm_resolver))


val data = Seq("""The patient is a 41-year-old Vietnamese female with a cough that started last week.
She has had right-sided chest pain radiating to her back with fever starting yesterday.
She has a history of pericarditis in May 2006 and developed cough with right-sided chest pain.
MEDICATIONS
1. Coumadin 1 mg daily. Last INR was on Tuesday, August 14, 2007, and her INR was 2.3.
2. Amiodarone 100 mg p.o. daily.""").toDF("text")

val res = pipeline.fit(data).transform(data)

// Show results

+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|chunk                                                                                                      |posology_chunk        |rxnorm_code  |clinical_chunk                                                                       |icd10cm_code                          |
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain, Coumadin, Amiodarone]|[Coumadin, Amiodarone]|[202421, 703]|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain]|[R05, R10.11, A68, I30.1, R05, R10.11]|
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+

SentenceEntityResolver

Model

The model transforms a dataset with Input Annotation type SENTENCE_EMBEDDINGS, coming from e.g. BertSentenceEmbeddings, and returns the normalized entity for a particular trained ontology / curated dataset (e.g. ICD-10, RxNorm, SNOMED etc.).

Parameters:

  • distanceFunction: Determines how the distance between different entities will be calculated. Either COSINE or EUCLIDEAN.

  • neighbours: The number of neighbours to consider when computing the distances.

  • caseSensitive: Whether to consider text casing or not.

  • threshold: Threshold of the distance between nodes to consider.

  • doExceptionHandling: If set to True, the annotator tries to process as usual, and if exception-causing data (e.g. a corrupted record/document) is passed to the annotator, a warning is emitted containing the exception message.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().

For a list of pretrained models, please see the Models Hub.
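A minimal sketch with these parameters set explicitly (the model and column names follow the clinical example below; the neighbours and threshold values are illustrative, not defaults):

cpt_resolver = medical.SentenceEntityResolverModel \
    .pretrained("sbiobertresolve_cpt_procedures_augmented", "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("cpt_code") \
    .setDistanceFunction("EUCLIDEAN") \
    .setNeighbours(25) \
    .setCaseSensitive(False) \
    .setThreshold(5.0)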

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: ENTITY

Python API: SentenceEntityResolverModel Scala API: SentenceEntityResolverModel Notebook: SentenceEntityResolverModelNotebook
Show Example

from johnsnowlabs import nlp, medical 

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["Test","Procedure"])

c2doc = nlp.Chunk2Doc()\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("ner_chunk_doc")

sbert_embedder = nlp.BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")

# Then the resolver is defined on the extracted entities and sentence embeddings
cpt_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_procedures_augmented","en", "clinical/models")\
    .setInputCols(["sbert_embeddings"])\
    .setOutputCol("cpt_code")\
    .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,
    clinical_ner,
    ner_converter,
    c2doc,
    sbert_embedder,
    cpt_resolver])

text = """She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. CT scan of the chest also revealed a large mediastinal lymph node.
We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma.
At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy, which were performed, which revealed epithelioid malignant mesothelioma."""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# Show Results
+--------------------+---------+-----+----------+--------------------+--------------------+
|               chunk|   entity| code|confidence|       all_k_results|   all_k_resolutions|
+--------------------+---------+-----+----------+--------------------+--------------------+
|CT scan of the chest|     Test|62284|    0.2028|62284:::76497:::7...|Computed tomograp...|
|      pericardectomy|Procedure|33031|    0.3329|33031:::33025:::3...|Pericardectomy [P...|
|chest tube placement|Procedure|39503|    0.9343|39503:::32036:::3...|Insertion of ches...|
|drainage of the f...|Procedure|49405|    0.2476|49405:::49407:::4...|Drainage procedur...|
|        thoracoscopy|Procedure|32660|    0.1422|32660:::32667:::1...|Thoracoscopy [Tho...|
+--------------------+---------+-----+----------+--------------------+--------------------+
from johnsnowlabs import nlp, finance 

documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("ner_chunk") \
      .setOutputCol("sentence_embeddings")

resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models")\
      .setInputCols(["ner_chunk", "sentence_embeddings"]) \
      .setOutputCol("normalized")\
      .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

text = """CONTACT GOLD"""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# Show Results
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunk       |result            |all_k_results                                                                                                                                                                                                                                  |all_k_resolutions                                                                                                                                                                                                                              |
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CONTACT GOLD|Contact Gold Corp.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal 

documentAssembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("ner_chunk")

embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en")\
      .setInputCols("ner_chunk")\
      .setOutputCol("sentence_embeddings")
    
resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models")\
      .setInputCols(["ner_chunk", "sentence_embeddings"])\
      .setOutputCol("irs_code")\
      .setDistanceFunction("EUCLIDEAN")

pipeline = nlp.Pipeline(
      stages = [
          documentAssembler,
          embeddings,
          resolver])

text = """CONTACT GOLD"""

df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)

# Show Results
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunk       |result            |code                                                                                         |all_k_results                                                                                                                                                                                                                                  |all_k_resolutions                                                                                                                                                                                                                              |
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CONTACT GOLD|Contact Gold Corp.|981369960:::0:::208273426:::204092640:::0:::0:::270531073:::261918920:::0:::271989147:::0:::0|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("document") 

val sentenceDetector = SentenceDetectorDLModel.pretrained()
  .setInputCols(Array("document")) 
  .setOutputCol("sentence") 

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence")) 
  .setOutputCol("token") 

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
  .setInputCols(Array("sentence","token")) 
  .setOutputCol("embeddings") 

val clinical_ner = MedicalNerModel.pretrained("jsl_ner_wip_clinical","en","clinical/models")
  .setInputCols(Array("sentence","token","embeddings")) 
  .setOutputCol("ner") 

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","ner")) 
  .setOutputCol("ner_chunk") 
  .setWhiteList(Array("Test","Procedure")) 

val c2doc = new Chunk2Doc()
  .setInputCols(Array("ner_chunk")) 
  .setOutputCol("ner_chunk_doc") 

val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models") 
  .setInputCols(Array("ner_chunk_doc"))
  .setOutputCol("sbert_embeddings") 

// Then the resolver is defined on the extracted entities and sentence embeddings 

val cpt_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_procedures_augmented","en","clinical/models")
  .setInputCols(Array("sbert_embeddings")) 
  .setOutputCol("cpt_code") 
  .setDistanceFunction("EUCLIDEAN") 

val pipeline = new Pipeline().setStages(Array( 
                                              documentAssembler, 
                                              sentenceDetector, 
                                              tokenizer,
                                              word_embeddings, 
                                              clinical_ner, 
                                              ner_converter, 
                                              c2doc, 
                                              sbert_embedder, 
                                              cpt_resolver)) 


val text = "She was admitted to the hospital with chest pain and found to have bilateral pleural effusion,the right greater than the left. CT scan of the chest also revealed a large mediastinal lymph node. We reviewed the pathology obtained from the pericardectomy in March 2006,which was diagnostic of mesothelioma. At this time,chest tube placement for drainage of the fluid occurred and thoracoscopy,which were performed,which revealed epithelioid malignant mesothelioma." 

val df = Seq(text) .toDF("text") 
val result = pipeline.fit(df).transform(df) 

// Show Results
+--------------------+---------+-----+----------+--------------------+--------------------+
|               chunk|   entity| code|confidence|       all_k_results|   all_k_resolutions|
+--------------------+---------+-----+----------+--------------------+--------------------+
|CT scan of the chest|     Test|62284|    0.2028|62284:::76497:::7...|Computed tomograp...|
|      pericardectomy|Procedure|33031|    0.3329|33031:::33025:::3...|Pericardectomy [P...|
|chest tube placement|Procedure|39503|    0.9343|39503:::32036:::3...|Insertion of ches...|
|drainage of the f...|Procedure|49405|    0.2476|49405:::49407:::4...|Drainage procedur...|
|        thoracoscopy|Procedure|32660|    0.1422|32660:::32667:::1...|Thoracoscopy [Tho...|
+--------------------+---------+-----+----------+--------------------+--------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("ner_chunk") 

val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use","en")
  .setInputCols("ner_chunk") 
  .setOutputCol("sentence_embeddings") 

val resolver = SentenceEntityResolverModel.pretrained("finel_edgar_company_name","en","finance/models")
  .setInputCols(Array("ner_chunk","sentence_embeddings")) 
  .setOutputCol("normalized") 
  .setDistanceFunction("EUCLIDEAN") 

val pipeline = new Pipeline().setStages(Array(
                                            documentAssembler, 
                                            embeddings, 
                                            resolver)) 

val text = "CONTACT GOLD" 
val df = Seq(text) .toDF("text") 
val result = pipeline.fit(df).transform(df) 

// Show Results
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunk       |result            |all_k_results                                                                                                                                                                                                                                  |all_k_resolutions                                                                                                                                                                                                                              |
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CONTACT GOLD|Contact Gold Corp.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text") 
  .setOutputCol("ner_chunk") 

val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use","en")
  .setInputCols("ner_chunk") 
  .setOutputCol("sentence_embeddings") 

val resolver = SentenceEntityResolverModel.pretrained("legel_edgar_company_name","en","legal/models")
  .setInputCols(Array("ner_chunk","sentence_embeddings")) 
  .setOutputCol("irs_code") .setDistanceFunction("EUCLIDEAN") 

val pipeline = new Pipeline().setStages(Array(
                                            documentAssembler, 
                                            embeddings, 
                                            resolver)) 

val text = "CONTACT GOLD" 

val df = Seq(text) .toDF("text") 
val result = pipeline.fit(df).transform(df)

// Show Results
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunk       |result            |code                                                                                         |all_k_results                                                                                                                                                                                                                                  |all_k_resolutions                                                                                                                                                                                                                              |
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CONTACT GOLD|Contact Gold Corp.|981369960:::0:::208273426:::204092640:::0:::0:::270531073:::261918920:::0:::271989147:::0:::0|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Approach

Trains a SentenceEntityResolverModel that maps sentence embeddings to entities in a knowledge base.

Parameters:

  • labelCol : Column name for the value we are trying to resolve. Usually this contains the entity ID in the knowledge base (e.g., the ICD-10 code).

  • normalizedCol: Column name for the original, normalized description

  • aux_label_col: Auxiliary label which maps resolved entities to additional labels

  • useAuxLabel: Whether to use the auxiliary column or not. Default value is False.

  • distanceFunction: Determines how the distance between different entities will be calculated.

  • confidenceFunction: What function to use to calculate confidence: either INVERSE or SOFTMAX.

  • caseSensitive: whether to ignore case in tokens for embeddings matching (Default: False)

  • threshold: Threshold value for the last distance calculated (default: 5.0)

  • missAsEmpty: whether or not to return an empty annotation on unmatched chunks (default: True)

When fine-tuning an existing model, there are additional parameters (see the sketch below):

  • pretrainedModelPath: Path to an already trained SentenceEntityResolverModel. This pretrained model will be used as a starting point for training the new one. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage path (S3).

  • overrideExistingCodes: Whether to override the existing codes with new data while continuing training from a pretrained model. Default value is False (keep all the codes).

  • dropCodesList: A list of codes in a pretrained model that will be omitted when the training process begins with a pretrained model.

You can find pretrained Sentence Embeddings (using BERT or another architecture) in the NLP Models Hub (https://nlp.johnsnowlabs.com/models?task=Embeddings).
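
A minimal fine-tuning sketch, assuming the standard setter names for the fine-tuning parameters listed above; the model path, column names, and dropped codes are hypothetical:

from johnsnowlabs import medical

# Minimal fine-tuning sketch (hypothetical path, columns, and codes): continue training
# from an existing SentenceEntityResolverModel instead of starting from scratch.
finetune_resolver = medical.SentenceEntityResolverApproach() \
    .setInputCols(["bert_embeddings"]) \
    .setNormalizedCol("normalized_text") \
    .setLabelCol("label") \
    .setOutputCol("resolved_code") \
    .setPretrainedModelPath("s3://my-bucket/resolvers/existing_resolver_model") \
    .setOverrideExistingCodes(False) \
    .setDropCodesList(["C0000001", "C0000002"])

# finetuned_model = finetune_resolver.fit(embedded_training_data)  # embeddings produced by a pre-processing pipeline as in the examples below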

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: ENTITY

Python API: SentenceEntityResolverApproach Scala API: SentenceEntityResolverApproach Notebook: SentenceEntityResolverApproachNotebook
Show Example

from johnsnowlabs import nlp, medical 

# Training a SNOMED resolution model using BERT sentence embeddings
# Define a pre-processing pipeline for the training data. It needs to contain columns for the normalized training data and their labels.
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("normalized_text") \
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

bertEmbeddings = nlp.BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased") \
  .setInputCols(["sentence"]) \
  .setOutputCol("bert_embeddings")

snomedTrainingPipeline = nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  bertEmbeddings
])
snomedTrainingModel = snomedTrainingPipeline.fit(data)
snomedData = snomedTrainingModel.transform(data).cache()

# Then the Resolver can be trained with
bertExtractor = medical.SentenceEntityResolverApproach() \
  .setNeighbours(25) \
  .setThreshold(1000) \
  .setInputCols(["bert_embeddings"]) \
  .setNormalizedCol("normalized_text") \
  .setLabelCol("label") \
  .setOutputCol("snomed_code") \
  .setDistanceFunction("EUCLIDIAN") \
  .setCaseSensitive(False)

snomedModel = bertExtractor.fit(snomedData)
from johnsnowlabs import nlp, finance

# Define a pre-processing pipeline for the training data. It needs to contain columns for the normalized training data and their labels.
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("normalized_text") \
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

bertEmbeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_large_cased") \
  .setInputCols(["sentence"]) \
  .setOutputCol("bert_embeddings")

preprocessing_pipeline = nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  bertEmbeddings
])
preprocessing_model = preprocessing_pipeline.fit(data)
processed_data = preprocessing_model.transform(data).cache()

# Then the Resolver can be trained with
bertExtractor = finance.SentenceEntityResolverApproach() \
  .setNeighbours(25) \
  .setThreshold(1000) \
  .setInputCols(["bert_embeddings"]) \
  .setNormalizedCol("normalized_text") \
  .setLabelCol("label") \
  .setOutputCol("snomed_code") \
  .setDistanceFunction("EUCLIDIAN") \
  .setCaseSensitive(False)

model = bertExtractor.fit(processed_data)
from johnsnowlabs import nlp, legal

# Define a pre-processing pipeline for the training data. It needs to contain columns for the normalized training data and their labels.
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("normalized_text") \
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

bertEmbeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal") \
  .setInputCols(["sentence"]) \
  .setOutputCol("bert_embeddings")

preprocessing_pipeline = nlp.Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  bertEmbeddings
])
data_preprocessing_model = preprocessing_pipeline.fit(data)
processed_data = data_preprocessing_model.transform(data).cache()

# Then the Resolver can be trained with
bertExtractor = legal.SentenceEntityResolverApproach() \
  .setNeighbours(25) \
  .setThreshold(1000) \
  .setInputCols(["bert_embeddings"]) \
  .setNormalizedCol("normalized_text") \
  .setLabelCol("label") \
  .setOutputCol("snomed_code") \
  .setDistanceFunction("EUCLIDIAN") \
  .setCaseSensitive(False)

model = bertExtractor.fit(processed_data)

import spark.implicits._

// Training a SNOMED resolution model using BERT sentence embeddings
// Define a pre-processing pipeline for the training data. It needs to contain columns for the normalized training data and their labels.
val documentAssembler = new DocumentAssembler()
   .setInputCol("normalized_text")
   .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

 val bertEmbeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")
   .setInputCols("sentence")
   .setOutputCol("bert_embeddings")

 val snomedTrainingPipeline = new Pipeline().setStages(Array(
   documentAssembler,
   sentenceDetector,
   bertEmbeddings
 ))

 val snomedTrainingModel = snomedTrainingPipeline.fit(data)
 val snomedData = snomedTrainingModel.transform(data).cache()

// Then the Resolver can be trained with
val bertExtractor = new SentenceEntityResolverApproach()
  .setNeighbours(25)
  .setThreshold(1000)
  .setInputCols("bert_embeddings")
  .setNormalizedCol("normalized_text")
  .setLabelCol("label")
  .setOutputCol("snomed_code")
  .setDistanceFunction("EUCLIDIAN")
  .setCaseSensitive(false)

val snomedModel = bertExtractor.fit(snomedData)
import spark.implicits._

// Training a SNOMED resolution model using BERT sentence embeddings
// Define a pre-processing pipeline for the training data. It needs to contain columns for the normalized training data and their labels.
val documentAssembler = new DocumentAssembler()
   .setInputCol("normalized_text")
   .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

 val bertEmbeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")
   .setInputCols("sentence")
   .setOutputCol("bert_embeddings")
   
 val snomedTrainingPipeline = new Pipeline().setStages(Array(
   documentAssembler,
   sentenceDetector,
   bertEmbeddings
 ))
 val snomedTrainingModel = snomedTrainingPipeline.fit(data)
 val snomedData = snomedTrainingModel.transform(data).cache()

// Then the Resolver can be trained with
val bertExtractor = new SentenceEntityResolverApproach()
  .setNeighbours(25)
  .setThreshold(1000)
  .setInputCols("bert_embeddings")
  .setNormalizedCol("normalized_text")
  .setLabelCol("label")
  .setOutputCol("snomed_code")
  .setDistanceFunction("EUCLIDIAN")
  .setCaseSensitive(false)

val snomedModel = bertExtractor.fit(snomedData)
import spark.implicits._

// Training a SNOMED resolution model using BERT sentence embeddings
// Define a pre-processing pipeline for the training data. It needs to contain columns for the normalized training data and their labels.
val documentAssembler = new DocumentAssembler()
   .setInputCol("normalized_text")
   .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

 val bertEmbeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")
   .setInputCols("sentence")
   .setOutputCol("bert_embeddings")

 val snomedTrainingPipeline = new Pipeline().setStages(Array(
   documentAssembler,
   sentenceDetector,
   bertEmbeddings
 ))
 val snomedTrainingModel = snomedTrainingPipeline.fit(data)
 val snomedData = snomedTrainingModel.transform(data).cache()

// Then the Resolver can be trained with
val bertExtractor = new SentenceEntityResolverApproach()
  .setNeighbours(25)
  .setThreshold(1000)
  .setInputCols("bert_embeddings")
  .setNormalizedCol("normalized_text")
  .setLabelCol("label")
  .setOutputCol("snomed_code")
  .setDistanceFunction("EUCLIDIAN")
  .setCaseSensitive(false)

val snomedModel = bertExtractor.fit(snomedData)

Summarizer

Model

The Summarizer annotator uses a generative deep learning model to create summaries of medical, finance, and legal texts. It helps to quickly summarize complex medical, financial, and legal information from related documents.

Parameters:

  • doSample: Whether or not to use sampling, use greedy decoding otherwise (Default: false)

  • ignoreTokenIds: A list of token ids which are ignored in the decoder’s output (Default: Array())

  • maxNewTokens: Maximum number of new tokens to be generated (Default: 30)

  • maxTextLength: Maximum length of context text.

  • noRepeatNgramSize: If set to int > 0, all ngrams of that size can only occur once (Default: 0)

  • randomSeed: Optional Random seed for the model.

  • refineChunkSize: How large the refined chunks should be.

  • refineMaxAttempts: How many times chunks should be re-summarized while they remain above refineSummaryTargetLength before stopping.

  • refineSummary: Set to true to perform refined summarization at increased computation cost (see the sketch below).

  • refineSummaryTargetLength: Target length for refined summary.

  • topK: The number of highest probability vocabulary tokens to keep for top-k-filtering (Default: 50)

  • useCache: Cache internal state of the model to improve performance

Available models can be found at the Models Hub.

For more extended examples on document pre-processing, see the Spark NLP Workshop.
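
The refine* parameters are not exercised in the examples below; a minimal sketch, assuming the standard setter names for the parameters listed above:

from johnsnowlabs import medical

# Minimal refined-summarization sketch (assumed setter names for the refine* parameters):
# long documents are summarized in chunks and re-summarized until the refined summary
# approaches the target length.
summarizer_refined = medical.Summarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxTextLength(512)\
    .setMaxNewTokens(512)\
    .setRefineSummary(True)\
    .setRefineSummaryTargetLength(300)\
    .setRefineChunkSize(512)\
    .setRefineMaxAttempts(3)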

Input Annotator Types: DOCUMENT

Output Annotator Type: CHUNK

Python API: MedicalSummarizer Scala API: MedicalSummarizer
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')

summarizer = medical.Summarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxTextLength(512)\
    .setMaxNewTokens(512)

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        summarizer
])

text = """The patient is a pleasant 17-year-old gentleman who was playing basketball today in gym. Two hours prior to presentation, he started to fall and someone stepped on his ankle and kind of twisted his right ankle and he cannot bear weight on it now. It hurts to move or bear weight. No other injuries noted. He does not think he has had injuries to his ankle in the past.
SOCIAL HISTORY: He does not drink or smoke.
MEDICAL DECISION MAKING:
He had an x-ray of his ankle that showed a small ossicle versus avulsion fracture of the talonavicular joint on the lateral view. He has had no pain over the metatarsals themselves. This may be a fracture based upon his exam. He does want to have me to put him in a splint. He was given Motrin here. He will be discharged home to follow up with Dr. X from Orthopedics.
DISPOSITION: Crutches and splint were administered here. I gave him a prescription for Motrin and some Darvocet if he needs to length his sleep and if he has continued pain to follow up with Dr. X. Return if any worsening problems."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("summary.result").show(truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[A 17-year-old man fell and twisted his right ankle, causing pain to move or bear weight. An x-ray showed a small ossicle or avulsion fracture of the talonavicular joint on the lateral view, which may be a fracture based upon his exam. He was given Motrin and discharged home with crutches and a prescription for Motrin and Darvocet. He was advised to follow up with his doctor if pain worsens and return if any worsening problems worsen.]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

summarizer = finance.Summarizer.pretrained('finsum_flant5_base','en','finance/models')\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxNewTokens(1000)

pipeline = nlp.Pipeline(stages=[document_assembler, summarizer])

data = spark.createDataFrame([["""Lost Time Incident Rate: 
The lost time incident rate per 200,000 hours worked in 2021 was 0.14, which decreased by 17.6% compared to 2020 (0.17) and decreased by 70.8% compared to 2019 (0.48). The decrease in the lost time incident rate can be attributed to the company's efforts to improve workplace safety and implement effective risk management strategies. 
The total Scope 2 GHG emissions in 2021 were 688,228 tonnes, which remained relatively stable compared to 2020. The company's efforts to transition to renewable energy sources have helped to minimize Scope 2 GHG emissions."""]]).toDF('text')

result = pipeline.fit(data).transform(data)

result.select("summary.result").show(truncate=False)
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

summarizer = legal.Summarizer.pretrained('legsum_flant5_legal_augmented','en','legal/models')\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxNewTokens(1000)

pipeline = nlp.Pipeline(stages=[document_assembler, summarizer])

data = spark.createDataFrame([
  ["""NOW, THEREFORE, in consideration of the Company’s disclosure of information to the Recipient
and the promises set forth below, the parties agree as follows:

     1. Confidential Information. “Confidential Information” as used in this
Agreement means all information relating to the Company disclosed to the Recipient by the Company,
including without limitation any business, technical, marketing, financial or other information,
whether in written, electronic or oral form. Any and all reproductions, copies, notes, summaries,
reports, analyses or other material derived by the Recipient or its Representatives (as defined
below) in whole or in part from the Confidential Information in whatever form maintained shall be
considered part of the Confidential Information itself and shall be treated as such. Confidential
Information does not include information that (a) is or becomes part of the public domain other
than as a result of disclosure by the Recipient or its Representatives; (b) becomes available to
the Recipient on a nonconfidential basis from a source other than the Company, provided that source
is not bound with respect to that information by a confidentiality agreement with the Company or is
otherwise prohibited from transmitting that information by a contractual, legal or other
obligation; (c) can be proven by the Recipient to have been in the Recipient’s possession prior to
disclosure of the same by the Company; or (d) is independently developed by the Recipient without
reference to or reliance on any of the Company’s Confidential Information."""]
]).toDF('text')

result = pipeline.fit(data).transform(data)

result.select("summary.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[This legal agreement states that the company has disclosed all information relating to the company to the recipient, including any business, technical, marketing, financial or other information. It also states that any reproductions, copies, notes, summaries, reports, analyses or other material derived from the confidential information must be treated as part of the confidential information. The confidential information does not include information that is or becomes part of the public domain other than as a result of disclosure by the recipient or its representatives, becomes available to the recipient on a nonconfidential basis from a source other than the company, can be proven by the recipient to have been in the recipients possession prior to disclosure, or is independently developed by the recipient without reference to or reliance on any of the companys confidential information.]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val summarizer = Summarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")
  .setInputCols(Array("document"))
  .setOutputCol("summary")
  .setMaxTextLength(512)
  .setMaxNewTokens(512)


val pipeline = new Pipeline().setStages(Array(documentAssembler, summarizer))

val text = """The patient is a pleasant 17-year-old gentleman who was playing basketball today in gym. Two hours prior to presentation, he started to fall and someone stepped on his ankle and kind of twisted his right ankle and he cannot bear weight on it now. It hurts to move or bear weight. No other injuries noted. He does not think he has had injuries to his ankle in the past.
SOCIAL HISTORY: He does not drink or smoke.
MEDICAL DECISION MAKING:
He had an x-ray of his ankle that showed a small ossicle versus avulsion fracture of the talonavicular joint on the lateral view. He has had no pain over the metatarsals themselves. This may be a fracture based upon his exam. He does want to have me to put him in a splint. He was given Motrin here. He will be discharged home to follow up with Dr. X from Orthopedics.
DISPOSITION: Crutches and splint were administered here. I gave him a prescription for Motrin and some Darvocet if he needs to length his sleep and if he has continued pain to follow up with Dr. X. Return if any worsening problems."""

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val summarizer = Summarizer.pretrained("finsum_flant5_base", "en", "finance/models")
  .setInputCols(Array("document"))
  .setOutputCol("summary")
  .setMaxNewTokens(1000)

val pipeline = new Pipeline().setStages(Array(documentAssembler, summarizer))

val text = """Lost Time Incident Rate: 
The lost time incident rate per 200,000 hours worked in 2021 was 0.14, which decreased by 17.6% compared to 2020 (0.17) and decreased by 70.8% compared to 2019 (0.48). The decrease in the lost time incident rate can be attributed to the company's efforts to improve workplace safety and implement effective risk management strategies. 
The total Scope 2 GHG emissions in 2021 were 688,228 tonnes, which remained relatively stable compared to 2020. The company's efforts to transition to renewable energy sources have helped to minimize Scope 2 GHG emissions."""

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val summarizer = Summarizer.pretrained("legsum_flant5_legal_augmented", "en", "legal/models")
  .setInputCols(Array("document"))
  .setOutputCol("summary")
  .setMaxNewTokens(1000)

val pipeline = new Pipeline()
  .setStages(Array(documentAssembler, summarizer))

val text = """NOW, THEREFORE, in consideration of the Company’s disclosure of information to the Recipient
and the promises set forth below, the parties agree as follows:

     1. Confidential Information. “Confidential Information” as used in this
Agreement means all information relating to the Company disclosed to the Recipient by the Company,
including without limitation any business, technical, marketing, financial or other information,
whether in written, electronic or oral form. Any and all reproductions, copies, notes, summaries,
reports, analyses or other material derived by the Recipient or its Representatives (as defined
below) in whole or in part from the Confidential Information in whatever form maintained shall be
considered part of the Confidential Information itself and shall be treated as such. Confidential
Information does not include information that (a) is or becomes part of the public domain other
than as a result of disclosure by the Recipient or its Representatives; (b) becomes available to
the Recipient on a nonconfidential basis from a source other than the Company, provided that source
is not bound with respect to that information by a confidentiality agreement with the Company or is
otherwise prohibited from transmitting that information by a contractual, legal or other
obligation; (c) can be proven by the Recipient to have been in the Recipient’s possession prior to
disclosure of the same by the Company; or (d) is independently developed by the Recipient without
reference to or reliance on any of the Company’s Confidential Information."""

val data = Seq(text).toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[This legal agreement states that the company has disclosed all information relating to the company to the recipient, including any business, technical, marketing, financial or other information. It also states that any reproductions, copies, notes, summaries, reports, analyses or other material derived from the confidential information must be treated as part of the confidential information. The confidential information does not include information that is or becomes part of the public domain other than as a result of disclosure by the recipient or its representatives, becomes available to the recipient on a nonconfidential basis from a source other than the company, can be proven by the recipient to have been in the recipients possession prior to disclosure, or is independently developed by the recipient without reference to or reliance on any of the companys confidential information.]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

TFGraphBuilder

Model

The TFGraphBuilder annotator can be used to create graphs in the model training pipeline. TFGraphBuilder inspects the data and creates the proper graph if a suitable version of TensorFlow (>= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.

You can use this builder with MedicalNerApproach, FinanceNerApproach, LegalNerApproach, RelationExtractionApproach, AssertionDLApproach, and GenericClassifierApproach. A sketch of pointing the NER trainer at the generated graph follows the examples below.

ATTENTION: Changing the parameters of TFGraphBuilder may affect the performance of the model you want to train.

Input Annotator Types: The setInputCols parameter changes depending on the setModelName parameter.

Output Annotator Type: There is no output annotation; the builder creates a file with a .pb extension and saves it to the location set with setGraphFile.

Python API: TFGraphBuilder
Show Example

graph_folder = "./medical_graphs"
ner_graph_builder = medical.TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setGraphFolder(graph_folder)\
    .setIsLicensed(True)  # False -> for NerDLApproach
graph_folder = "./finance_graphs"
ner_graph_builder = finance.TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setGraphFolder(graph_folder)\
    .setIsLicensed(True)  # False -> for NerDLApproach
graph_folder = "./legal_graphs"
ner_graph_builder = legal.TFGraphBuilder()\
    .setModelName("ner_dl")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label")\
    .setGraphFile("auto")\
    .setHiddenUnitsNumber(20)\
    .setGraphFolder(graph_folder)\
    .setIsLicensed(True)  # False -> for NerDLApproach
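
The graph written by the builder is then consumed by the corresponding Approach. A minimal sketch of pointing the NER trainer at the same folder, assuming the johnsnowlabs alias medical.NerApproach for MedicalNerApproach; the column names are hypothetical and must match the training data:

from johnsnowlabs import medical

# Minimal sketch: train a medical NER model with the .pb graph that TFGraphBuilder
# wrote to ./medical_graphs (see the example above).
ner_trainer = medical.NerApproach()\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setLabelColumn("label")\
    .setOutputCol("ner")\
    .setMaxEpochs(10)\
    .setGraphFolder("./medical_graphs")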

TextGenerator

Model

The Medical, Financial, and Legal Text Generators are specialized tools designed for text abstraction in their respective fields. The MedicalTextGenerator, based on the BioGPT model, excels in medical text abstraction, allowing users to provide prompts and contexts for tasks like disease explanation, paraphrasing medical context, or creating clinical notes for cancer patients. This model is adept at extracting relevant information due to its training on extensive medical data.

Similarly, the Financial and Legal Text Generators utilize the Flan-T5 model, an advanced version of the T5 model, for tasks in financial and legal text abstraction. Users can input prompts and contexts to receive high-quality summaries, document abstractions, and other text-based outputs. The Flan-T5 model’s training on a diverse range of texts ensures the generation of coherent and accurate content in these domains.

Parameters:

  • maxNewTokens: Maximum number of new tokens to generate, by default 30

  • maxContextLength: Maximum length of context text

  • configProtoBytes: ConfigProto from tensorflow, serialized into byte array.

  • doSample: Whether or not to use sampling; use greedy decoding otherwise, by default False

  • topK: The number of highest probability vocabulary tokens to consider, by default 1

  • noRepeatNgramSize: The number of tokens that can’t be repeated in the same order. Useful for preventing loops. The default is 0.

  • ignoreTokenIds: A list of token ids which are ignored in the decoder’s output, by default []

  • randomSeed: Set to positive integer to get reproducible results, by default None.

  • customPrompt: The only available variable is {DOCUMENT}, and it is populated with the contents of the input document (see the sketch below).

Available models can be found at the Models Hub.

For more extended examples on document pre-processing see the Spark NLP Workshop.
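
The customPrompt parameter is not used in the examples below; a minimal sketch, assuming a setCustomPrompt setter matching the parameter listed above:

from johnsnowlabs import nlp, medical

# Minimal sketch (assumed setCustomPrompt setter): wrap each incoming document in a fixed
# instruction; {DOCUMENT} is replaced with the contents of the input annotation.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

text_generator = medical.TextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("answer")\
    .setMaxNewTokens(128)\
    .setCustomPrompt("question: {DOCUMENT} answer:")

pipeline = nlp.Pipeline(stages=[document_assembler, text_generator])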

Input Annotator Types: DOCUMENT

Output Annotator Type: CHUNK

Python API: MedicalTextGenerator Scala API: MedicalTextGenerator
Show Example

from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("prompt")\
    .setOutputCol("document_prompt")

med_text_generator  = medical.TextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")\
    .setInputCols("document_prompt")\
    .setOutputCol("answer")\
    .setMaxNewTokens(256)\
    .setDoSample(True)\
    .setTopK(3)\
    .setRandomSeed(42)\
    .setStopAtEos(True)

pipeline = nlp.Pipeline(stages=[document_assembler, med_text_generator])

data = spark.createDataFrame([['Covid 19 is']]).toDF("prompt")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)

+--------------------------------------------------------------------------+
|result                                                                    |
+--------------------------------------------------------------------------+
|[Covid 19 is a pandemic that has affected the world's economy and health.]|
+--------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance 

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("prompt")

flant5 = finance.TextGenerator.pretrained("fingen_flant5_base","en","finance/models")\
    .setInputCols(["prompt"])\
    .setOutputCol("answer")\
    .setMaxNewTokens(150)\
    .setStopAtEos(True)

pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

data = spark.createDataFrame([["Explain what is Sec 10-k filing"]]).toDF('text')

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------+
|result                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------+
|[Sec 10k filing is a form of tax filing that requires a party to file jointly or several entities for tax purposes.]|
+--------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("prompt")

flant5 = legal.TextGenerator.pretrained("leggen_flant5_finetuned","en","legal/models")\
    .setInputCols(["prompt"])\
    .setOutputCol("answer")\
    .setMaxNewTokens(200)\
    .setTopK(3)\
    .setRandomSeed(42)\
    .setNoRepeatNgramSize(3)\
    .setStopAtEos(True)
 
pipeline = nlp.Pipeline(stages=[document_assembler, flant5])

data = spark.createDataFrame([["This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission."]]).toDF("text")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result                                                                                                                                                                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission. The redacted material is confidential |
| and will not be disclosed to any third party without the prior written consent of the parties.                                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("prompt")
  .setOutputCol("document_prompt")

val medTextGenerator = TextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")
  .setInputCols(Array("document_prompt"))
  .setOutputCol("answer")
  .setMaxNewTokens(256)
  .setDoSample(true)
  .setTopK(3)
  .setRandomSeed(42)
  .setStopAtEos(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, medTextGenerator))

val data = Seq("Covid 19 is").toDS.toDF("prompt")

val result = pipeline.fit(data).transform(data)

+--------------------------------------------------------------------------+
|result                                                                    |
+--------------------------------------------------------------------------+
|[Covid 19 is a pandemic that has affected the world's economy and health.]|
+--------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("prompt")

val flanT5 = TextGenerator.pretrained("fingen_flant5_base", "en", "finance/models")
  .setInputCols(Array("prompt"))
  .setOutputCol("answer")
  .setMaxNewTokens(150)
  .setStopAtEos(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, flanT5))

val data = Seq("Explain what is Sec 10-k filing").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

+--------------------------------------------------------------------------------------------------------------------+
|result                                                                                                              |
+--------------------------------------------------------------------------------------------------------------------+
|[Sec 10k filing is a form of tax filing that requires a party to file jointly or several entities for tax purposes.]|
+--------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("prompt")

val flanT5 = TextGenerator.pretrained("leggen_flant5_finetuned", "en", "legal/models")
  .setInputCols(Array("prompt"))
  .setOutputCol("answer")
  .setMaxNewTokens(200)
  .setTopK(3)
  .setRandomSeed(42)
  .setNoRepeatNgramSize(3)
  .setStopAtEos(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, flanT5))

val data = Seq("This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)


+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result                                                                                                                                                                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission. The redacted material is confidential |
| and will not be disclosed to any third party without the prior written consent of the parties.                                                                                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

TextMatcherInternal

Model

This annotator matches exact phrases provided in a file against a Document.

Parameters:

  • setEntities (str): Sets the external resource for the entities. It takes a path (str) to the external resource, an optional read_as (str) describing how to read the resource (by default ReadAs.TEXT), and optional options (dict) for reading the resource (by default {"format": "text"}).

  • setCaseSensitive (Boolean): Sets whether to match regardless of case. (Default: True)

  • setMergeOverlapping (Boolean): Sets whether to merge overlapping matched chunks. (Default: False)

  • setEntityValue (str): Sets the value for the entity metadata field. If no entity value is set in the file, it needs to be set here (see the sketch below).

  • setBuildFromTokens (Boolean): Sets whether the TextMatcherInternal should take the CHUNK from TOKEN.

  • setDelimiter (str): Sets value for the delimiter between Phrase, Entity.

See Spark NLP Workshop for more examples of usage.
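
If the phrase file does not carry a per-line label, the setEntityValue parameter described above can supply one for all matches; a minimal sketch with a hypothetical phrase file:

from johnsnowlabs import medical

# Minimal sketch (hypothetical file without per-line labels): a single entity value
# is applied to every matched phrase.
with open("drug_phrases.csv", "w") as f:
    f.write("aspirin\nparacetamol\nibuprofen\n")

entity_matcher = medical.TextMatcherInternal()\
    .setInputCols(["document", "token"])\
    .setEntities("drug_phrases.csv")\
    .setOutputCol("matched_text")\
    .setEntityValue("Drug")\
    .setCaseSensitive(False)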

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Show Example

from johnsnowlabs import nlp, medical

matcher_drug = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
"""

with open ('matcher_drug.csv', 'w') as f:
  f.write(matcher_drug)

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

entityExtractor = medical.TextMatcherInternal()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv")\
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")\
    .setMergeOverlapping(False)

matcher_pipeline = nlp.Pipeline().setStages([
                  documentAssembler,
                  tokenizer,
                  entityExtractor])

data = spark.createDataFrame([["John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."]]).toDF("text")

matcher_model = matcher_pipeline.fit(data)
result = matcher_model.transform(data)

# result
+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|      aspirin|   25| 31| Drug|
|aspirin 100mg|   25| 37| Drug|
|  paracetamol|   75| 85| Drug|
|  amoxicillin|  102|112| Drug|
|    ibuprofen|  134|142| Drug|
| lansoprazole|  170|181| Drug|
+-------------+-----+---+-----+

import spark.implicits._

//matcher_drug = """
//Aspirin 100mg#Drug
//aspirin#Drug
//paracetamol#Drug
//amoxicillin#Drug
//ibuprofen#Drug
//lansoprazole#Drug
//"""
//
//with open ('matcher_drug.csv', 'w') as f:
//  f.write(matcher_drug)

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val entityExtractor = new TextMatcherInternal()
  .setInputCols(Array("document", "token"))
  .setOutputCol("matched_text")
  .setEntities("matcher_drug.csv")
  .setCaseSensitive(false)
  .setDelimiter("#")
  .setMergeOverlapping(false)

val matcherPipeline = new Pipeline()
  .setStages(Array(documentAssembler, 
                   tokenizer, 
                   entityExtractor))

val data = Seq("John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD.")
  .toDF("text")

val matcherModel = matcherPipeline.fit(data)
val result = matcherModel.transform(data)


// result
+-------------+-----+---+-----+
|        chunk|begin|end|label|
+-------------+-----+---+-----+
|      aspirin|   25| 31| Drug|
|aspirin 100mg|   25| 37| Drug|
|  paracetamol|   75| 85| Drug|
|  amoxicillin|  102|112| Drug|
|    ibuprofen|  134|142| Drug|
| lansoprazole|  170|181| Drug|
+-------------+-----+---+-----+

WindowedSentenceModel

Model

This annotator helps you merge the previous and following sentences of a given piece of text, so that you add the surrounding context. This is especially useful for context-rich analyses that require a deeper understanding of the language being used.

Inferring the class of sentence X alone may sometimes be a much harder task, due to the lack of context, than inferring the class of sentence X-1 + sentence X + sentence X+1. In this example the window is 1, which is why we augment each sentence with one neighbour from behind and one from ahead. The window size can be configured so that each piece of text/sentence gets a number of previous and posterior sentences as context equal to the window size.

Parameters:

  • setWindowSize: Sets the size of the sliding window.

  • setGlueString: Sets string to use to join the neighboring elements together.
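
A minimal sketch combining both setters:

from johnsnowlabs import medical

# Minimal sketch: merge each sentence with one neighbour on each side and join the
# pieces with " | " so the window boundaries stay visible in the output.
windowed = medical.WindowedSentenceModel()\
    .setInputCols(["sentence"])\
    .setOutputCol("windowed_sentence")\
    .setWindowSize(1)\
    .setGlueString(" | ")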

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: WindowedSentenceModel Scala API: WindowedSentenceModel
Show Example

from johnsnowlabs import medical, nlp

documentAssembler =  nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector =  nlp.SentenceDetector()\
    .setInputCols("document")\
    .setOutputCol("sentence")

windowedSentence1 =  medical.WindowedSentenceModel()\
    .setWindowSize(1)\
    .setInputCols("sentence")\
    .setOutputCol("window_1")

windowedSentence2 =  medical.WindowedSentenceModel()\
    .setWindowSize(2)\
    .setInputCols("sentence")\
    .setOutputCol("window_2")

pipeline = nlp.Pipeline(stages=[
    documentAssembler, 
    sentenceDetector, 
    windowedSentence1, 
    windowedSentence2
    ])


sample_text = """The patient was admitted on Monday. 
She has a right-sided pleural effusion for thoracentesis. 
Her Coumadin was placed on hold.
A repeat echocardiogram was checked. 
She was started on prophylaxis for DVT. 
Her CT scan from March 2006 prior to her pericardectomy. 
It already shows bilateral plural effusions."""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = pipeline.fit(data).transform(data)

# Example results

result.select(F.explode('window_1')).select('col.result').show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis.                                                |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold.               |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked.              |
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT.                                |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy.        |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.|
|Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------+

result.select(F.explode('window_2')).select('col.result').show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold.                                                                                                  |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked.                                                             |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT.                     |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy.|
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.             |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                              |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F

doc_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("isolated_sentence")

context_window = legal.WindowedSentenceModel()\
    .setInputCols(["isolated_sentence"])\
    .setOutputCol("window")\
    .setWindowSize(1)

window_splitting_pipeline = nlp.Pipeline(stages=[doc_assembler, sentence_detector, context_window])

window_splitting_model = window_splitting_pipeline.fit(df)

window_splitting_lp = nlp.LightPipeline(window_splitting_model)

## Result

['1  \nMUTUAL NONDISCLOSURE AGREEMENT  \nThis Mutual Nondisclosure Agreement (the “Agreement”) is made on _________ (“Effective  \nDate”) by and between:  \n(1) John Snow Labs, a Delaware corporation, registered at 16192 Coastal Highway,  \nLewes, Delaware 19958 (“John Snow Labs”), and   \n(2) Achiles, S.L, a Spanish corporation, registered at Gran Via, 2º floor, Offices 9\nand 10.(“Company”),  \n(each a “party” and together the “parties”). Recitals:  \nJohn Snow Labs and Company intend to explore the possibility of a business relationship  \nbetween each other, whereby each party (“Discloser”) may disclose sensitive information to the  \nother party (“Recipient”).',
 '1  \nMUTUAL NONDISCLOSURE AGREEMENT  \nThis Mutual Nondisclosure Agreement (the “Agreement”) is made on _________ (“Effective  \nDate”) by and between:  \n(1) John Snow Labs, a Delaware corporation, registered at 16192 Coastal Highway,  \nLewes, Delaware 19958 (“John Snow Labs”), and   \n(2) Achiles, S.L, a Spanish corporation, registered at Gran Via, 2º floor, Offices 9\nand 10.(“Company”),  \n(each a “party” and together the “parties”). Recitals:  \nJohn Snow Labs and Company intend to explore the possibility of a business relationship  \nbetween each other, whereby each party (“Discloser”) may disclose sensitive information to the  \nother party (“Recipient”). The parties agree as follows:',
 'Recitals:  \nJohn Snow Labs and Company intend to explore the possibility of a business relationship  \nbetween each other, whereby each party (“Discloser”) may disclose sensitive information to the  \nother party (“Recipient”). The parties agree as follows: 1. Definition.',]

import spark.implicits._

val documentAssembler =  new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector =  new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val windowedSentence1 =  new WindowedSentenceModel()
    .setWindowSize(1)
    .setInputCols("sentence")
    .setOutputCol("window_1")

val windowedSentence2 =  new WindowedSentenceModel()
    .setWindowSize(2)
    .setInputCols("sentence")
    .setOutputCol("window_2")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    sentenceDetector, 
    windowedSentence1, 
    windowedSentence2
))


val testDataset = Seq("""The patient was admitted on Monday.
She has a right-sided pleural effusion for thoracentesis.
Her Coumadin was placed on hold.
A repeat echocardiogram was checked.
She was started on prophylaxis for DVT.
Her CT scan from March 2006 prior to her pericardectomy.
It already shows bilateral plural effusions.""").toDF("text")

val result = pipeline.fit(testDataset).transform(testDataset)

// Result

// window 1

+---------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                       |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis.                                                |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold.               |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked.              |
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT.                                |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy.        |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.|
|Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                        |
+---------------------------------------------------------------------------------------------------------------------------------------------+

// window 2

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold.                                                                                                  |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked.                                                             |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT.                     |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy.|
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.             |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                              |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._

val doc_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("isolated_sentence")

val context_window = new WindowedSentenceModel()
    .setInputCols("isolated_sentence")
    .setOutputCol("window")
    .setWindowSize(1)

val pipeline = new Pipeline().setStages(Array(
    doc_assembler, 
    sentence_detector, 
    context_window))

val window_splitting_model = pipeline.fit(df)

val window_splitting_lp = new LightPipeline(window_splitting_model)

ZeroShotNerModel

Model

This is a zero-shot named entity recognition model based on RoBertaForQuestionAnswering. Zero-shot models excel at generalization: they can accurately predict entities in very different datasets without the need to fine-tune the model or train it from scratch for each new domain.

Even though a model trained to solve a specific problem can achieve better accuracy than a zero-shot model on that specific task, it will probably not be useful on a different task. That is where zero-shot models show their usefulness, achieving good results across many different scenarios.

Parameters:

  • entityDefinitions: A dictionary with definitions of the named entities. The keys of the dictionary are the entity types and the values are lists of hypothesis templates.

  • predictionThreshold: Minimal confidence score to consider an entity (Default: 0.01).

  • ignoreEntities: A list of entities to be discarded from the output.

All the parameters can be set using the corresponding set method in camel case. For example, .setPredictionThreshold().
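As an illustration, here is a minimal sketch of loading the pretrained clinical model and setting these parameters; the entity definitions are shortened from the example below, and the .setIgnoreEntities setter name is assumed from the ignoreEntities parameter above.

from johnsnowlabs import medical

zero_shot_ner = medical.ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setEntityDefinitions({"PROBLEM": ["What is the disease?"], "DRUG": ["Which drug?"]})\
    .setPredictionThreshold(0.1)\
    .setIgnoreEntities(["DRUG"])  # assumed setter, derived from the ignoreEntities parameter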

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: ZeroShotNerModel Scala API: ZeroShotNerModel Notebook: ZeroShotNerModelNotebook
Show Example

from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

zero_shot_ner = medical.ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")\
    .setEntityDefinitions(
        {
            "PROBLEM": ["What is the disease?", "What is his symptom?", "What is her disease?", "What is his disease?",
                        "What is the problem?" ,"What does a patient suffer", 'What was the reason that the patient is admitted to the clinic?'],
            "DRUG": ["Which drug?", "Which is the drug?", "What is the drug?", "Which drug does he use?", "Which drug does she use?", "Which drug do I use?", "Which drug is prescribed for a symptom?"],
            "ADMISSION_DATE": ["When did patient admitted to a clinic?"],
            "PATIENT_AGE": ["How old is the patient?","What is the gae of the patient?"]
        })\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setPredictionThreshold(0.1) # default 0.01

ner_converter = medical.NerConverterInternal()\
    .setInputCols(["sentence", "token", "zero_shot_ner"])\
    .setOutputCol("ner_chunk")\

pipeline = nlp.Pipeline(stages = [
    documentAssembler,
    sentenceDetector,
    tokenizer,
    zero_shot_ner,
    ner_converter])

text_list = ["The doctor pescribed Majezik for my severe headache.",
             "The patient was admitted to the hospital for his colon cancer.",
             "27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis."
            ]

data = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['1']['confidence']").alias("confidence")).show(50, truncate=100)

+------------------------------------------------+--------------+----------+
|                                           chunk|     ner_label|confidence|
+------------------------------------------------+--------------+----------+
|                                         Majezik|          DRUG|0.64671576|
|                                 severe headache|       PROBLEM| 0.5526346|
|                                    colon cancer|       PROBLEM| 0.8898498|
|                                    27 years old|   PATIENT_AGE| 0.6943085|
|                                         Sep 1st|ADMISSION_DATE|0.95646095|
|a right-sided pleural effusion for thoracentesis|       PROBLEM|0.50026613|
+------------------------------------------------+--------------+----------+
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

textsplitter = finance.TextSplitter()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

zero_shot_ner = finance.ZeroShotNerModel.pretrained("finner_roberta_zeroshot", "en", "finance/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("zero_shot_ner")\
    .setEntityDefinitions(
        {
            "DATE": ['When was the company acquisition?', 'When was the company purchase agreement?'],
            "ORG": ["Which company was acquired?"],
            "PRODUCT": ["Which product?"],
            "PROFIT_INCREASE": ["How much has the gross profit increased?"],
            "REVENUES_DECLINED": ["How much has the revenues declined?"],
            "OPERATING_LOSS_2020": ["Which was the operating loss in 2020"],
            "OPERATING_LOSS_2019": ["Which was the operating loss in 2019"]
        })

ner_converter = finance.NerConverterInternal()\
  .setInputCols(["sentence", "token", "zero_shot_ner"])\
  .setOutputCol("ner_chunk")

pipeline =  nlp.Pipeline(stages=[
  documentAssembler,
  textsplitter,
  tokenizer,
  zero_shot_ner,
  ner_converter
  ]
)

from pyspark.sql.types import StringType
text_list = ["In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
              "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
              "While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020 as compared to 2019.",
              "We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of $7,738,193 in 2019."]

data = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(50, truncate=100)


+------------------+-------------------+
|chunk             |ner_label          |
+------------------+-------------------+
|March 2012        |DATE               |
|Vertro            |ORG                |
|ALOT              |PRODUCT            |
|February 2017     |DATE               |
|NetSeer           |ORG                |
|81.4%             |PROFIT_INCREASE    |
|27%               |REVENUES_DECLINED  |
|$8,048,581 million|OPERATING_LOSS_2020|
|$7,738,193        |OPERATING_LOSS_2019|
|2019              |DATE               |
+------------------+-------------------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F

documentAssembler = nlp.DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

textSplitter = legal.TextSplitter()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

zero_shot_ner = legal.ZeroShotNerModel.pretrained("legner_roberta_zeroshot", "en", "legal/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("zero_shot_ner")\
  .setEntityDefinitions(
        {
            "DATE": ['When was the company acquisition?', 'When was the company purchase agreement?', "When was the agreement?"],
            "ORG": ["Which company?"],
            "STATE": ["Which state?"],
            "AGREEMENT": ["What kind of agreement?"],
            "LICENSE": ["What kind of license?"],
            "LICENSE_RECIPIENT": ["To whom the license is granted?"]
        })
    
ner_converter = legal.NerConverterInternal()\
  .setInputCols(["sentence", "token", "zero_shot_ner"])\
  .setOutputCol("ner_chunk")

pipeline =  nlp.Pipeline(stages=[
  documentAssembler,
  textSplitter,
  tokenizer,
  zero_shot_ner,
  ner_converter
  ]
)

from pyspark.sql.types import StringType

text_list = [
    "In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
    "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
    "This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company (the 'Licensee')",
    "The Company hereby grants to Seller a perpetual, non- exclusive, royalty-free license",
]

data = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(data).transform(data)

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(50, truncate=100)

+-------------------------------------+-----------------+
|chunk                                |ner_label        |
+-------------------------------------+-----------------+
|March 2012                           |DATE             |
|Vertro, Inc                          |ORG              |
|February 2017                        |DATE             |
|asset purchase agreement             |AGREEMENT        |
|NetSeer                              |ORG              |
|INTELLECTUAL PROPERTY                |AGREEMENT        |
|December 31, 2018                    |DATE             |
|Armstrong Flooring                   |LICENSE_RECIPIENT|
|Delaware                             |STATE            |
|AFI Licensing LLC, a Delaware company|LICENSE_RECIPIENT|
|Seller                               |LICENSE_RECIPIENT|
|perpetual                            |LICENSE          |
|non- exclusive                       |LICENSE          |
|royalty-free                         |LICENSE          |
+-------------------------------------+-----------------+

import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val zeroShotNer = ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")
  .setEntityDefinitions(Map(
    "PROBLEM" -> Seq("What is the disease?", "What is his symptom?", "What is her disease?", "What is his disease?",
                     "What is the problem?" ,"What does a patient suffer", "What was the reason that the patient is admitted to the clinic?"),
    "DRUG" -> Seq("Which drug?", "Which is the drug?", "What is the drug?", "Which drug does he use?", "Which drug does she use?", "Which drug do I use?", "Which drug is prescribed for a symptom?"),
    "ADMISSION_DATE" -> Seq("When did patient admitted to a clinic?"),
    "PATIENT_AGE" -> Seq("How old is the patient?", "What is the gae of the patient?")
  ))
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("zero_shot_ner")
  .setPredictionThreshold(0.1)

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "zero_shot_ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    zeroShotNer, 
    nerConverter))

val textList = Seq(
  "The doctor pescribed Majezik for my severe headache.",
  "The patient was admitted to the hospital for his colon cancer.",
  "27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis."
).toDS.toDF("text")

val result = pipeline.fit(textList).transform(textList)

  
+------------------------------------------------+--------------+----------+
|                                           chunk|     ner_label|confidence|
+------------------------------------------------+--------------+----------+
|                                         Majezik|          DRUG|0.64671576|
|                                 severe headache|       PROBLEM| 0.5526346|
|                                    colon cancer|       PROBLEM| 0.8898498|
|                                    27 years old|   PATIENT_AGE| 0.6943085|
|                                         Sep 1st|ADMISSION_DATE|0.95646095|
|a right-sided pleural effusion for thoracentesis|       PROBLEM|0.50026613|
+------------------------------------------------+--------------+----------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val textsplitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val zero_shot_ner = ZeroShotNerModel.pretrained("finner_roberta_zeroshot", "en", "finance/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("zero_shot_ner")
  .setEntityDefinitions(
    Map(
      "DATE" -> Seq('When was the company acquisition?', 'When was the company purchase agreement?'),
      "ORG" -> Seq("Which company was acquired?"),
      "PRODUCT" -> Seq("Which product?"),
      "PROFIT_INCREASE" -> Seq("How much has the gross profit increased?"),
      "REVENUES_DECLINED" -> Seq("How much has the revenues declined?"),
      "OPERATING_LOSS_2020" -> Seq("Which was the operating loss in 2020"),
      "OPERATING_LOSS_2019" -> Seq("Which was the operating loss in 2019")
    )
  )

val ner_converter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "zero_shot_ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  textsplitter,
  tokenizer,
  zero_shot_ner,
  ner_converter
))

val text_list = Seq(
  "In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
  "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
  "While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020 as compared to 2019.",
  "We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of $7,738,193 in 2019."
).toDS.toDF("text")

val result = pipeline.fit(text_list).transform(text_list)

+------------------+-------------------+
|chunk             |ner_label          |
+------------------+-------------------+
|March 2012        |DATE               |
|Vertro            |ORG                |
|ALOT              |PRODUCT            |
|February 2017     |DATE               |
|NetSeer           |ORG                |
|81.4%             |PROFIT_INCREASE    |
|27%               |REVENUES_DECLINED  |
|$8,048,581 million|OPERATING_LOSS_2020|
|$7,738,193        |OPERATING_LOSS_2019|
|2019              |DATE               |
+------------------+-------------------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val textSplitter = new TextSplitter()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val zeroShotNer = ZeroShotNerModel.pretrained("legner_roberta_zeroshot", "en", "legal/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("zero_shot_ner")
  .setEntityDefinitions(Map(
    "DATE" -> Seq("When was the company acquisition?", "When was the company purchase agreement?", "When was the agreement?"),
    "ORG" -> Seq("Which company?"),
    "STATE" -> Seq("Which state?"),
    "AGREEMENT" -> Seq("What kind of agreement?"),
    "LICENSE" -> Seq("What kind of license?"),
    "LICENSE_RECIPIENT" -> Seq("To whom the license is granted?")
  ))

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "zero_shot_ner"))
  .setOutputCol("ner_chunk")


val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    textSplitter,
    tokenizer,
    zeroShotNer,
    nerConverter
  ))

val textList = Seq(
  "In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
  "In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
  "This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company (the 'Licensee')",
  "The Company hereby grants to Seller a perpetual, non-exclusive, royalty-free license"
).toDS.toDF("text")

val result = pipeline.fit(textList).transform(textList)

+-------------------------------------+-----------------+
|chunk                                |ner_label        |
+-------------------------------------+-----------------+
|March 2012                           |DATE             |
|Vertro, Inc                          |ORG              |
|February 2017                        |DATE             |
|asset purchase agreement             |AGREEMENT        |
|NetSeer                              |ORG              |
|INTELLECTUAL PROPERTY                |AGREEMENT        |
|December 31, 2018                    |DATE             |
|Armstrong Flooring                   |LICENSE_RECIPIENT|
|Delaware                             |STATE            |
|AFI Licensing LLC, a Delaware company|LICENSE_RECIPIENT|
|Seller                               |LICENSE_RECIPIENT|
|perpetual                            |LICENSE          |
|non-exclusive                        |LICENSE          |
|royalty-free                         |LICENSE          |
+-------------------------------------+-----------------+

ZeroShotRelationExtractionModel

Model

ZeroShotRelationExtractionModel implements zero-shot binary relation extraction by utilizing BERT transformer models trained on the NLI (Natural Language Inference) task.

The model input consists of documents/sentences and paired NER chunks, usually obtained by RENerChunksFilter. The definitions of the relations to extract are given by a dictionary structure, specifying a set of statements regarding the relationship of named entities.

These statements are automatically appended to each document in the dataset, and the NLI model is used to determine whether a particular relationship holds between two entities.
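To make the mechanism concrete, the hypothesis scored by the NLI model is simply the relation template with the entity placeholders replaced by the detected chunks; a plain-Python illustration (not the library's internal code):

# Illustration only: how a hypothesis such as "Paracetamol improves sickness."
# is formed from a template and two NER chunks before being scored by the NLI model.
template = "{DRUG} improves {PROBLEM}."
chunk1_text, chunk1_label = "Paracetamol", "DRUG"
chunk2_text, chunk2_label = "sickness", "PROBLEM"

hypothesis = template.replace("{%s}" % chunk1_label, chunk1_text)\
                     .replace("{%s}" % chunk2_label, chunk2_text)

print(hypothesis)  # Paracetamol improves sickness.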

Parameters:

  • relationalCategories: A dictionary with definitions of the relational categories. The keys of the dictionary are the relation labels and the values are lists of hypothesis templates.

  • predictionThreshold: Minimal confidence score to encode a relation (Default: 0.5)

  • multiLabel: Whether or not a pair of entities can be categorized by multiple relations (Default: False).

All the parameters can be set using the corresponding set method in camel case. For example, .setMultiLabel().
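For instance, a minimal sketch of configuring a pretrained clinical model with these parameters; the relation templates are shortened from the example below, and .setPredictionThreshold follows the camel-case rule stated above.

from johnsnowlabs import medical

re_model = medical.ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models")\
    .setInputCols(["re_ner_chunks", "sentences"])\
    .setOutputCol("relations")\
    .setMultiLabel(True)\
    .setPredictionThreshold(0.5)\
    .setRelationalCategories({
        "ADE": ["{DRUG} causes {PROBLEM}."],
        "REVEAL": ["{TEST} reveals {PROBLEM}."]
    })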

For available pretrained models please see the Models Hub.

Input Annotator Types: CHUNK, DOCUMENT

Output Annotator Type: CATEGORY

Python API: ZeroShotRelationExtractionModel Scala API: ZeroShotRelationExtractionModel Notebook: ZeroShotRelationExtractionModelNotebook
Show Example

from johnsnowlabs import nlp, medical

documenter = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = nlp.WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

ner_clinical = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_clinical")

ner_clinical_converter = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_clinical"])\
    .setOutputCol("ner_clinical_chunks")\
    .setWhiteList(["PROBLEM", "TEST"])  # PROBLEM-TEST-TREATMENT

ner_posology = medical.NerModel.pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner_posology")

ner_posology_converter = medical.NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_posology"])\
    .setOutputCol("ner_posology_chunks")\
    .setWhiteList(["DRUG"]) # DRUG-FREQUENCY-DOSAGE-DURATION-FORM-ROUTE-STRENGTH

chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("ner_clinical_chunks", "ner_posology_chunks")\
    .setOutputCol("merged_ner_chunks")

## ZERO-SHOT RE Starting...

pos_tagger = nlp.PerceptronModel().pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
    .setInputCols(["document", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

re_ner_chunk_filter = medical.RENerChunksFilter().setRelationPairs(["problem-test", "problem-drug"])\
    .setMaxSyntacticDistance(4)\
    .setDocLevelRelations(False)\
    .setInputCols(["merged_ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")

re_model = medical.ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models")\
    .setInputCols(["re_ner_chunks", "sentences"])\
    .setOutputCol("relations")\
    .setMultiLabel(True)\
    .setRelationalCategories(
        {
            "ADE": ["{DRUG} causes {PROBLEM}."],
            "IMPROVE": ["{DRUG} improves {PROBLEM}.", "{DRUG} cures {PROBLEM}."],
            "REVEAL": ["{TEST} reveals {PROBLEM}."],
        }
    )

pipeline = nlp.Pipeline(
    stages = [
        documenter,
        sentencer,
        tokenizer,
        words_embedder,
        ner_clinical,
        ner_clinical_converter,
        ner_posology,
        ner_posology_converter,
        chunk_merger,
        pos_tagger,
        dependency_parser,
        re_ner_chunk_filter,
        re_model
    ]
)

text = "Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer."

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['0']['hypothesis']").alias("hypothesis"),
    F.expr("cols['0']['nli_prediction']").alias("nli_prediction"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)

+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+------------------------------+--------------+--------+----------+
|sentence|entity1_begin|entity1_end|     chunk1|entity1|entity2_begin|entity2_end|  chunk2|entity2|                    hypothesis|nli_prediction|relation|confidence|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+------------------------------+--------------+--------+----------+
|       0|            0|         10|Paracetamol|   DRUG|           38|         45|sickness|PROBLEM|Paracetamol improves sickness.|        entail| IMPROVE|0.98819494|
|       0|            0|         10|Paracetamol|   DRUG|           26|         33|headache|PROBLEM|Paracetamol improves headache.|        entail| IMPROVE| 0.9929625|
|       1|           48|         58|An MRI test|   TEST|           80|         85|  cancer|PROBLEM|   An MRI test reveals cancer.|        entail|  REVEAL| 0.9760039|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+------------------------------+--------------+--------+----------+
from johnsnowlabs import nlp, finance

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_financial_small", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

re_model = finance.ZeroShotRelationExtractionModel.pretrained("finre_zero_shot", "en", "finance/models")\
    .setInputCols(["ner_chunk", "sentence"])\
    .setOutputCol("relations")\
    .setMultiLabel(False)\
    .setRelationalCategories(
        {
            "profit_decline_by": [
                "{PROFIT_DECLINE} decreased by {AMOUNT} from",
                "{PROFIT_DECLINE} decreased by {AMOUNT} to",
            ],
            "profit_decline_by_per": [
                "{PROFIT_DECLINE} decreased by a {PERCENTAGE} from",
                "{PROFIT_DECLINE} decreased by a {PERCENTAGE} to",
            ],
            "profit_decline_from": [
                "{PROFIT_DECLINE} decreased from {AMOUNT}",
                "{PROFIT_DECLINE} decreased from {AMOUNT} for the year",
            ],
            "profit_decline_from_per": [
                "{PROFIT_DECLINE} decreased from {PERCENTAGE} to",
                "{PROFIT_DECLINE} decreased from {PERCENTAGE} to a total of",
            ],
            "profit_decline_to": ["{PROFIT_DECLINE} to {AMOUNT}"],
            "profit_increase_from": ["{PROFIT_INCREASE} from {AMOUNT}"],
            "profit_increase_to": ["{PROFIT_INCREASE} to {AMOUNT}"],
            "expense_decrease_by": ["{EXPENSE_DECREASE} decreased by {AMOUNT}"],
            "expense_decrease_by_per": ["{EXPENSE_DECREASE} decreased by a {PERCENTAGE}"],
            "expense_decrease_from": ["{EXPENSE_DECREASE} decreased from {AMOUNT}"],
            "expense_decrease_to": [
                "{EXPENSE_DECREASE} for a total of {AMOUNT} for the fiscal year"
            ],
            "has_date": [
                "{AMOUNT} for the fiscal year ended {FISCAL_YEAR}",
                "{PERCENTAGE} for the fiscal year ended {FISCAL_YEAR}",
            ],
        }
    )

pipeline = nlp.Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter,
        re_model,
    ]
)

text = """License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019.  Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['0']['hypothesis']").alias("hypothesis"),
    F.expr("cols['0']['nli_prediction']").alias("nli_prediction"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)

+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
|sentence|entity1_begin|entity1_end|                      chunk1|         entity1|entity2_begin|entity2_end|           chunk2|    entity2|                                              hypothesis|nli_prediction|             relation|confidence|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
|       1|          227|        238|                25.6 million|          AMOUNT|          316|        332|December 31, 2019|FISCAL_YEAR|25.6 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.8744757|
|       0|           31|         32|                          40|      PERCENTAGE|          153|        169|December 31, 2019|FISCAL_YEAR|          40 for the fiscal year ended December 31, 2019|        entail|             has_date| 0.7889032|
|       5|          799|        826|Sales and marketing expenses|EXPENSE_DECREASE|          923|        933|      7.5 million|     AMOUNT| Sales and marketing expenses decreased from 7.5 million|        entail|expense_decrease_from| 0.9770538|
|       0|           59|         69|                 0.7 million|          AMOUNT|           90|        106|December 31, 2020|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2020|        entail|             has_date|0.67187774|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          227|        238|     25.6 million|     AMOUNT|                        Services revenue to 25.6 million|        entail|   profit_increase_to| 0.9674029|
|       0|           31|         32|                          40|      PERCENTAGE|           90|        106|December 31, 2020|FISCAL_YEAR|          40 for the fiscal year ended December 31, 2020|        entail|             has_date|0.77800345|
|       5|          838|        839|                          20|      PERCENTAGE|          898|        914|December 31, 2020|FISCAL_YEAR|          20 for the fiscal year ended December 31, 2020|        entail|             has_date|0.85455483|
|       3|          561|        572|                travel costs|EXPENSE_DECREASE|          579|        589|      0.4 million|     AMOUNT|                   travel costs decreased by 0.4 million|        entail|  expense_decrease_by| 0.9946776|
|       0|           42|         52|                 0.5 million|          AMOUNT|          153|        169|December 31, 2019|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.7756689|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          209|        219|      1.1 million|     AMOUNT|                       Services revenue from 1.1 million|        entail| profit_increase_from|0.96610945|
|       2|          408|        418|                 0.1 million|          AMOUNT|          521|        537|December 31, 2019|FISCAL_YEAR| 0.1 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.9083247|
|       5|          849|        859|                 1.5 million|          AMOUNT|          898|        914|December 31, 2020|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2020|        entail|             has_date| 0.7528142|
|       5|          849|        859|                 1.5 million|          AMOUNT|          954|        970|December 31, 2019|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2019|        entail|             has_date|0.80734617|
|       0|           42|         52|                 0.5 million|          AMOUNT|           90|        106|December 31, 2020|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2020|        entail|             has_date| 0.7157578|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          284|        295|     24.5 million|     AMOUNT|                        Services revenue to 24.5 million|        entail|   profit_increase_to| 0.8597209|
|       0|           59|         69|                 0.7 million|          AMOUNT|          153|        169|December 31, 2019|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2019|        entail|             has_date|0.74845695|
|       1|          199|        199|                           4|      PERCENTAGE|          259|        275|December 31, 2020|FISCAL_YEAR|           4 for the fiscal year ended December 31, 2020|        entail|             has_date|0.84127575|
|       2|          424|        424|                           2|      PERCENTAGE|          465|        481|December 31, 2020|FISCAL_YEAR|           2 for the fiscal year ended December 31, 2020|        entail|             has_date| 0.8046481|
|       2|          424|        424|                           2|      PERCENTAGE|          521|        537|December 31, 2019|FISCAL_YEAR|           2 for the fiscal year ended December 31, 2019|        entail|             has_date| 0.8485104|
|       0|            0|         19|        License fees revenue|  PROFIT_DECLINE|           31|         32|               40| PERCENTAGE|               License fees revenue decreased by a 40 to|        entail|profit_decline_by_per| 0.9948003|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
only showing top 20 rows
from johnsnowlabs import nlp, legal

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
    .setInputCols("document")\
    .setOutputCol("token")

tokenClassifier = legal.BertForTokenClassification.pretrained('legner_obligations','en', 'legal/models')\
    .setInputCols("token", "document")\
    .setOutputCol("ner")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(True)

ner_converter = legal.NerConverterInternal()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

re_model = legal.ZeroShotRelationExtractionModel.pretrained("legre_zero_shot", "en", "legal/models")\
    .setInputCols(["ner_chunk", "document"]) \
    .setOutputCol("relations")

re_model.setRelationalCategories({
    "GRANTS_TO": ["{OBLIGATION_SUBJECT} grants {OBLIGATION_INDIRECT_OBJECT}"],
    "GRANTS": ["{OBLIGATION_SUBJECT} grants {OBLIGATION_ACTION}"]
})

pipeline = nlp.Pipeline(stages = [
                documentAssembler,
                tokenizer,
                tokenClassifier, 
                ner_converter,
                re_model
               ])

text = """Arizona Copyright Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Copyrights for use in the Company Field throughout the world."""

data = spark.createDataFrame([[text]]).toDF("text")

result = pipeline.fit(data).transform(data)

from pyspark.sql import functions as F

result.select(
    F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
    F.expr("cols['0']['sentence']").alias("sentence"),
    F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
    F.expr("cols['0']['entity1_end']").alias("entity1_end"),
    F.expr("cols['0']['chunk1']").alias("chunk1"),
    F.expr("cols['0']['entity1']").alias("entity1"),
    F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
    F.expr("cols['0']['entity2_end']").alias("entity2_end"),
    F.expr("cols['0']['chunk2']").alias("chunk2"),
    F.expr("cols['0']['entity2']").alias("entity2"),
    F.expr("cols['0']['hypothesis']").alias("hypothesis"),
    F.expr("cols['0']['nli_prediction']").alias("nli_prediction"),
    F.expr("cols['1']").alias("relation"),
    F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)

+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
|sentence|entity1_begin|entity1_end| chunk1|           entity1|entity2_begin|entity2_end|       chunk2|                   entity2|                  hypothesis|nli_prediction| relation|confidence|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
|       0|           80|         86|Arizona|OBLIGATION_SUBJECT|          109|        115|      Company|OBLIGATION_INDIRECT_OBJECT|      Arizona grants Company|        entail|GRANTS_TO| 0.9535338|
|       0|           80|         86|Arizona|OBLIGATION_SUBJECT|           88|        100|hereby grants|         OBLIGATION_ACTION|Arizona grants hereby grants|        entail|   GRANTS| 0.9873099|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+

import spark.implicits._

val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentencer = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentences")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentences"))
    .setOutputCol("tokens")

val wordsEmbedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("embeddings")

val nerClinical = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_clinical")

val nerClinicalConverter = new NerConverterInternal()
    .setInputCols(Array("sentences", "tokens", "ner_clinical"))
    .setOutputCol("ner_clinical_chunks")
    .setWhiteList(Array("PROBLEM", "TEST"))

val nerPosology = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens", "embeddings"))
    .setOutputCol("ner_posology")

val nerPosologyConverter = new NerConverterInternal()
    .setInputCols(Array("sentences", "tokens", "ner_posology"))
    .setOutputCol("ner_posology_chunks")
    .setWhiteList(Array("DRUG"))

val chunkMerger = new ChunkMergeApproach()
    .setInputCols(Array("ner_clinical_chunks", "ner_posology_chunks"))
    .setOutputCol("merged_ner_chunks")

val posTagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
    .setInputCols(Array("sentences", "tokens"))
    .setOutputCol("pos_tags")

val dependencyParser = DependencyParserModel.pretrained("dependency_conllu", "en")
    .setInputCols(Array("document", "pos_tags", "tokens"))
    .setOutputCol("dependencies")

val reNerChunkFilter = new RENerChunksFilter()
    .setRelationPairs(Array("problem-test", "problem-drug"))
    .setMaxSyntacticDistance(4)
    .setDocLevelRelations(false)
    .setInputCols(Array("merged_ner_chunks", "dependencies"))
    .setOutputCol("re_ner_chunks")

val reModel = ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models")
    .setInputCols(Array("re_ner_chunks", "sentences"))
    .setOutputCol("relations")
    .setMultiLabel(true)
    .setRelationalCategories(Map(
        "ADE" -> Array("{DRUG} causes {PROBLEM}."),
        "IMPROVE" -> Array("{DRUG} improves {PROBLEM}.", "{DRUG} cures {PROBLEM}."),
        "REVEAL" -> Array("{TEST} reveals {PROBLEM}.")
    ))

val pipeline = new Pipeline().setStages(Array(
    documenter,
    sentencer,
    tokenizer,
    wordsEmbedder,
    nerClinical,
    nerClinicalConverter,
    nerPosology,
    nerPosologyConverter,
    chunkMerger,
    posTagger,
    dependencyParser,
    reNerChunkFilter,
    reModel
))

val text = "Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer."

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)
 
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+--------------------+--------------+--------+----------+
|sentence|entity1_begin|entity1_end|     chunk1|entity1|entity2_begin|entity2_end|  chunk2|entity2|          hypothesis|nli_prediction|relation|confidence|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+--------------------+--------------+--------+----------+
|       0|            0|         10|Paracetamol|   DRUG|           38|         45|sickness|PROBLEM|Paracetamol impro...|        entail| IMPROVE|0.98819494|
|       0|            0|         10|Paracetamol|   DRUG|           26|         33|headache|PROBLEM|Paracetamol impro...|        entail| IMPROVE| 0.9929625|
|       1|           48|         58|An MRI test|   TEST|           80|         85|  cancer|PROBLEM|An MRI test revea...|        entail|  REVEAL| 0.9760039|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+--------------------+--------------+--------+----------+
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val nerModel = FinanceNerModel.pretrained("finner_financial_small", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val reModel = ZeroShotRelationExtractionModel.pretrained("finre_zero_shot", "en", "finance/models")
  .setInputCols(Array("ner_chunk", "sentence"))
  .setOutputCol("relations")
  .setMultiLabel(false)
  .setRelationalCategories(Map(
  "profit_decline_by" -> Array(
    "{PROFIT_DECLINE} decreased by {AMOUNT} from",
    "{PROFIT_DECLINE} decreased by {AMOUNT} to"
  ),
  "profit_decline_by_per" -> Array(
    "{PROFIT_DECLINE} decreased by a {PERCENTAGE} from",
    "{PROFIT_DECLINE} decreased by a {PERCENTAGE} to"
  ),
  "profit_decline_from" -> Array(
    "{PROFIT_DECLINE} decreased from {AMOUNT}",
    "{PROFIT_DECLINE} decreased from {AMOUNT} for the year"
  ),
  "profit_decline_from_per" -> Array(
    "{PROFIT_DECLINE} decreased from {PERCENTAGE} to",
    "{PROFIT_DECLINE} decreased from {PERCENTAGE} to a total of"
  ),
  "profit_decline_to" -> Array("{PROFIT_DECLINE} to {AMOUNT}"),
  "profit_increase_from" -> Array("{PROFIT_INCREASE} from {AMOUNT}"),
  "profit_increase_to" -> Array("{PROFIT_INCREASE} to {AMOUNT}"),
  "expense_decrease_by" -> Array("{EXPENSE_DECREASE} decreased by {AMOUNT}"),
  "expense_decrease_by_per" -> Array("{EXPENSE_DECREASE} decreased by a {PERCENTAGE}"),
  "expense_decrease_from" -> Array("{EXPENSE_DECREASE} decreased from {AMOUNT}"),
  "expense_decrease_to" -> Array("{EXPENSE_DECREASE} for a total of {AMOUNT} for the fiscal year"),
  "has_date" -> Array(
    "{AMOUNT} for the fiscal year ended {FISCAL_YEAR}",
    "{PERCENTAGE} for the fiscal year ended {FISCAL_YEAR}"
  )
))

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    nerModel,
    nerConverter,
    reModel
  ))

val text = """License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019.  Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)

+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
|sentence|entity1_begin|entity1_end|                      chunk1|         entity1|entity2_begin|entity2_end|           chunk2|    entity2|                                              hypothesis|nli_prediction|             relation|confidence|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
|       1|          227|        238|                25.6 million|          AMOUNT|          316|        332|December 31, 2019|FISCAL_YEAR|25.6 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.8744757|
|       0|           31|         32|                          40|      PERCENTAGE|          153|        169|December 31, 2019|FISCAL_YEAR|          40 for the fiscal year ended December 31, 2019|        entail|             has_date| 0.7889032|
|       5|          799|        826|Sales and marketing expenses|EXPENSE_DECREASE|          923|        933|      7.5 million|     AMOUNT| Sales and marketing expenses decreased from 7.5 million|        entail|expense_decrease_from| 0.9770538|
|       0|           59|         69|                 0.7 million|          AMOUNT|           90|        106|December 31, 2020|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2020|        entail|             has_date|0.67187774|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          227|        238|     25.6 million|     AMOUNT|                        Services revenue to 25.6 million|        entail|   profit_increase_to| 0.9674029|
|       0|           31|         32|                          40|      PERCENTAGE|           90|        106|December 31, 2020|FISCAL_YEAR|          40 for the fiscal year ended December 31, 2020|        entail|             has_date|0.77800345|
|       5|          838|        839|                          20|      PERCENTAGE|          898|        914|December 31, 2020|FISCAL_YEAR|          20 for the fiscal year ended December 31, 2020|        entail|             has_date|0.85455483|
|       3|          561|        572|                travel costs|EXPENSE_DECREASE|          579|        589|      0.4 million|     AMOUNT|                   travel costs decreased by 0.4 million|        entail|  expense_decrease_by| 0.9946776|
|       0|           42|         52|                 0.5 million|          AMOUNT|          153|        169|December 31, 2019|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.7756689|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          209|        219|      1.1 million|     AMOUNT|                       Services revenue from 1.1 million|        entail| profit_increase_from|0.96610945|
|       2|          408|        418|                 0.1 million|          AMOUNT|          521|        537|December 31, 2019|FISCAL_YEAR| 0.1 million for the fiscal year ended December 31, 2019|        entail|             has_date| 0.9083247|
|       5|          849|        859|                 1.5 million|          AMOUNT|          898|        914|December 31, 2020|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2020|        entail|             has_date| 0.7528142|
|       5|          849|        859|                 1.5 million|          AMOUNT|          954|        970|December 31, 2019|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2019|        entail|             has_date|0.80734617|
|       0|           42|         52|                 0.5 million|          AMOUNT|           90|        106|December 31, 2020|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2020|        entail|             has_date| 0.7157578|
|       1|          172|        187|            Services revenue| PROFIT_INCREASE|          284|        295|     24.5 million|     AMOUNT|                        Services revenue to 24.5 million|        entail|   profit_increase_to| 0.8597209|
|       0|           59|         69|                 0.7 million|          AMOUNT|          153|        169|December 31, 2019|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2019|        entail|             has_date|0.74845695|
|       1|          199|        199|                           4|      PERCENTAGE|          259|        275|December 31, 2020|FISCAL_YEAR|           4 for the fiscal year ended December 31, 2020|        entail|             has_date|0.84127575|
|       2|          424|        424|                           2|      PERCENTAGE|          465|        481|December 31, 2020|FISCAL_YEAR|           2 for the fiscal year ended December 31, 2020|        entail|             has_date| 0.8046481|
|       2|          424|        424|                           2|      PERCENTAGE|          521|        537|December 31, 2019|FISCAL_YEAR|           2 for the fiscal year ended December 31, 2019|        entail|             has_date| 0.8485104|
|       0|            0|         19|        License fees revenue|  PROFIT_DECLINE|           31|         32|               40| PERCENTAGE|               License fees revenue decreased by a 40 to|        entail|profit_decline_by_per| 0.9948003|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
only showing top 20 rows
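
Each row in the table above corresponds to one candidate pair of NER chunks: the pair is verbalized into a hypothesis by filling a relational-category template with the chunk texts, the NLI model decides whether the source sentence entails that hypothesis, and entailed pairs are labelled with the category's relation. Below is a minimal sketch of the template filling; the `verbalize` helper is hypothetical (illustration only, not a library API) and the `has_date` template is inferred from the hypothesis column above.

// Hypothetical helper, for illustration only: substitute chunk texts into a
// relational-category template of the form "{ENTITY_LABEL} ... {OTHER_LABEL}".
def verbalize(template: String, chunks: Map[String, String]): String =
  chunks.foldLeft(template) { case (hyp, (label, text)) => hyp.replace(s"{$label}", text) }

verbalize(
  "{AMOUNT} for the fiscal year ended {FISCAL_YEAR}",   // template assumed for has_date
  Map("AMOUNT" -> "25.6 million", "FISCAL_YEAR" -> "December 31, 2019"))
// => "25.6 million for the fiscal year ended December 31, 2019"
// The NLI model classifies this hypothesis as "entail" against the source sentence,
// so the chunk pair is labelled with the corresponding relation, has_date.
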
import spark.implicits._
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
// The remaining annotators (LegalBertForTokenClassification, NerConverterInternal,
// ZeroShotRelationExtractionModel) are licensed Legal NLP classes; import them from the
// Legal NLP packages bundled with your licensed installation.

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val tokenClassifier = LegalBertForTokenClassification.pretrained("legner_obligations", "en", "legal/models")
    .setInputCols(Array("token", "document"))
    .setOutputCol("ner")
    .setMaxSentenceLength(512)
    .setCaseSensitive(true)

val nerConverter = new NerConverterInternal()
    .setInputCols(Array("document", "token", "ner"))
    .setOutputCol("ner_chunk")

val reModel = ZeroShotRelationExtractionModel.pretrained("legre_zero_shot", "en", "legal/models")
    .setInputCols(Array("ner_chunk", "document"))
    .setOutputCol("relations")

reModel.setRelationalCategories(Map(
    "GRANTS_TO" -> Array("{OBLIGATION_SUBJECT} grants {OBLIGATION_INDIRECT_OBJECT}"),
    "GRANTS" -> Array("{OBLIGATION_SUBJECT} grants {OBLIGATION_ACTION}")
))

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    tokenizer,
    tokenClassifier,
    nerConverter,
    reModel
))

val text = """Arizona Copyright Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Copyrights for use in the Company Field throughout the world."""

val data = Seq(text).toDF("text")

val result = pipeline.fit(data).transform(data)


+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
|sentence|entity1_begin|entity1_end| chunk1|           entity1|entity2_begin|entity2_end|       chunk2|                   entity2|                  hypothesis|nli_prediction| relation|confidence|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
|       0|           80|         86|Arizona|OBLIGATION_SUBJECT|          109|        115|      Company|OBLIGATION_INDIRECT_OBJECT|      Arizona grants Company|        entail|GRANTS_TO| 0.9535338|
|       0|           80|         86|Arizona|OBLIGATION_SUBJECT|           88|        100|hereby grants|         OBLIGATION_ACTION|Arizona grants hereby grants|        entail|   GRANTS| 0.9873099|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
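
The flat table above is a convenience view of the `relations` output column. A minimal sketch of how to produce it, assuming each relation annotation stores the fields shown in the header in its metadata map and the relation label in its result field:

import org.apache.spark.sql.functions.{col, explode}

result
  .select(explode(col("relations")).alias("rel"))   // one row per extracted relation
  .select(
    col("rel.metadata")("sentence").alias("sentence"),
    col("rel.metadata")("entity1_begin").alias("entity1_begin"),
    col("rel.metadata")("entity1_end").alias("entity1_end"),
    col("rel.metadata")("chunk1").alias("chunk1"),
    col("rel.metadata")("entity1").alias("entity1"),
    col("rel.metadata")("entity2_begin").alias("entity2_begin"),
    col("rel.metadata")("entity2_end").alias("entity2_end"),
    col("rel.metadata")("chunk2").alias("chunk2"),
    col("rel.metadata")("entity2").alias("entity2"),
    col("rel.metadata")("hypothesis").alias("hypothesis"),
    col("rel.metadata")("nli_prediction").alias("nli_prediction"),
    col("rel.result").alias("relation"),             // relation label, e.g. GRANTS_TO
    col("rel.metadata")("confidence").alias("confidence"))
  .show(truncate = false)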