Enterprise NLP Annotators

 

A Spark NLP Enterprise license includes access to unique annotators. At the Spark NLP Workshop you can see different types of annotators in action.

By clicking on any annotator, you will see different sections:

  • The Approach, the class used to train models.
  • The Model, the class used for inference with pretrained models.

Also, for most of the annotators, you will find examples for the different enterprise libraries:

  • Healthcare NLP
  • Finance NLP
  • Legal NLP

Check out the Spark NLP Annotators page for more information on how to read this page.

Available Annotators

Annotators Description
AssertionDL AssertionDL is a deep learning-based approach used to extract Assertion Status from extracted entities and text.
AssertionFilterer Filters entities coming from ASSERTION type annotations and returns the CHUNKS.
AssertionLogReg Logistic Regression is used to extract Assertion Status from extracted entities and text.
Chunk2Token A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).
ChunkEntityResolver Returns a normalized entity for a particular trained ontology / curated dataset (e.g. clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc).
ChunkFilterer Filters entities coming from CHUNK annotations.
ChunkKeyPhraseExtraction Uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text.
ChunkMerge Merges entities coming from different CHUNK annotations.
ContextualParser Extracts entity from a document based on user defined rules.
DeIdentification Deidentifies Input Annotations of types DOCUMENT, TOKEN and CHUNK, by either masking or obfuscating the given CHUNKS.
DocumentLogRegClassifier Classifies documents with a Logistic Regression algorithm.
DrugNormalizer Annotator which normalizes raw text from documents, e.g. scraped web pages or XML documents.
FeaturesAssembler Collects features from different columns.
GenericClassifier Creates a generic single-label classifier which uses pre-generated TensorFlow graphs.
IOBTagger Merges token tags and NER labels from chunks in the specified format.
NerChunker Extracts phrases that fit into a known pattern using the NER tags.
NerConverterInternal Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label.
NerDisambiguator Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB).
MedicalNer This Named Entity Recognition annotator is a generic NER model based on Neural Networks.
QuestionAnswering GPT-based model for answering questions given a context.
RENerChunksFilter Filters and outputs combinations of relations between extracted entities, for further processing.
ReIdentification Reidentifies obfuscated entities by DeIdentification.
RelationExtraction Extracts and classifies instances of relations between named entities.
RelationExtractionDL Extracts and classifies instances of relations between named entities.
SentenceEntityResolver Returns the normalized entity for a particular trained ontology / curated dataset (e.g. clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc) based on sentence embeddings.
Summarizer Helps to quickly summarize complex medical information.
TextGenerator Uses the basic BioGPT model to perform various tasks related to medical text abstraction.
TFGraphBuilder Creates TensorFlow graphs.

AnnotationMerger

Model

Merges annotations from different pipeline steps that share the same annotation type into a unified annotation. Annotation types that can be merged include:

  • document (e.g., output of DocumentAssembler annotator)
  • token (e.g., output of Tokenizer annotator)
  • word_embeddings (e.g., output of WordEmbeddingsModel annotator)
  • sentence_embeddings (e.g., output of BertSentenceEmbeddings annotator)
  • category (e.g., output of RelationExtractionModel annotator)
  • date (e.g., output of DateMatcher annotator)
  • sentiment (e.g., output of SentimentDLModel annotator)
  • pos (e.g., output of PerceptronModel annotator)
  • chunk (e.g., output of NerConverter annotator)
  • named_entity (e.g., output of NerDLModel annotator)
  • regex (e.g., output of RegexTokenizer annotator)
  • dependency (e.g., output of DependencyParserModel annotator)
  • language (e.g., output of LanguageDetectorDL annotator)
  • keyword (e.g., output of YakeModel annotator)

Input Annotator Types: ANY

Output Annotator Type: ANY

Python API: AnnotationMerger Scala API: AnnotationMerger
Show Example

# Create the pipeline with two RE models
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from pyspark.ml import Pipeline
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentencer = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

pos_ner_tagger = MedicalNerModel()\
    .pretrained("ner_posology", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_pos")

pos_ner_chunker = NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_pos"])\
    .setOutputCol("pos_ner_chunks")

dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

pos_reModel = RelationExtractionModel()\
    .pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "pos_ner_chunks", "dependencies"])\
    .setOutputCol("pos_relations")\
    .setMaxSyntacticDistance(4)

ade_ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ade_ner_tags")  

ade_ner_chunker = NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ade_ner_tags"])\
    .setOutputCol("ade_ner_chunks")

ade_reModel = RelationExtractionModel()\
    .pretrained("re_ade_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ade_ner_chunks", "dependencies"])\
    .setOutputCol("ade_relations")\
    .setMaxSyntacticDistance(10)\
    .setRelationPairs(["drug-ade", "ade-drug"])

annotation_merger = AnnotationMerger()\
    .setInputCols("ade_relations", "pos_relations")\
    .setInputType("category")\
    .setOutputCol("all_relations")

merger_pipeline = Pipeline(stages=[
    documenter,
    sentencer,
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    pos_ner_tagger,
    pos_ner_chunker,
    dependency_parser,
    pos_reModel,
    ade_ner_tagger,
    ade_ner_chunker,
    ade_reModel,
    annotation_merger
])

empty_df = spark.createDataFrame([[""]]).toDF("text")
merger_model = merger_pipeline.fit(empty_df)

# Show example result
text = """
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands.
"""
data = spark.createDataFrame([[text]]).toDF("text")

result = merger_model.transform(data)
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|           sentences|              tokens|          embeddings|            pos_tags|             ner_pos|      pos_ner_chunks|        dependencies|       pos_relations|        ade_ner_tags|      ade_ner_chunks|       ade_relations|       all_relations|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| The patient was ...|[{document, 0, 26...|[{document, 1, 95...|[{token, 1, 3, Th...|[{word_embeddings...|[{pos, 1, 3, DD, ...|[{named_entity, 1...|[{chunk, 28, 33, ...|[{dependency, 1, ...|[{category, 28, 4...|[{named_entity, 1...|[{chunk, 38, 45, ...|[{category, 134, ...|[{category, 134, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
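As a quick follow-up (an illustrative sketch using only standard Spark SQL expressions), the merged column can be exploded to inspect each relation produced by either RE model:

# Illustrative follow-up: explode the merged column to inspect individual relations
result.selectExpr("explode(all_relations) as relation") \
      .selectExpr("relation.result as relation_label", "relation.metadata as metadata") \
      .show(truncate=False)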

AssertionChunkConverter

Model

This annotator creates a CHUNK column with metadata useful for training an Assertion Status Detection model (see AssertionDL).

In some cases, creating the chunk column from token indices alone can be error-prone and lead to a loss of data when training assertion status models.

The AssertionChunkConverter annotator uses both the begin and end indices of the tokens as input to add more robust metadata to the chunk column, improving the reliability of the indices and avoiding data loss.

NOTE: Chunk begin and end indices in the assertion status model training DataFrame can be populated using the new version of the ALAB module.

Input Annotator Types: TOKEN

Output Annotator Type: CHUNK

Python API: AssertionChunkConverter Scala API: AssertionChunkConverter
Show Example

from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from pyspark.ml import Pipeline

data = spark.createDataFrame(
    [
        [
            "An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.",
            "Minnie",
            57,
            64,
        ],
        [
            "After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation ",
            "PCP",
            31,
            34,
        ],
    ]
).toDF("text", "target", "char_begin", "char_end")

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentenceDetector = (
    SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
)

tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("tokens")

converter = (
    AssertionChunkConverter()
    .setInputCols("tokens")
    .setChunkTextCol("target")
    .setChunkBeginCol("char_begin")
    .setChunkEndCol("char_end")
    .setOutputTokenBeginCol("token_begin")
    .setOutputTokenEndCol("token_end")
    .setOutputCol("chunk")
)

pipeline = Pipeline().setStages(
    [document_assembler, sentenceDetector, tokenizer, converter]
)

results = pipeline.fit(data).transform(data)

results.selectExpr(
    "target",
    "char_begin",
    "char_end",
    "token_begin",
    "token_end",
    "tokens[token_begin].result",
    "tokens[token_end].result",
    "target",
    "chunk",
).show(truncate=False)
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk                                         |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57        |64      |10         |10       |Minnie                    |Minnie                  |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP   |31        |34      |5          |5        |PCP                       |PCP                     |PCP   |[{chunk, 31, 33, PCP, {sentence -> 0}, []}]   |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+

AssertionDL

Model

AssertionDL is a deep learning-based approach used to extract Assertion Status from extracted entities and text. AssertionDLModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained by, e.g., a DocumentAssembler, NerConverter and WordEmbeddingsModel. The result is an assertion status annotation for each recognized entity. Possible values include “present”, “absent”, “hypothetical”, “conditional”, “associated_with_other_person”, etc.

For pretrained models please see the Models Hub for available models.

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionDLModel Scala API: AssertionDLModel
Show Example

from johnsnowlabs import * 
# Define pipeline stages to extract NER chunks first
data = spark.createDataFrame([
  ["Patient with severe fever and sore throat"],
  ["Patient shows no stomach pain"],
  ["She was maintained on an epidural and PCA for pain control."]]).toDF("text")
documentAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")
nerModel = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
nerConverter = nlp.NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

# Then a pretrained AssertionDLModel is used to extract the assertion status
clinicalAssertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
  .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
  .setOutputCol("assertion")

assertionPipeline = Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion
])

assertionModel = assertionPipeline.fit(data)

# Show results
result = assertionModel.transform(data)
result.selectExpr("ner_chunk.result", "assertion.result").show(3, truncate=False)
+--------------------------------+--------------------------------+
|result                          |result                          |
+--------------------------------+--------------------------------+
|[severe fever, sore throat]     |[present, present]              |
|[stomach pain]                  |[absent]                        |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
from johnsnowlabs import * 

data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector =  nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer =  nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =  nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
    
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion
    ])

assertionModel = pipeline.fit(data)

# Show results
from pyspark.sql import functions as F

result = assertionModel.transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata, result.assertion.result)).alias("cols"))\
      .select(F.expr("cols['1']['sentence']").alias("sent_id"),
              F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label"),
              F.expr("cols['2']").alias("assertion")).show(truncate=False)
+-------+------------+---------+----------+
|sent_id|chunk       |ner_label|assertion |
+-------+------------+---------+----------+
|0      |McAfee LLC  |ORG      |COMPETITOR|
|0      |Broadcom Inc|ORG      |COMPETITOR|
+-------+------------+---------+----------+
from johnsnowlabs import * 

data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings_ner")

ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
    .setInputCols(["sentence", "token", "embeddings_ner"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setWhiteList(["DOC", "EFFDATE", "PARTY"])

embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings_ass")

assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings_ass"]) \
    .setOutputCol("assertion")


nlpPipeline = Pipeline(stages=[
            document_assembler, 
            sentence_detector,
            tokenizer,
            embeddings_ner,
            ner_model,
            ner_converter,
            embeddings_ass,
            assertion
            ])

assertionModel = nlpPipeline.fit(data)

# Show results
from pyspark.sql import functions as F

result = assertionModel.transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,  
                                     result.ner_chunk.begin, 
                                     result.ner_chunk.end, 
                                     result.ner_chunk.metadata, 
                                     result.assertion.result)).alias("cols"))\
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']").alias("begin"),
              F.expr("cols['2']").alias("end"),
              F.expr("cols['3']['entity']").alias("ner_label"),
              F.expr("cols['4']").alias("assertion")).show(truncate=False)
+-------------------------------+-----+---+---------+---------+
|chunk                          |begin|end|ner_label|assertion|
+-------------------------------+-----+---+---------+---------+
|Intellectual Property Agreement|11   |41 |DOC      |PRESENT  |
|Amazon Inc                     |51   |60 |PARTY    |PRESENT  |
|Atlantic Inc                   |67   |78 |PARTY    |PRESENT  |
+-------------------------------+-----+---+---------+---------+

from johnsnowlabs import * 

// Define pipeline stages to extract NER chunks first
val data = Seq(
  "Patient with severe fever and sore throat",
  "Patient shows no stomach pain",
  "She was maintained on an epidural and PCA for pain control.").toDF("text")
val documentAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence", "token")).setOutputCol("embeddings")
val nerModel = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings")).setOutputCol("ner")
val nerConverter = new nlp.NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")

// Then a pretrained AssertionDLModel is used to extract the assertion status
val clinicalAssertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")
  .setInputCols(Array("sentence", "ner_chunk", "embeddings"))
  .setOutputCol("assertion")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion
))

val assertionModel = assertionPipeline.fit(data)

// Show results
val result = assertionModel.transform(data)
result.selectExpr("ner_chunk.result", "assertion.result").show(3, truncate=false)
+--------------------------------+--------------------------------+
|result                          |result                          |
+--------------------------------+--------------------------------+
|[severe fever, sore throat]     |[present, present]              |
|[stomach pain]                  |[absent]                        |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
from johnsnowlabs import * 

val data = Seq("Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.").toDF("text")

val document_assembler = new nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector =  new nlp.SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer =  new nlp.Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings =  nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new finance.NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings"))
    .setOutputCol("assertion")
    
val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    assertion
    ))

val assertionModel = pipeline.fit(data)
from johnsnowlabs import * 

val data = Seq("This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc.").toDF("text")

val document_assembler = new nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new nlp.Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ner")

val ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings_ner"))
    .setOutputCol("ner")

val ner_converter = new nlp.NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setWhiteList(Array("DOC", "EFFDATE", "PARTY"))

val embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings_ass")

val assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")
    .setInputCols(Array("sentence", "ner_chunk", "embeddings_ass"))
    .setOutputCol("assertion")
    
val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings_ner,
    ner_model,
    ner_converter,
    embeddings_ass,
    assertion
    ))

val assertionModel = pipeline.fit(data)

Approach

Trains AssertionDL, a deep learning-based approach used to extract Assertion Status from extracted entities and text. Contains all the methods for training an AssertionDLModel. For pretrained models, please use AssertionDLModel and see the Models Hub for available models.
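For illustration only, a minimal training DataFrame for the examples below might look like the sketch that follows; the column names ("text", "label", "start", "end") are assumptions that simply mirror the setLabelCol, setStartCol and setEndCol calls used in the examples, and real training data would normally come from an annotated corpus (e.g. an ALAB export).

# Hypothetical toy training data; "start"/"end" are token indices of the target chunk
data = spark.createDataFrame([
    ["Patient with severe fever and sore throat", "present", 2, 3],
    ["Patient shows no stomach pain", "absent", 3, 4]
]).toDF("text", "label", "start", "end")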

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionDLApproach Scala API: AssertionDLApproach
Show Example

from johnsnowlabs import * 
# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk")
token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Define AssertionDLApproach with parameters and start training
assertionStatus = medical.AssertionDLApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setBatchSize(128) \
    .setDropout(0.012) \
    .setLearningRate(0.015) \
    .setEpochs(1) \
    .setStartCol("start") \
    .setEndCol("end") \
    .setMaxSentLen(250)

trainingPipeline = Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    assertionStatus
])

assertionModel = trainingPipeline.fit(data)
assertionResults = assertionModel.transform(data).cache()
from johnsnowlabs import * 
# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk")
token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Define AssertionDLApproach with parameters and start training
assertionStatus = finance.AssertionDLApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setBatchSize(128) \
    .setDropout(0.012) \
    .setLearningRate(0.015) \
    .setEpochs(1) \
    .setStartCol("start") \
    .setEndCol("end") \
    .setMaxSentLen(250)

trainingPipeline = Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    assertionStatus
])

assertionModel = trainingPipeline.fit(data)
assertionResults = assertionModel.transform(data).cache()
from johnsnowlabs import * 
# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler()\
    .setInputCol("sentence")\
    .setOutputCol("document")
chunk = nlp.Doc2Chunk()\
    .setInputCols("document")\
    .setOutputCol("doc_chunk")
token = nlp.Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')
roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setMaxSentenceLength(512)

# Define AssertionDLApproach with parameters and start training
assertionStatus = legal.AssertionDLApproach()\
    .setLabelCol("assertion_label")\
    .setInputCols("document", "doc_chunk", "embeddings")\
    .setOutputCol("assertion")\
    .setBatchSize(128)\
    .setLearningRate(0.001)\
    .setEpochs(2)\
    .setStartCol("tkn_start")\
    .setEndCol("tkn_end")\
    .setMaxSentLen(1200)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('training_logs/')\
    .setGraphFolder(graph_folder)\
    .setGraphFile(f"{graph_folder}/assertion_graph.pb")\
    .setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
    .setScopeWindow(scope_window)
    #.setValidationSplit(0.2)\    
    #.setDropout(0.1)\    

trainingPipeline = Pipeline().setStages([
    document,
    chunk,
    token,
    roberta_embeddings,
    assertionStatus
])

assertionModel = trainingPipeline.fit(data)
assertionResults = assertionModel.transform(data).cache()

from johnsnowlabs import * 

// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
val document = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val chunk = new nlp.Doc2Chunk()
  .setInputCols("document")
  .setOutputCol("chunk")
val token = new nlp.Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new medical.AssertionDLApproach()
  .setLabelCol("label")
  .setInputCols("document", "chunk", "embeddings")
  .setOutputCol("assertion")
  .setBatchSize(128)
  .setDropout(0.012f)
  .setLearningRate(0.015f)
  .setEpochs(1)
  .setStartCol("start")
  .setEndCol("end")
  .setMaxSentLen(250)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  embeddings,
  assertionStatus
))

val assertionModel = trainingPipeline.fit(data)
val assertionResults = assertionModel.transform(data).cache()
from johnsnowlabs import * 

// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
val document = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val chunk = new nlp.Doc2Chunk()
  .setInputCols("document")
  .setOutputCol("chunk")
val token = new nlp.Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new finance.AssertionDLApproach()
  .setLabelCol("label")
  .setInputCols("document", "chunk", "embeddings")
  .setOutputCol("assertion")
  .setBatchSize(128)
  .setDropout(0.012f)
  .setLearningRate(0.015f)
  .setEpochs(1)
  .setStartCol("start")
  .setEndCol("end")
  .setMaxSentLen(250)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  embeddings,
  assertionStatus
))

val assertionModel = trainingPipeline.fit(data)
val assertionResults = assertionModel.transform(data).cache()
from johnsnowlabs import * 

val document = new nlp.DocumentAssembler()
    .setInputCol("sentence")
    .setOutputCol("document")
val chunk = new nlp.Doc2Chunk()
    .setInputCols("document")
    .setOutputCol("doc_chunk")
    .setChunkCol("chunk")
    .setStartCol("tkn_start")
    .setStartColByTokenIndex(true)
    .setFailOnMissing(false)
    .setLowerCase(false)
val token = new nlp.Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")
val roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
    .setInputCols(Array("document", "token"))
    .setOutputCol("embeddings")
    .setMaxSentenceLength(512)

// Define AssertionDLApproach with parameters and start training
val assertionStatus = new legal.AssertionDLApproach()
    .setLabelCol("assertion_label")
    .setInputCols("document", "doc_chunk", "embeddings")
    .setOutputCol("assertion")
    .setBatchSize(128)
    .setLearningRate(0.001f)
    .setEpochs(2)
    .setStartCol("tkn_start")
    .setEndCol("tkn_end")
    .setMaxSentLen(1200)
    .setEnableOutputLogs(true)
    .setOutputLogsPath("training_logs/")
    .setGraphFolder(graph_folder)
    .setGraphFile(s"$graph_folder/assertion_graph.pb")
    .setTestDataset("test_data.parquet")
    .setScopeWindow(scope_window)
    // .setValidationSplit(0.2f)
    // .setDropout(0.1f)

val trainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  roberta_embeddings,
  assertionStatus
))

val assertionModel = trainingPipeline.fit(data)
val assertionResults = assertionModel.transform(data).cache()

AssertionFilterer

Model

Filters entities coming from ASSERTION type annotations and returns the CHUNKS. Filters can be set via a white list on the extracted chunks, on the assertions, or via a regular expression. The assertion white list is enabled by default; to use the chunk white list, criteria has to be set to "isin", and for regex, criteria has to be set to "regex".
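For example, to keep only specific chunk texts regardless of their assertion status, the chunk white list can be combined with the "isin" criteria (a minimal sketch, reusing the column names from the examples below):

chunkWhitelistFilterer = medical.AssertionFilterer() \
  .setInputCols(["sentence", "ner_chunk", "assertion"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["severe fever", "sore throat"])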

Input Annotator Types: DOCUMENT, CHUNK, ASSERTION

Output Annotator Type: CHUNK

Python API: AssertionFilterer Scala API: AssertionFilterer
Show Example

from johnsnowlabs import * 
# To see how the assertions are extracted, see the example for AssertionDLModel.
# Define an extra step where the assertions are filtered
assertionFilterer = medical.AssertionFilterer() \
  .setInputCols(["sentence","ner_chunk","assertion"]) \
  .setOutputCol("filtered") \
  .setCriteria("assertion") \
  .setWhiteList(["present"])

assertionPipeline = Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion,
  assertionFilterer
])

assertionModel = assertionPipeline.fit(data)
result = assertionModel.transform(data)

# Show results:

result.selectExpr("ner_chunk.result", "assertion.result").show(3, truncate=False)
+--------------------------------+--------------------------------+
|result                          |result                          |
+--------------------------------+--------------------------------+
|[severe fever, sore throat]     |[present, present]              |
|[stomach pain]                  |[absent]                        |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
result.select("filtered.result").show(3, truncate=False)
+---------------------------+
|result                     |
+---------------------------+
|[severe fever, sore throat]|
|[]                         |
|[an epidural, PCA]         |
+---------------------------+
from johnsnowlabs import * 
# To see how the assertions are extracted, see the example for AssertionDLModel.
# Define an extra step where the assertions are filtered
assertionFilterer = finance.AssertionFilterer() \
  .setInputCols(["sentence","ner_chunk","assertion"]) \
  .setOutputCol("filtered") \
  .setCriteria("assertion") \
  .setWhiteList(["present"])

assertionPipeline = Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion,
  assertionFilterer
])

assertionModel = assertionPipeline.fit(data)
result = assertionModel.transform(data)
from johnsnowlabs import * 
# To see how the assertions are extracted, see the example for AssertionDLModel.
# Define an extra step where the assertions are filtered
assertionFilterer = legal.AssertionFilterer() \
  .setInputCols(["sentence","ner_chunk","assertion"]) \
  .setOutputCol("filtered") \
  .setCriteria("assertion") \
  .setWhiteList(["present"])

assertionPipeline = Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion,
  assertionFilterer
])

assertionModel = assertionPipeline.fit(data)
result = assertionModel.transform(data)

from johnsnowlabs import * 

// To see how the assertions are extracted, see the example for
// [[com.johnsnowlabs.nlp.annotators.assertion.dl.AssertionDLModel AssertionDLModel]].
// Define an extra step where the assertions are filtered
val assertionFilterer = new medical.AssertionFilterer()
  .setInputCols("sentence","ner_chunk","assertion")
  .setOutputCol("filtered")
  .setCriteria("assertion")
  .setWhiteList("present")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion,
  assertionFilterer
))

val assertionModel = assertionPipeline.fit(data)
val result = assertionModel.transform(data)

// Show results:
//
// result.selectExpr("ner_chunk.result", "assertion.result").show(3, truncate=false)
// +--------------------------------+--------------------------------+
// |result                          |result                          |
// +--------------------------------+--------------------------------+
// |[severe fever, sore throat]     |[present, present]              |
// |[stomach pain]                  |[absent]                        |
// |[an epidural, PCA, pain control]|[present, present, hypothetical]|
// +--------------------------------+--------------------------------+
// result.select("filtered.result").show(3, truncate=false)
// +---------------------------+
// |result                     |
// +---------------------------+
// |[severe fever, sore throat]|
// |[]                         |
// |[an epidural, PCA]         |
// +---------------------------+
//
from johnsnowlabs import * 

// To see how the assertions are extracted, see the example for
// [[com.johnsnowlabs.nlp.annotators.assertion.dl.AssertionDLModel AssertionDLModel]].
// Define an extra step where the assertions are filtered
val assertionFilterer = new finance.AssertionFilterer()
  .setInputCols("sentence","ner_chunk","assertion")
  .setOutputCol("filtered")
  .setCriteria("assertion")
  .setWhiteList("present")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion,
  assertionFilterer
))

val assertionModel = assertionPipeline.fit(data)
val result = assertionModel.transform(data)
from johnsnowlabs import * 

// To see how the assertions are extracted, see the example for
// [[com.johnsnowlabs.nlp.annotators.assertion.dl.AssertionDLModel AssertionDLModel]].
// Define an extra step where the assertions are filtered
val assertionFilterer = new legal.AssertionFilterer()
  .setInputCols("sentence","ner_chunk","assertion")
  .setOutputCol("filtered")
  .setCriteria("assertion")
  .setWhiteList("present")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  nerModel,
  nerConverter,
  clinicalAssertion,
  assertionFilterer
))

val assertionModel = assertionPipeline.fit(data)
val result = assertionModel.transform(data)

AssertionLogReg

Model

This is the main class in the AssertionLogReg family. Logistic Regression is used to extract Assertion Status from extracted entities and text. AssertionLogRegModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained by, e.g., a DocumentAssembler, NerConverter and WordEmbeddingsModel. The result is an assertion status annotation for each recognized entity. Possible values are "Negated", "Affirmed" and "Historical".

Unlike the DL model, this class does not extend AnnotatorModel. Instead, it extends RawAnnotator, which is why the main point of interest is the transform() method.

At the moment there are no pretrained models available for this class. Please refer to AssertionLogRegApproach to train your own model.
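As a minimal usage sketch (assuming a pipeline containing AssertionLogRegApproach has been fitted as in the training examples below), the fitted PipelineModel is simply applied with transform():

# assertionPipeline and dataset refer to the objects from the training example below
assertionModel = assertionPipeline.fit(dataset)   # PipelineModel containing an AssertionLogRegModel stage
predictions = assertionModel.transform(dataset)
predictions.select("assertion.result").show(truncate=False)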

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionLogRegModel Scala API: AssertionLogRegModel

Approach

Trains a classification model using the Logistic Regression algorithm, which is used to extract Assertion Status from extracted entities and text. Contains all the methods for training an AssertionLogRegModel, together with trainWithChunk and trainWithStartEnd.

Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS

Output Annotator Type: ASSERTION

Python API: AssertionLogRegApproach Scala API: AssertionLogRegApproach
Show Example

from johnsnowlabs import *

# Training with Glove Embeddings

# First define pipeline stages to extract embeddings and text chunks

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.

assertion = medical.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

assertionPipeline = Pipeline(stages=[
    documentAssembler,
    chunkAssembler,
    tokenizer,
    glove,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import *

# Training with Glove Embeddings

# First define pipeline stages to extract embeddings and text chunks

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.

assertion = finance.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

assertionPipeline = Pipeline(stages=[
    documentAssembler,
    chunkAssembler,
    tokenizer,
    glove,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import *

# Training with Glove Embeddings

# First define pipeline stages to extract embeddings and text chunks

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("word_embeddings") \
    .setCaseSensitive(False)

chunkAssembler = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setChunkCol("target") \
    .setOutputCol("chunk")

# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.

assertion = legal.AssertionLogRegApproach() \
    .setLabelCol("label") \
    .setInputCols(["document", "chunk", "word_embeddings"]) \
    .setOutputCol("assertion") \
    .setReg(0.01) \
    .setBefore(11) \
    .setAfter(13) \
    .setStartCol("start") \
    .setEndCol("end")

assertionPipeline = Pipeline(stages=[
    documentAssembler,
    chunkAssembler,
    tokenizer,
    glove,
    assertion
])

assertionModel = assertionPipeline.fit(dataset)

from johnsnowlabs import * 

// Training with Glove Embeddings
// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new nlp.Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")
  .setCaseSensitive(false)

val chunkAssembler = new nlp.Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new medical.AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "word_embeddings"))
  .setOutputCol("assertion")
  .setReg(0.01)
  .setBefore(11)
  .setAfter(13)
  .setStartCol("start")
  .setEndCol("end")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  chunkAssembler,
  tokenizer,
  glove,
  assertion
))

val assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import * 

// Training with Glove Embeddings
// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new nlp.Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")
  .setCaseSensitive(false)

val chunkAssembler = new nlp.Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new finance.AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "word_embeddings"))
  .setOutputCol("assertion")
  .setReg(0.01)
  .setBefore(11)
  .setAfter(13)
  .setStartCol("start")
  .setEndCol("end")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  chunkAssembler,
  tokenizer,
  glove,
  assertion
))

val assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import * 

// Training with Glove Embeddings
// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new nlp.Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")
  .setCaseSensitive(false)

val chunkAssembler = new nlp.Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")

// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new legal.AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols(Array("document", "chunk", "word_embeddings"))
  .setOutputCol("assertion")
  .setReg(0.01)
  .setBefore(11)
  .setAfter(13)
  .setStartCol("start")
  .setEndCol("end")

val assertionPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  chunkAssembler,
  tokenizer,
  glove,
  assertion
))

val assertionModel = assertionPipeline.fit(dataset)

BertSentenceChunkEmbeddings

Model

This annotator allows aggregating sentence embeddings with NER chunk embeddings to obtain more specific and accurate resolution codes. It works by averaging the sentence and chunk embeddings to add contextual information to the embedding value. The input to this annotator is the context (sentence) and the NER chunks, and the output is an embedding for each chunk that can be fed to a resolver model.

The setChunkWeight parameter can be used to control the influence of surrounding context.
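Conceptually, the combination can be thought of as a weighted average of the chunk and sentence vectors; the sketch below is illustrative only and not the library's exact implementation:

# chunk_weight corresponds to setChunkWeight (e.g. 0.5 weighs chunk and sentence equally)
combined = [chunk_weight * c + (1 - chunk_weight) * s
            for c, s in zip(chunk_embedding, sentence_embedding)]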

For more information and examples of the BertSentenceChunkEmbeddings annotator, check the Spark NLP Workshop, in particular the notebook 24.1.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb.

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: BertSentenceChunkEmbeddings Scala API: BertSentenceChunkEmbeddings
Show Example

from johnsnowlabs import *

# Define the pipeline

document_assembler = nlp.DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

tokenizer = nlp.Tokenizer()\
      .setInputCols(["document"])\
      .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["document", "token"])\
      .setOutputCol("word_embeddings")

clinical_ner = medical.NerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \
      .setInputCols(["document", "token", "word_embeddings"]) \
      .setOutputCol("ner")

ner_converter = medical.NerConverterInternal() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")\
      .setWhiteList(['ABBR'])

sentence_chunk_embeddings = medical.BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
      .setInputCols(["document", "ner_chunk"])\
      .setOutputCol("sentence_embeddings")\
      .setChunkWeight(0.5)\
      .setCaseSensitive(True)

abbr_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models") \
      .setInputCols(["ner_chunk", "sentence_embeddings"]) \
      .setOutputCol("abbr_meaning")\
      .setDistanceFunction("EUCLIDEAN")
    

resolver_pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        sentence_chunk_embeddings,
        abbr_resolver
  ])


# Example results

sample_text = [
"""The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991.""",
"""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal. 
Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]

from pyspark.sql.types import StringType, IntegerType

df = spark.createDataFrame(sample_text, StringType()).toDF('text')
df.show(truncate = 100)

+----------------------------------------------------------------------------------------------------+
|                                                                                                text|
+----------------------------------------------------------------------------------------------------+
|The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSE...|
|Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA...|
+----------------------------------------------------------------------------------------------------+
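A minimal follow-up sketch, assuming the stages defined above: fit the pipeline on an empty DataFrame and transform the sample texts to see the resolved abbreviation meanings.

empty_df = spark.createDataFrame([[""]]).toDF("text")
resolver_model = resolver_pipeline.fit(empty_df)

abbr_result = resolver_model.transform(df)
abbr_result.selectExpr("ner_chunk.result as abbreviations", "abbr_meaning.result as meanings") \
    .show(truncate=False)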

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

 val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

 val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("tokens")

 val wordEmbeddings = BertEmbeddings
    .pretrained("biobert_pubmed_base_cased")
    .setInputCols(Array("sentence", "tokens"))
    .setOutputCol("word_embeddings")

 val nerModel = MedicalNerModel
    .pretrained("ner_clinical_biobert", "en", "clinical/models")
    .setInputCols(Array("sentence", "tokens", "word_embeddings"))
    .setOutputCol("ner")

  val nerConverter = new NerConverter()
    .setInputCols("sentence", "tokens", "ner")
    .setOutputCol("ner_chunk")

 val sentenceChunkEmbeddings = BertSentenceChunkEmbeddings
    .pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")
     .setInputCols(Array("sentence", "ner_chunk"))
     .setOutputCol("sentence_chunk_embeddings")

 val pipeline = new Pipeline()
      .setStages(Array(
          documentAssembler,
          sentenceDetector,
          tokenizer,
          wordEmbeddings,
          nerModel,
          nerConverter,
          sentenceChunkEmbeddings))

 val sampleText = "Her Diabetes has become type 2 in the last year with her Diabetes." +
    " He complains of swelling in his right forearm."

 val testDataset = Seq(sampleText).toDS.toDF("text")
 val result = pipeline.fit(testDataset).transform(testDataset)

 result
    .selectExpr("explode(sentence_chunk_embeddings) AS s")
    .selectExpr("s.result", "slice(s.embeddings, 1, 5) AS averageEmbedding")
    .show(truncate=false)

 +-----------------------------+-----------------------------------------------------------------+
 |                       result|                                                 averageEmbedding|
 +-----------------------------+-----------------------------------------------------------------+
 |Her Diabetes                 |[-0.31995273, -0.04710883, -0.28973156, -0.1294758, 0.12481072]  |
 |type 2                       |[-0.027161136, -0.24613449, -0.0949309, 0.1825444, -0.2252143]   |
 |her Diabetes                 |[-0.31995273, -0.04710883, -0.28973156, -0.1294758, 0.12481072]  |
 |swelling in his right forearm|[-0.45139068, 0.12400375, -0.0075617577, -0.90806055, 0.12871636]|
 +-----------------------------+-----------------------------------------------------------------+

Chunk2Token

Model

A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).

When the input is empty, an empty array is returned.

This annotator is especially convenient when using NGramGenerator annotations as inputs to WordEmbeddingsModel.

Input Annotator Types: CHUNK

Output Annotator Type: TOKEN

Python API: Chunk2Token Scala API: Chunk2Token
Show Example

from johnsnowlabs import * 
# Define a pipeline for generating n-grams
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
ngrammer = nlp.NGramGenerator() \
 .setN(2) \
 .setEnableCumulative(False) \
 .setInputCols(["token"]) \
 .setOutputCol("ngrams") \
 .setDelimiter("_")

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = medical.Chunk2Token().setInputCols(["ngrams"]).setOutputCol("ngram_tokens")
trainingPipeline = Pipeline(stages=[document, sentenceDetector, token, ngrammer, chunk2Token]).fit(data)

result = trainingPipeline.transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, False)
    +----------------------------------------------------------------+
    |col                                                             |
    +----------------------------------------------------------------+
    |{token, 3, 15, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
    |{token, 5, 19, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
    |{token, 17, 28, man_presents, {sentence -> 0, chunk -> 2}, []}  |
    |{token, 21, 31, presents_to, {sentence -> 0, chunk -> 3}, []}   |
    |{token, 30, 35, to_the, {sentence -> 0, chunk -> 4}, []}        |
    +----------------------------------------------------------------+
from johnsnowlabs import * 
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
ngrammer = nlp.NGramGenerator() \
 .setN(2) \
 .setEnableCumulative(False) \
 .setInputCols(["token"]) \
 .setOutputCol("ngrams") \
 .setDelimiter("_")

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = finance.Chunk2Token().setInputCols(["ngrams"]).setOutputCol("ngram_tokens")
trainingPipeline = Pipeline(stages=[document, sentenceDetector, token, ngrammer, chunk2Token])
from johnsnowlabs import * 
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
ngrammer = nlp.NGramGenerator() \
 .setN(2) \
 .setEnableCumulative(False) \
 .setInputCols(["token"]) \
 .setOutputCol("ngrams") \
 .setDelimiter("_")

# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = legal.Chunk2Token().setInputCols(["ngrams"]).setOutputCol("ngram_tokens")
trainingPipeline = Pipeline(stages=[document, sentenceDetector, token, ngrammer, chunk2Token])

import spark.implicits._
// Define a pipeline for generating n-grams
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val token = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val ngrammer = new nlp.NGramGenerator()
 .setN(2)
 .setEnableCumulative(false)
 .setInputCols("token")
 .setOutputCol("ngrams")
 .setDelimiter("_")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new medical.Chunk2Token().setInputCols("ngrams").setOutputCol("ngram_tokens")
val trainingPipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, ngrammer, chunk2Token)).fit(data)

val result = trainingPipeline.transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, false)
+----------------------------------------------------------------+
|col                                                             |
+----------------------------------------------------------------+
|{token, 3, 15, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
|{token, 5, 19, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 17, 28, man_presents, {sentence -> 0, chunk -> 2}, []}  |
|{token, 21, 31, presents_to, {sentence -> 0, chunk -> 3}, []}   |
|{token, 30, 35, to_the, {sentence -> 0, chunk -> 4}, []}        |
+----------------------------------------------------------------+
// Define a pipeline for generating n-grams

val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val token = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val ngrammer = new nlp.NGramGenerator()
 .setN(2)
 .setEnableCumulative(false)
 .setInputCols("token")
 .setOutputCol("ngrams")
 .setDelimiter("_")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new finance.Chunk2Token().setInputCols("ngrams").setOutputCol("ngram_tokens")
val trainingPipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, ngrammer, chunk2Token))
// Define a pipeline for generating n-grams

val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val token = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val ngrammer = new nlp.NGramGenerator()
 .setN(2)
 .setEnableCumulative(false)
 .setInputCols("token")
 .setOutputCol("ngrams")
 .setDelimiter("_")

// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new legal.Chunk2Token().setInputCols("ngrams").setOutputCol("ngram_tokens")
val trainingPipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, ngrammer, chunk2Token))

ChunkConverter

Model

Converts chunks from the RegexMatcher to chunks with an entity in the metadata.

This annotator is important when the user wants to merge entities identified by NER models with the rule-based matches produced by the RegexMatcher annotator. In subsequent pipeline steps, all the identified entities can then be treated in a unified field.

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkConverter Scala API: ChunkConverter
Show Example

# Creating the pipeline

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models") \
    .setInputCols("sentence","token","embeddings") \
    .setOutputCol("ner")

ner_converter= NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\

regex_matcher = RegexMatcher()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='file:/dbfs/regex_rules.txt', delimiter=',')

chunkConverter = ChunkConverter()\
    .setInputCols("regex_matches")\
    .setOutputCol("regex_chunk")

merger= ChunkMergeApproach()\
    .setInputCols(["regex_chunk", "ner_chunk"])\
    .setOutputCol("merged_chunks")\
    .setMergeOverlapping(True)\
    .setChunkPrecedence("field")

pipeline= Pipeline(stages=[
                           documentAssembler,
                           sentenceDetector,
                           tokenizer,
                           word_embeddings,
                           ner_model,
                           ner_converter,
                           regex_matcher,
                           chunkConverter,
                           merger
])

empty_df= spark.createDataFrame([[""]]).toDF("text")
model= pipeline.fit(empty_df)

lp_model = LightPipeline(model)
results = lp_model.fullAnnotate(sample_text)[0]  # sample_text: a clinical note, not shown here

# Displaying the results

import pandas as pd

chunk = []
merge = []
for result in list(results["merged_chunks"]):
  merge.append(result.metadata["entity"])
  chunk.append(result.result)
df_merge = pd.DataFrame({"chunk": chunk,  "merged_entity": merge})
df_merge

|                                          chunk |  merged_entity |
|-----------------------------------------------:|---------------:|
|                       POSTOPERATIVE DIAGNOSIS: | SECTION_HEADER |
|                       Cervical lymphadenopathy |        PROBLEM |
|                                     PROCEDURE: | SECTION_HEADER |
| Excisional biopsy of right cervical lymph node |           TEST |
|                                    ANESTHESIA: | SECTION_HEADER |
|                General endotracheal anesthesia |      TREATMENT |
|                      Right cervical lymph node |        PROBLEM |
|                                           EBL: | SECTION_HEADER |
|                                 COMPLICATIONS: | SECTION_HEADER |
|                                      FINDINGS: | SECTION_HEADER |
|                    Enlarged level 2 lymph node |        PROBLEM |
| ...                                            |                |
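
The external rules file referenced above (regex_rules.txt) is not shown. A minimal, purely illustrative version using the comma delimiter set in setExternalRules could look like the line below; both the pattern and the SECTION_HEADER identifier are assumptions chosen to match the output table, not the actual file:

[A-Z][A-Z ]+:,SECTION_HEADER

ChunkConverter then copies the rule identifier into the entity field of the chunk metadata, so the regex matches can be merged with NER chunks downstream.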

val sampleDataset = ResourceHelper.spark.createDataFrame(Seq(
 (1, "My first sentence with the first rule. This is my second sentence with ceremonies rule.")
)).toDF("id", "text")

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")

val regexMatcher = new RegexMatcher()
 .setExternalRules(ExternalResource("src/test/resources/regex-matcher/rules.txt", ReadAs.TEXT, Map("delimiter" -> ",")))
 .setInputCols(Array("sentence"))
 .setOutputCol("regex")
 .setStrategy("MATCH_ALL")

val chunkConverter = new ChunkConverter().setInputCols("regex").setOutputCol("chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher,chunkConverter))

val results = pipeline.fit(sampleDataset).transform(sampleDataset)
results.select("chunk").show(truncate = false)
+------------------------------------------------------------------------------------------------+
|col                                                                                             |
+------------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> NAME, sentence -> 0, chunk -> 0, entity -> NAME], []] |
|[chunk, 71, 80, ceremonies, [identifier -> NAME, sentence -> 1, chunk -> 0, entity -> NAME], []]|
+------------------------------------------------------------------------------------------------+

ChunkEntityResolver

Model

Returns a normalized entity for a particular trained ontology / curated dataset (e.g. ICD-10, RxNorm, SNOMED etc).

For available pretrained models please see the Models Hub.

Input Annotator Types: TOKEN, WORD_EMBEDDINGS

Output Annotator Type: ENTITY

Scala API: ChunkEntityResolverModel
Show Example

from johnsnowlabs import *
# Using pretrained models for SNOMED
# First the prior steps of the pipeline are defined.
# Output of types TOKEN and WORD_EMBEDDINGS are needed.
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("word_embeddings")
icdo_ner = medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "word_embeddings"]) \
    .setOutputCol("icdo_ner")
icdo_chunk = nlp.NerConverter().setInputCols(["sentence","token","icdo_ner"]).setOutputCol("icdo_chunk").setWhiteList(["Cancer"])
icdo_chunk_embeddings = nlp.ChunkEmbeddings() \
    .setInputCols(["icdo_chunk", "word_embeddings"]) \
    .setOutputCol("icdo_chunk_embeddings")
icdo_chunk_resolver = medical.ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical", "en", "clinical/models") \
    .setInputCols(["token","icdo_chunk_embeddings"]) \
    .setOutputCol("tm_icdo_code")
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "word_embeddings"]) \
  .setOutputCol("ner")
ner_converter = nlp.NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk")
ner_chunk_tokenizer = nlp.ChunkTokenizer() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("ner_token")
ner_chunk_embeddings = nlp.ChunkEmbeddings() \
    .setInputCols(["ner_chunk", "word_embeddings"]) \
    .setOutputCol("ner_chunk_embeddings")

# Definition of the SNOMED Resolution
ner_snomed_resolver = medical.ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models") \
      .setInputCols(["ner_token","ner_chunk_embeddings"]).setOutputCol("snomed_result")
pipelineFull = Pipeline().setStages([
      docAssembler,
      sentenceDetector,
      tokenizer,
      word_embeddings,

      clinical_ner,
      ner_converter,
      ner_chunk_embeddings,
      ner_chunk_tokenizer,
      ner_snomed_resolver,

      icdo_ner,
      icdo_chunk,
      icdo_chunk_embeddings,
      icdo_chunk_resolver
])
pipelineModelFull = pipelineFull.fit(data)
result = pipelineModelFull.transform(data).cache()

# Show results
result.selectExpr("explode(snomed_result)")
  .selectExpr(
    "col.metadata.target_text",
    "col.metadata.resolved_text",
    "col.metadata.confidence",
    "col.metadata.all_k_results",
    "col.metadata.all_k_resolutions")
  .filter($"confidence" > 0.2).show(5)
+--------------------+--------------------+----------+--------------------+--------------------+
|         target_text|       resolved_text|confidence|       all_k_results|   all_k_resolutions|
+--------------------+--------------------+----------+--------------------+--------------------+
|hypercholesterolemia|Hypercholesterolemia|    0.2524|13644009:::267432...|Hypercholesterole...|
|                 CBC|             Neocyte|    0.4980|259680000:::11573...|Neocyte:::Blood g...|
|                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
|           platelets| Increased platelets|    0.5267|6631009:::2596800...|Increased platele...|
|                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
+--------------------+--------------------+----------+--------------------+--------------------+

import spark.implicits._
// Using pretrained models for SNOMED
// First the prior steps of the pipeline are defined.
// Output of types TOKEN and WORD_EMBEDDINGS are needed.
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("word_embeddings")
val icdo_ner = medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "word_embeddings"))
  .setOutputCol("icdo_ner")
val icdo_chunk = new nlp.NerConverter().setInputCols(Array("sentence","token","icdo_ner")).setOutputCol("icdo_chunk").setWhiteList("Cancer")
val icdo_chunk_embeddings = new nlp.ChunkEmbeddings()
  .setInputCols(Array("icdo_chunk", "word_embeddings"))
  .setOutputCol("icdo_chunk_embeddings")
val icdo_chunk_resolver = medical.ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical", "en", "clinical/models")
  .setInputCols(Array("token","icdo_chunk_embeddings"))
  .setOutputCol("tm_icdo_code")
val clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "word_embeddings"))
.setOutputCol("ner")
val ner_converter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val ner_chunk_tokenizer = new nlp.ChunkTokenizer()
  .setInputCols("ner_chunk")
  .setOutputCol("ner_token")
val ner_chunk_embeddings = new nlp.ChunkEmbeddings()
  .setInputCols(Array("ner_chunk", "word_embeddings"))
  .setOutputCol("ner_chunk_embeddings")

// Definition of the SNOMED Resolution
val ner_snomed_resolver = medical.ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")
    .setInputCols(Array("ner_token","ner_chunk_embeddings")).setOutputCol("snomed_result")
val pipelineFull = new Pipeline().setStages(Array(
    docAssembler,
    sentenceDetector,
    tokenizer,
    word_embeddings,

    clinical_ner,
    ner_converter,
    ner_chunk_embeddings,
    ner_chunk_tokenizer,
    ner_snomed_resolver,

    icdo_ner,
    icdo_chunk,
    icdo_chunk_embeddings,
    icdo_chunk_resolver
))
val pipelineModelFull = pipelineFull.fit(data)
val result = pipelineModelFull.transform(data).cache()

// Show results
//
// result.selectExpr("explode(snomed_result)")
//   .selectExpr(
//     "col.metadata.target_text",
//     "col.metadata.resolved_text",
//     "col.metadata.confidence",
//     "col.metadata.all_k_results",
//     "col.metadata.all_k_resolutions")
//   .filter($"confidence" > 0.2).show(5)
// +--------------------+--------------------+----------+--------------------+--------------------+
// |         target_text|       resolved_text|confidence|       all_k_results|   all_k_resolutions|
// +--------------------+--------------------+----------+--------------------+--------------------+
// |hypercholesterolemia|Hypercholesterolemia|    0.2524|13644009:::267432...|Hypercholesterole...|
// |                 CBC|             Neocyte|    0.4980|259680000:::11573...|Neocyte:::Blood g...|
// |                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
// |           platelets| Increased platelets|    0.5267|6631009:::2596800...|Increased platele...|
// |                CD38|       Hypoviscosity|    0.2560|47872005:::370970...|Hypoviscosity:::E...|
// +--------------------+--------------------+----------+--------------------+--------------------+
//

Approach

Contains all the parameters and methods to train a ChunkEntityResolverModel. It transforms a dataset with two Input Annotations of types TOKEN and WORD_EMBEDDINGS, coming from e.g. ChunkTokenizer and ChunkEmbeddings Annotators, and returns the normalized entity for a particular trained ontology / curated dataset (e.g. ICD-10, RxNorm, SNOMED etc.).

To use pretrained models please use ChunkEntityResolverModel and see the Models Hub for available models.

Input Annotator Types: TOKEN, WORD_EMBEDDINGS

Output Annotator Type: ENTITY

Scala API: ChunkEntityResolverApproach
Show Example

from johnsnowlabs import *
# Training a SNOMED model
# Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data
# and their labels.
document = nlp.DocumentAssembler() \
    .setInputCol("normalized_text") \
    .setOutputCol("document")

chunk = nlp.Doc2Chunk() \
    .setInputCols(["document"]) \
    .setOutputCol("chunk")

token = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

chunkEmb = nlp.ChunkEmbeddings() \
        .setInputCols(["chunk", "embeddings"]) \
        .setOutputCol("chunk_embeddings")

snomedTrainingPipeline = Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    chunkEmb
])

snomedTrainingModel = snomedTrainingPipeline.fit(data)

snomedData = snomedTrainingModel.transform(data).cache()

# Then the Resolver can be trained with
snomedExtractor = medical.ChunkEntityResolverApproach() \
    .setInputCols(["token", "chunk_embeddings"]) \
    .setOutputCol("recognized") \
    .setNeighbours(1000) \
    .setAlternatives(25) \
    .setNormalizedCol("normalized_text") \
    .setLabelCol("label") \
    .setEnableWmd(True).setEnableTfidf(True).setEnableJaccard(True) \
    .setEnableSorensenDice(True).setEnableJaroWinkler(True).setEnableLevenshtein(True) \
    .setDistanceWeights([1, 2, 2, 1, 1, 1]) \
    .setAllDistancesMetadata(True) \
    .setPoolingStrategy("MAX") \
    .setThreshold(1e32)
model = snomedExtractor.fit(snomedData)
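
The training DataFrame (data) used above is not shown. A minimal hypothetical layout matching the columns expected by setNormalizedCol and setLabelCol could be built as follows; the terms and codes are placeholders for illustration only:

sample_training = spark.createDataFrame(
    [("gastroenteritis", "CODE_1"),
     ("stomach pain", "CODE_2")],
    ["normalized_text", "label"])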

// Training a SNOMED model
// Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data
// and their labels.
val document = new nlp.DocumentAssembler()
  .setInputCol("normalized_text")
  .setOutputCol("document")

val chunk = new nlp.Doc2Chunk()
  .setInputCols("document")
  .setOutputCol("chunk")

val token = new nlp.Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("embeddings")

val chunkEmb = new nlp.ChunkEmbeddings()
      .setInputCols(Array("chunk", "embeddings"))
      .setOutputCol("chunk_embeddings")

val snomedTrainingPipeline = new Pipeline().setStages(Array(
  document,
  chunk,
  token,
  embeddings,
  chunkEmb
))

val snomedTrainingModel = snomedTrainingPipeline.fit(data)

val snomedData = snomedTrainingModel.transform(data).cache()

// Then the Resolver can be trained with
val snomedExtractor = new medical.ChunkEntityResolverApproach()
  .setInputCols(Array("token", "chunk_embeddings"))
  .setOutputCol("recognized")
  .setNeighbours(1000)
  .setAlternatives(25)
  .setNormalizedCol("normalized_text")
  .setLabelCol("label")
  .setEnableWmd(true).setEnableTfidf(true).setEnableJaccard(true)
  .setEnableSorensenDice(true).setEnableJaroWinkler(true).setEnableLevenshtein(true)
  .setDistanceWeights(Array(1, 2, 2, 1, 1, 1))
  .setAllDistancesMetadata(true)
  .setPoolingStrategy("MAX")
  .setThreshold(1e32)
val model = snomedExtractor.fit(snomedData)

ChunkFilterer

Model

Filters entities coming from CHUNK annotations. Filter criteria can be set via a white list of terms or a regular expression. The white list criterion is enabled by default; to use regular expressions instead, the criteria parameter has to be set to regex (a short regex sketch follows the examples below).

Input Annotator Types: DOCUMENT,CHUNK

Output Annotator Type: CHUNK

Python API: ChunkFilterer Scala API: ChunkFilterer
Show Example

from johnsnowlabs import *

# Filtering POS tags

# First pipeline stages to extract the POS tags are defined

data = spark.createDataFrame([["Has a past history of gastroenteritis and stomach pain, however patient ..."]]).toDF("text")
docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = medical.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["gastroenteritis"])

pipeline = Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+---------------------------------------------------------------------------------+
|col                                                                              |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
+---------------------------------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
from johnsnowlabs import *
# Filtering POS tags
# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = finance.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["gastroenteritis"])

pipeline = Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

result = pipeline.fit(data).transform(data)
from johnsnowlabs import *

# Filtering POS tags

# First pipeline stages to extract the POS tags are defined

docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

posTagger = nlp.PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("pos")

chunker = nlp.Chunker() \
  .setInputCols(["pos", "sentence"]) \
  .setOutputCol("chunk") \
  .setRegexParsers(["(<NN>)+"])

# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = legal.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("isin") \
  .setWhiteList(["gastroenteritis"])

pipeline = Pipeline(stages=[
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter])

result = pipeline.fit(data).transform(data)

import spark.implicits._
// Filtering POS tags
// First pipeline stages to extract the POS tags are defined
val data = Seq("Has a past history of gastroenteritis and stomach pain, however patient ...").toDF("text")
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")

val posTagger = nlp.PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val chunker = new nlp.Chunker()
  .setInputCols(Array("pos", "sentence"))
  .setOutputCol("chunk")
  .setRegexParsers(Array("(<NN>)+"))

// Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
val chunkerFilter = new medical.ChunkFilterer()
  .setInputCols(Array("sentence","chunk"))
  .setOutputCol("filtered")
  .setCriteria("isin")
  .setWhiteList("gastroenteritis")

val pipeline = new Pipeline().setStages(Array(
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter))

val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate = false)
+---------------------------------------------------------------------------------+
|col                                                                              |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []}                        |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}                |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []}                   |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []}                        |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}              |
+---------------------------------------------------------------------------------+

result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []}  |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
// Same pipeline using the Finance NLP library (finance.ChunkFilterer)

val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")

val posTagger = nlp.PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val chunker = new nlp.Chunker()
  .setInputCols(Array("pos", "sentence"))
  .setOutputCol("chunk")
  .setRegexParsers(Array("(<NN>)+"))

// Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
val chunkerFilter = new finance.ChunkFilterer()
  .setInputCols(Array("sentence","chunk"))
  .setOutputCol("filtered")
  .setCriteria("isin")
  .setWhiteList("gastroenteritis")

val pipeline = new Pipeline().setStages(Array(
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter))
// Same pipeline using the Legal NLP library (legal.ChunkFilterer)

val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")

val posTagger = nlp.PerceptronModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

val chunker = new nlp.Chunker()
  .setInputCols(Array("pos", "sentence"))
  .setOutputCol("chunk")
  .setRegexParsers(Array("(<NN>)+"))

// Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
val chunkerFilter = new legal.ChunkFilterer()
  .setInputCols(Array("sentence","chunk"))
  .setOutputCol("filtered")
  .setCriteria("isin")
  .setWhiteList("gastroenteritis")

val pipeline = new Pipeline().setStages(Array(
  docAssembler,
  sentenceDetector,
  tokenizer,
  posTagger,
  chunker,
  chunkerFilter))
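
The examples above all use the default white list criterion. To filter with regular expressions instead, the criteria has to be set to regex and the patterns provided with setRegex. A minimal sketch of that variant, assuming the same pipeline as in the medical example above (the pattern itself is illustrative):

chunkerFilterRegex = medical.ChunkFilterer() \
  .setInputCols(["sentence","chunk"]) \
  .setOutputCol("filtered") \
  .setCriteria("regex") \
  .setRegex(["gastro.*"])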

ChunkKeyPhraseExtraction

Model

Chunk KeyPhrase Extraction uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text. The input to the model consists of chunk annotations and a sentence or document annotation. The model compares the chunks against the corresponding sentences/documents and selects the chunks which are most representative of the broader text context (i.e. the document or the sentence they belong to). The key phrase candidates (i.e. the input chunks) can be generated in various ways, e.g. by NGramGenerator, TextMatcher or NerConverter. The model operates either at sentence level (selecting the most descriptive chunks from the sentence they belong to) or at document level. In the latter case, the key phrases are selected to represent all the input document annotations.

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at the Models Hub.

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkKeyPhraseExtraction Scala API: ChunkKeyPhraseExtraction
Show Example

from johnsnowlabs import *

documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \

embeddings = nlp.WordEmbeddingsModel() \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_tagger = medical.NerModel() \
    .pretrained("ner_jsl_slim", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens", "embeddings"]) \
    .setOutputCol("ner_tags")

ner_converter = nlp.NerConverter()\
    .setInputCols("sentences", "tokens", "ner_tags")\
    .setOutputCol("ner_chunks")

key_phrase_extractor = medical.ChunkKeyPhraseExtraction\
    .pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ner_chunks"])\
    .setOutputCol("ner_chunk_key_phrases")

pipeline = sparknlp.base.Pipeline() \
    .setStages([documenter, sentencer, tokenizer, embeddings, ner_tagger, ner_converter, key_phrase_extractor])

data = spark.createDataFrame([["Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."]]).toDF("text")
results = pipeline.fit(data).transform(data)
results\
    .selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
    .selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.entity",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore")\
    .show(truncate=False)

+-----------------------------+------------------+-------------------+
|result                       |DocumentSimilarity|MMRScore           |
+-----------------------------+------------------+-------------------+
|gestational diabetes mellitus|0.7391447825527298|0.44348688715422274|
|28-year-old                  |0.4366776288430703|0.13577881610104517|
|type two diabetes mellitus   |0.7323921930094919|0.085800103824974  |
+-----------------------------+------------------+-------------------+
from johnsnowlabs import *

documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens")

embeddings = nlp.WordEmbeddingsModel() \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols("sentences", "tokens", "ner")\
    .setOutputCol("ner_chunks")

key_phrase_extractor = finance.ChunkKeyPhraseExtraction\
    .pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ner_chunks"])\
    .setOutputCol("ner_chunk_key_phrases")

pipeline = sparknlp.base.Pipeline() \
    .setStages([documenter, sentencer, tokenizer, embeddings, ner_model, ner_converter, key_phrase_extractor])
from johnsnowlabs import *

documenter = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentencer = nlp.SentenceDetector() \
    .setInputCols(["document"])\
    .setOutputCol("sentences")

tokenizer = nlp.Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("tokens") \

embeddings = nlp.WordEmbeddingsModel() \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["document", "tokens"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
    .setInputCols("sentences", "tokens", "ner")\
    .setOutputCol("ner_chunks")

key_phrase_extractor = legal.ChunkKeyPhraseExtraction\
    .pretrained()\
    .setTopN(1)\
    .setDocumentLevelProcessing(False)\
    .setDivergence(0.4)\
    .setInputCols(["sentences", "ner_chunks"])\
    .setOutputCol("ner_chunk_key_phrases")

pipeline = sparknlp.base.Pipeline() \
    .setStages([documenter, sentencer, tokenizer, embeddings, ner_model, ner_converter, key_phrase_extractor])

import spark.implicits._

val documentAssembler = new nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new nlp.Tokenizer()
    .setInputCols("document")
    .setOutputCol("tokens")

val stopWordsCleaner = nlp.StopWordsCleaner.pretrained()
    .setInputCols("tokens")
    .setOutputCol("clean_tokens")
    .setCaseSensitive(false)

val nGrams = new nlp.NGramGenerator()
    .setInputCols(Array("clean_tokens"))
    .setOutputCol("ngrams")
    .setN(3)


val chunkKeyPhraseExtractor = medical.ChunkKeyPhraseExtraction
    .pretrained()
    .setTopN(2)
    .setDivergence(0.7f)
    .setInputCols(Array("document", "ngrams"))
    .setOutputCol("key_phrases")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    tokenizer,
    stopWordsCleaner,
    nGrams,
    chunkKeyPhraseExtractor))

val sampleText = "Her Diabetes has become type 2 in the last year with her Diabetes." +
    " He complains of swelling in his right forearm."

val emptyDataset = Seq("").toDS.toDF("text")
val testDataset = Seq(sampleText).toDS.toDF("text")
val result = pipeline.fit(emptyDataset).transform(testDataset)

result
    .selectExpr("explode(key_phrases) AS key_phrase")
    .selectExpr(
        "key_phrase.result",
        "key_phrase.metadata.DocumentSimilarity",
        "key_phrase.metadata.MMRScore")
    .show(truncate=false)

+--------------------------+-------------------+------------------+
|result                    |DocumentSimilarity |MMRScore          |
+--------------------------+-------------------+------------------+
|complains swelling forearm|0.6325718954229369 |0.1897715761677257|
|type 2 year               |0.40181028931546364|-0.189501077108947|
+--------------------------+-------------------+------------------+
// Same pipeline using the Finance NLP library (finance.ChunkKeyPhraseExtraction)

val documentAssembler = new nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new nlp.Tokenizer()
    .setInputCols("document")
    .setOutputCol("tokens")

val stopWordsCleaner = nlp.StopWordsCleaner.pretrained()
    .setInputCols("tokens")
    .setOutputCol("clean_tokens")
    .setCaseSensitive(false)

val nGrams = new nlp.NGramGenerator()
    .setInputCols(Array("clean_tokens"))
    .setOutputCol("ngrams")
    .setN(3)


val chunkKeyPhraseExtractor = finance.ChunkKeyPhraseExtraction
    .pretrained()
    .setTopN(2)
    .setDivergence(0.7f)
    .setInputCols(Array("document", "ngrams"))
    .setOutputCol("key_phrases")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    tokenizer,
    stopWordsCleaner,
    nGrams,
    chunkKeyPhraseExtractor))
// Same pipeline using the Legal NLP library (legal.ChunkKeyPhraseExtraction)

val documentAssembler = new nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new nlp.Tokenizer()
    .setInputCols("document")
    .setOutputCol("tokens")

val stopWordsCleaner = nlp.StopWordsCleaner.pretrained()
    .setInputCols("tokens")
    .setOutputCol("clean_tokens")
    .setCaseSensitive(false)

val nGrams = new nlp.NGramGenerator()
    .setInputCols(Array("clean_tokens"))
    .setOutputCol("ngrams")
    .setN(3)


val chunkKeyPhraseExtractor = legal.ChunkKeyPhraseExtraction
    .pretrained()
    .setTopN(2)
    .setDivergence(0.7f)
    .setInputCols(Array("document", "ngrams"))
    .setOutputCol("key_phrases")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    tokenizer,
    stopWordsCleaner,
    nGrams,
    chunkKeyPhraseExtractor))

ChunkMapper

Model

We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.

This is the AnnotatorModel of the ChunkMapper, which can be used to access pretrained models with the .pretrained() or .load() methods. To train a new model, check the documentation of the ChunkMapperApproach annotator.

The annotator also allows fuzzy matching, which can take into consideration parts of the tokens that can map even when the word order is different, char ngrams that can map even when there are typos, and a fuzzy distance metric (Jaccard, Levenshtein, etc.).

Example usage and more details can be found in the Spark NLP Workshop repository on GitHub, for example in the notebook Healthcare Chunk Mapping.

Input Annotator Types: CHUNK

Output Annotator Type: LABEL_DEPENDENCY

Python API: ChunkMapperModel Scala API: ChunkMapperModel
Show Example

# Use `rxnorm_mapper` pretrained model to map entities with their corresponding RxNorm codes.

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("ner_chunk")

chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("rxnorm")\
    .setRels(["rxnorm_code"])

mapper_pipeline = Pipeline().setStages([document_assembler, chunkerMapper])

empty_df = spark.createDataFrame([['']]).toDF('text')
mapper_model = mapper_pipeline.fit(empty_df)

mapper_lp = LightPipeline(mapper_model)
mapper_lp.fullAnnotate("metformin")

[{'ner_chunk': [Annotation(document, 0, 8, metformin, {})],
  'rxnorm': [Annotation(labeled_dependency, 0, 8, 6809, {'entity': 'metformin', 'relation': 'rxnorm_code', 'all_relations': ''})]}]

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("ner_chunk")

val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")
    .setInputCols("ner_chunk")
    .setOutputCol("rxnorm")
    .setRels(Array("rxnorm_code"))

val mapper_pipeline = new Pipeline().setStages(Array(document_assembler, chunkerMapper))

val empty_df = Seq("").toDF("text")
val mapper_model = mapper_pipeline.fit(empty_df)

val mapper_lp = new LightPipeline(mapper_model)
mapper_lp.fullAnnotate("metformin")

[{'ner_chunk': [Annotation(document, 0, 8, metformin, {})],
  'rxnorm': [Annotation(labeled_dependency, 0, 8, 6809, {'entity': 'metformin', 'relation': 'rxnorm_code', 'all_relations': ''})]}]

Approach

We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.

This is the AnnotatorApproach of the ChunkMapper, which can be used to train ChunkMapper models by providing a custom mapping dictionary. To use pretrained models, check the documentation of the ChunkMapperModel annotator.

The annotator also allows fuzzy matching, which can take into consideration parts of the tokens that can map even when the word order is different, char ngrams that can map even when there are typos, and a fuzzy distance metric (Jaccard, Levenshtein, etc.).

Example usage and more details can be found in the Spark NLP Workshop repository on GitHub, for example in the notebook Healthcare Chunk Mapping.

Input Annotator Types: CHUNK

Output Annotator Type: LABEL_DEPENDENCY

Python API: ChunkMapperApproach Scala API: ChunkMapperApproach
Show Example

# First, create a dictionary in JSON format following this schema:
import json

data_set= {
  "mappings": [
    {
      "key": "metformin",
      "relations": [
        {
          "key": "action",
          "values" : ["hypoglycemic", "Drugs Used In Diabetes"]
        },
        {
          "key": "treatment",
          "values" : ["diabetes", "t2dm"]
        }]
    }]
}

with open('sample_drug.json', 'w', encoding='utf-8') as f:
    json.dump(data_set, f, ensure_ascii=False, indent=4)


# Create a pipeline

document_assembler = DocumentAssembler()\
      .setInputCol('text')\
      .setOutputCol('document')

sentence_detector = SentenceDetector()\
      .setInputCols(["document"])\
      .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols("sentence")\
      .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
      .setInputCols(["sentence", "token"])\
      .setOutputCol("embeddings")

#NER model to detect drug in the text
clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")\
      .setInputCols(["sentence","token","embeddings"])\
      .setOutputCol("ner")\
      .setLabelCasing("upper")
 
ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")\
      .setWhiteList(["DRUG"])

chunkerMapper = ChunkMapperApproach()\
      .setInputCols(["ner_chunk"])\
      .setOutputCol("mappings")\
      .setDictionary("sample_drug.json")\
      .setRels(["action"]) #or treatment

pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
        chunkerMapper,
    ]
)


# Train the model

text = ["The patient was given 1 unit of metformin daily."]
test_data = spark.createDataFrame([text]).toDF("text")

model = pipeline.fit(test_data)
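
Once fitted, the trained mapper can be applied to the same kind of text. A short sketch of inference with the pipeline model trained above (the metadata field names follow the ChunkMapperModel output shown earlier):

result = model.transform(test_data)
result.selectExpr("explode(mappings) AS m") \
      .selectExpr("m.result", "m.metadata.relation") \
      .show(truncate=False)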

ChunkMapperFilterer

Model

ChunkMapperFilterer is an annotator to be used after ChunkMapper that allows filtering chunks based on the result of the mapping, i.e. whether it succeeded or failed (a counterpart sketch using the success criterion follows the example below).

Example usage and more details can be found in the Spark NLP Workshop repository on GitHub, for example in the notebook Healthcare Chunk Mapping.

Input Annotator Types: CHUNK, LABEL_DEPENDENCY

Output Annotator Type: CHUNK

Python API: ChunkMapperFilterer Scala API: ChunkMapperFilterer
Show Example

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence_detector = (
    SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
)

tokenizer = Tokenizer().setInputCols("sentence").setOutputCol("token")

word_embeddings = (
    WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(["sentence", "token"])
    .setOutputCol("embeddings")
)

ner_model = (
    MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")
    .setInputCols(["sentence", "token", "embeddings"])
    .setOutputCol("ner")
)

ner_converter = (
    NerConverter().setInputCols("sentence", "token", "ner").setOutputCol("chunk")
)

chunkerMapper = (
    ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")
    .setInputCols(["chunk"])
    .setOutputCol("RxNorm_Mapper")
    .setRel("rxnorm_code")
)

cfModel = (
    ChunkMapperFilterer()
    .setInputCols(["chunk", "RxNorm_Mapper"])
    .setOutputCol("chunks_fail")
    .setReturnCriteria("fail")
)

chunk2doc = Chunk2Doc().setInputCols("chunks_fail").setOutputCol("doc_chunk")

sbert_embedder = (
    BertSentenceEmbeddings.pretrained(
        "sbiobert_base_cased_mli", "en", "clinical/models"
    )
    .setInputCols(["doc_chunk"])
    .setOutputCol("sentence_embeddings")
    .setCaseSensitive(False)
)

resolver = (
    SentenceEntityResolverModel.pretrained(
        "sbiobertresolve_rxnorm_augmented", "en", "clinical/models"
    )
    .setInputCols(["chunks_fail", "sentence_embeddings"])
    .setOutputCol("resolver_code")
    .setDistanceFunction("EUCLIDEAN")
)

resolverMerger = (
    ResolverMerger()
    .setInputCols(["resolver_code", "RxNorm_Mapper"])
    .setOutputCol("RxNorm")
)

mapper_pipeline = Pipeline(
    stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_model,
        ner_converter,
        chunkerMapper,
        cfModel,
        chunk2doc,
        sbert_embedder,
        resolver,
        resolverMerger,
    ]
)

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = mapper_pipeline.fit(empty_data)


samples = [
    ["The patient was given Adapin 10 MG, coumadn 5 mg"],
    ["The patient was given Avandia 4 mg, Tegretol, zitiga"],
]

result = model.transform(spark.createDataFrame(samples).toDF("text"))

result.selectExpr(
    "chunk.result as chunk",
    "RxNorm_Mapper.result as RxNorm_Mapper",
    "chunks_fail.result as chunks_fail",
    "resolver_code.result as resolver_code",
    "RxNorm.result as RxNorm",
).show(truncate=False)
+--------------------------------+----------------------+--------------+-------------+------------------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |resolver_code|RxNorm                  |
+--------------------------------+----------------------+--------------+-------------+------------------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|[200883]     |[1000049, 200883]       |
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |[220989]     |[261242, 203029, 220989]|
+--------------------------------+----------------------+--------------+-------------+------------------------+
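
Conversely, setting setReturnCriteria("success") keeps only the chunks the mapper resolved, which is useful when the downstream resolver should not see them again. A minimal counterpart to the cfModel stage above (a sketch, not part of the original pipeline):

cfSuccess = ChunkMapperFilterer() \
    .setInputCols(["chunk", "RxNorm_Mapper"]) \
    .setOutputCol("chunks_success") \
    .setReturnCriteria("success")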

ChunkMerge

Model

Merges entities coming from different CHUNK annotations.

Input Annotator Types: CHUNK, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkMergeModel Scala API: ChunkMergeModel

Approach

Merges two chunk columns coming from two annotators (NER, ContextualParser or any other annotator producing chunks). The merger of the two chunk columns is made by selecting one chunk from one of the columns according to certain criteria. The decision on which chunk to select is made according to the chunk indices in the source document (chunks with longer lengths and higher information content are kept from each source). Labels can be changed with setReplaceDictResource (a short sketch follows the first example below).

Input Annotator Types: CHUNK, CHUNK

Output Annotator Type: CHUNK

Python API: ChunkMergeApproach Scala API: ChunkMergeApproach
Show Example

from johnsnowlabs import *
# Define a pipeline with 2 different NER models with a ChunkMergeApproach at the end
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
pipeline = Pipeline(stages=[
 nlp.DocumentAssembler().setInputCol("text").setOutputCol("document"),
 nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence"),
 nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token"),
  nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(["sentence", "token"]).setOutputCol("embs"),
  medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embs"]).setOutputCol("jsl_ner"),
 nlp.NerConverter().setInputCols(["sentence", "token", "jsl_ner"]).setOutputCol("jsl_ner_chunk"),
  medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embs"]).setOutputCol("bionlp_ner"),
 nlp.NerConverter().setInputCols(["sentence", "token", "bionlp_ner"]) \
    .setOutputCol("bionlp_ner_chunk"),
 medical.ChunkMergeApproach().setInputCols(["jsl_ner_chunk", "bionlp_ner_chunk"]).setOutputCol("merged_chunk")
])

# Show results
result = pipeline.fit(data).transform(data).cache()
result.selectExpr("explode(merged_chunk) as a") \
  .selectExpr("a.begin","a.end","a.result as chunk","a.metadata.entity as entity") \
  .show(5, False)
+-----+---+-----------+---------+
|begin|end|chunk      |entity   |
+-----+---+-----------+---------+
|5    |15 |63-year-old|Age      |
|17   |19 |man        |Gender   |
|64   |72 |recurrent  |Modifier |
|98   |107|cellulitis |Diagnosis|
|110  |119|pneumonias |Diagnosis|
+-----+---+-----------+---------+
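
Besides setReplaceLabels on the converters (used in the finance and legal examples below), ChunkMergeApproach itself can remap labels through setReplaceDictResource, which reads a small CSV of original_label,new_label pairs. A hypothetical sketch; the file content and the option values are assumptions, not part of the example above:

# replace_dict.csv (illustrative):
#   Diagnosis,PROBLEM
merger = medical.ChunkMergeApproach() \
  .setInputCols(["jsl_ner_chunk", "bionlp_ner_chunk"]) \
  .setOutputCol("merged_chunk") \
  .setReplaceDictResource("replace_dict.csv", "TEXT", {"delimiter": ","})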
from johnsnowlabs import *

data = spark.createDataFrame([["Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"]]).toDF("text")

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert_embeddings")

fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter =  finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"

ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "bert_embeddings"]) \
    .setOutputCol("ner_finner") 
    #.setLabelCasing("upper")

ner_converter_finner = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") \
    .setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER

chunk_merge =  finance.ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge])

# Show results
from pyspark.sql import functions as F

result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result, 
                                     result.deid_merged_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON   |
|founder              |ROLE     |
|CEO                  |ROLE     |
|Amazon               |PARTY    |
+---------------------+---------+
from johnsnowlabs import *

data = spark.createDataFrame([["ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.

2THEMART.COM, INC.:                         I-ESCROW, INC.:

By:Dominic J. Magliarditi                By:Sanjay Bajaj Name: Dominic J. Magliarditi                Name: Sanjay Bajaj Title: President                            Title: VP Business Development Date: 6/21/99                               Date: 6/11/99 "]]).toDF("text")

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"})

ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_signers") 
    #.setLabelCasing("upper")

ner_converter_signers = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_signers"]) \
    .setOutputCol("ner_signer_chunk")

chunk_merge = legal.ChunkMergeApproach()\
    .setInputCols("ner_signer_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge])

# Show results
from pyspark.sql import functions as F

result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result, 
                                     result.deid_merged_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+-----------------------+--------------+
|chunk                  |ner_label     |
+-----------------------+--------------+
|ENTIRE AGREEMENT       |DOC           |
|INC                    |PARTY         |
|J. Magliarditi         |SIGNING_PERSON|
|Bajaj                  |SIGNING_PERSON|
|Dominic J. Magliarditi |SIGNING_PERSON|
|Sanjay Bajaj           |SIGNING_PERSON|
|President              |SIGNING_TITLE |
|VP Business Development|SIGNING_TITLE |
+-----------------------+--------------+

from johnsnowlabs import *

// Define a pipeline with 2 different NER models with a ChunkMergeApproach at the end
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val pipeline = new Pipeline().setStages(Array(
  new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document"),
  new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence"),
  new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token"),
  nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence","token")).setOutputCol("embs"),
  medical.NerModel.pretrained("ner_jsl", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embs")).setOutputCol("jsl_ner"),
  new nlp.NerConverter().setInputCols(Array("sentence", "token", "jsl_ner")).setOutputCol("jsl_ner_chunk"),
  medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embs")).setOutputCol("bionlp_ner"),
  new nlp.NerConverter().setInputCols(Array("sentence", "token", "bionlp_ner"))
    .setOutputCol("bionlp_ner_chunk"),
  new medical.ChunkMergeApproach().setInputCols(Array("jsl_ner_chunk", "bionlp_ner_chunk")).setOutputCol("merged_chunk")
))

// Show results
val result = pipeline.fit(data).transform(data).cache()
result.selectExpr("explode(merged_chunk) as a")
  .selectExpr("a.begin","a.end","a.result as chunk","a.metadata.entity as entity")
  .show(5, false)
+-----+---+-----------+---------+
|begin|end|chunk      |entity   |
+-----+---+-----------+---------+
|5    |15 |63-year-old|Age      |
|17   |19 |man        |Gender   |
|64   |72 |recurrent  |Modifier |
|98   |107|cellulitis |Diagnosis|
|110  |119|pneumonias |Diagnosis|
+-----+---+-----------+---------+
from johnsnowlabs import *

val data = Seq(("Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon")).toDF("text")

val documentAssembler = new nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new nlp.SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new nlp.Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("bert_embeddings")

val fin_ner = finance.NerModel.pretrained("finner_deid", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    //.setLabelCasing("upper")

val ner_converter = new finance.NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels(Map("ORG" -> "PARTY")) // Replace "ORG" entity with "PARTY"

val ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "bert_embeddings"))
    .setOutputCol("ner_finner")
    //.setLabelCasing("upper")

val ner_converter_finner = new nlp.NerConverter()
    .setInputCols(Array("sentence", "token", "ner_finner"))
    .setOutputCol("ner_finner_chunk")
    .setWhiteList(Array("ROLE")) // Only keep "ROLE" entities from this NER

val chunk_merge =  new finance.ChunkMergeApproach()
    .setInputCols(Array("ner_finner_chunk", "ner_chunk"))
    .setOutputCol("deid_merged_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge))

val model = nlpPipeline.fit(data)
from johnsnowlabs import *

val data = Seq(("""ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.

2THEMART.COM, INC.:                         I-ESCROW, INC.:

By:Dominic J. Magliarditi                By:Sanjay Bajaj Name: Dominic J. Magliarditi                Name: Sanjay Bajaj Title: President                            Title: VP Business Development Date: 6/21/99                               Date: 6/11/99 """)).toDF("text")

val documentAssembler = new nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new nlp.SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new nlp.Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner") 
    //.setLabelCasing("upper")

val ner_converter = new legal.NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels(Map("ALIAS" -> "PARTY"))

val ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner_signers")
    //.setLabelCasing("upper")

val ner_converter_signers = new nlp.NerConverter()
    .setInputCols(Array("sentence", "token", "ner_signers"))
    .setOutputCol("ner_signer_chunk")

val chunk_merge = new legal.ChunkMergeApproach()
    .setInputCols(Array("ner_signer_chunk", "ner_chunk"))
    .setOutputCol("deid_merged_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge))

val model = nlpPipeline.fit(data)

ChunkSentenceSplitter

Model

The ChunkSentenceSplitter annotator splits documents into chunks according to the separators provided in a CHUNK column. It is useful when you need to apply different models or analyses to different sections of your document (for example, to different headers, clauses, items, etc.). The separator chunks can be the output of, for example, a RegexMatcher or a NerModel.

For detailed usage of this annotator, visit this notebook from our Spark NLP Workshop.

Input Annotator Types: DOCUMENT, CHUNK

Output Annotator Type: DOCUMENT

Python API: ChunkSentenceSplitter Scala API: ChunkSentenceSplitter
Show Example

# Defining the pipeline

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

tokenClassifier = (
    MedicalBertForTokenClassifier.pretrained(
        "bert_token_classifier_ner_jsl_slim", "en", "clinical/models"
    )
    .setInputCols("token", "document")
    .setOutputCol("ner")
    .setCaseSensitive(True)
)

ner_converter = (
    NerConverter()
    .setInputCols(["document", "token", "ner"])
    .setOutputCol("ner_chunk")
    .setWhiteList(["Header"])
)

chunkSentenceSplitter = (
    ChunkSentenceSplitter()
    .setInputCols("document", "ner_chunk")
    .setOutputCol("paragraphs")
    .setGroupBySentences(False)
)

pipeline = Pipeline(
    stages=[
        documentAssembler,
        tokenizer,
        tokenClassifier,
        ner_converter,
        chunkSentenceSplitter,
    ]
)

empty_df = spark.createDataFrame([[""]]).toDF("text")
pipeline_model = pipeline.fit(empty_df)

sentences = [
    [
        """ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.
        PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.
        REVIEW OF SYSTEMS Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
    """
    ]
]

df = spark.createDataFrame(sentences).toDF("text")

paragraphs = pipeline_model.transform(df)
paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").show(truncate=80)
+--------------------------------------------------------------------------------+------+-------------------+
|                                                                          result|entity|     splitter_chunk|
+--------------------------------------------------------------------------------+------+-------------------+
|ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelio...|Header|ADMISSION DIAGNOSIS|
|PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma....|Header|PRINCIPAL DIAGNOSIS|
|REVIEW OF SYSTEMS Right pleural effusion, firm nodules, diffuse scattered thr...|Header|  REVIEW OF SYSTEMS|
+--------------------------------------------------------------------------------+------+-------------------+

val data = Seq(text,text).toDS.toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("doc")
val regexMatcher = new RegexMatcher().setInputCols("doc").setOutputCol("chunks").setExternalRules("src/test/resources/chunker/title_regex.txt",",")
val chunkSentenceSplitter = new ChunkSentenceSplitter().setInputCols("chunks","doc").setOutputCol("paragraphs")
val pipeline =  new Pipeline().setStages(Array(documentAssembler,regexMatcher,chunkSentenceSplitter))
val result = pipeline.fit(data).transform(data).select("paragraphs")
result.show(truncate = false)

ContextualParser

Extracts entities from a document based on user-defined rules. Rule matching is based on a RegexMatcher defined in a JSON file; this file defines the regex to match along with the information that will be output in the metadata field. To instantiate a model, see ContextualParserApproach and its accompanying example.
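
To reuse a parser that was previously fitted with ContextualParserApproach, a minimal sketch (the saved-model path below is hypothetical) could look like this:

contextual_parser_model = medical.ContextualParserModel.load("/path/to/saved_contextual_parser_model") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("entity")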

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Python API: ContextualParserModel Scala API: ContextualParserModel

Creates a model that extracts entities from a document based on user-defined rules. Rule matching is based on a RegexMatcher defined in a JSON file, which is set through the parameter setJsonPath(). This JSON file defines the regex to match along with the information that will be output in the metadata field. Additionally, a dictionary can be provided with setDictionary to map extracted entities to a unified representation. The first column of the dictionary file should be the normalized representation, and the following columns the possible matches.
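
For instance, a hypothetical dictionary file passed to setDictionary (here comma-separated) could map several surface forms of the same stage code to one normalized value:

T1,t1,T1a,T1b
T2,t2,T2a,T2b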

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Python API: ContextualParserApproach Scala API: ContextualParserApproach
Show Example

from johnsnowlabs import *
# An example JSON file `regex_token.json` can look like this:
#
# {
#    "entity": "Stage",
#    "ruleScope": "sentence",
#    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
#    "matchScope": "token"
#  }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

# Define the parser (json file needs to be provided)
data = spark.createDataFrame([["A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... "]]).toDF("text")

contextualParser = medical.ContextualParserApproach() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("entity") \
  .setJsonPath("/path/to/regex_token.json") \
  .setCaseSensitive(True) \
  .setContextMatch(False)

pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ])

result = pipeline.fit(data).transform(data)

# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)
+-------------------------------------------------------------------------------------------------------------------------+
|col                                                                                                                      |
+-------------------------------------------------------------------------------------------------------------------------+
|{chunk, 32, 39, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}   |
|{chunk, 49, 50, T5, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}         |
|{chunk, 148, 156, cT4bcN2M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 1}, []}|
|{chunk, 189, 194, T?N3M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 2}, []}   |
|{chunk, 316, 323, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 3}, []} |
+-------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import *
# An example JSON file `regex_token.json` can look like this:
#
# {
#    "entity": "Stage",
#    "ruleScope": "sentence",
#    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
#    "matchScope": "token"
#  }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

# Define the parser (json file needs to be provided)

contextualParser = finance.ContextualParserApproach() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("entity") \
  .setJsonPath("/path/to/regex_token.json") \
  .setCaseSensitive(True) \
  .setContextMatch(False)

pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ])
from johnsnowlabs import *
# An example JSON file `regex_token.json` can look like this:
#
# {
#    "entity": "Stage",
#    "ruleScope": "sentence",
#    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
#    "matchScope": "token"
#  }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

# Define the parser (json file needs to be provided)

contextualParser = legal.ContextualParserApproach() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("entity") \
  .setJsonPath("/path/to/regex_token.json") \
  .setCaseSensitive(True) \
  .setContextMatch(False)

pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ])

from johnsnowlabs import * 
// An example JSON file `regex_token.json` can look like this:
//
// {
//    "entity": "Stage",
//    "ruleScope": "sentence",
//    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
//    "matchScope": "token"
//  }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new nlp.SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Define the parser (json file needs to be provided)
val data = Seq("A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... ").toDF("text")
val contextualParser = new medical.ContextualParserApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("entity")
  .setJsonPath("/path/to/regex_token.json")
  .setCaseSensitive(true)
  .setContextMatch(false)
val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ))

val result = pipeline.fit(data).transform(data)

// Show Results
//
// result.selectExpr("explode(entity)").show(5, truncate=false)
// +-------------------------------------------------------------------------------------------------------------------------+
// |col                                                                                                                      |
// +-------------------------------------------------------------------------------------------------------------------------+
// |{chunk, 32, 39, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}   |
// |{chunk, 49, 50, T5, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []}         |
// |{chunk, 148, 156, cT4bcN2M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 1}, []}|
// |{chunk, 189, 194, T?N3M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 2}, []}   |
// |{chunk, 316, 323, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 3}, []} |
// +-------------------------------------------------------------------------------------------------------------------------+
//
from johnsnowlabs import * 
// An example JSON file `regex_token.json` can look like this:
//
// {
//    "entity": "Stage",
//    "ruleScope": "sentence",
//    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
//    "matchScope": "token"
//  }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new nlp.SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Define the parser (json file needs to be provided)
val data = Seq("A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... ").toDF("text")
val contextualParser = new finance.ContextualParserApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("entity")
  .setJsonPath("/path/to/regex_token.json")
  .setCaseSensitive(true)
  .setContextMatch(false)
val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ))
from johnsnowlabs import * 
// An example JSON file `regex_token.json` can look like this:
//
// {
//    "entity": "Stage",
//    "ruleScope": "sentence",
//    "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
//    "matchScope": "token"
//  }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new nlp.SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

// Define the parser (json file needs to be provided)
val data = Seq("A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... ").toDF("text")
val contextualParser = new legal.ContextualParserApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("entity")
  .setJsonPath("/path/to/regex_token.json")
  .setCaseSensitive(true)
  .setContextMatch(false)
val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    contextualParser
  ))

DateNormalizer

Model

This annotator transforms date mentions into a common standard format: YYYY/MM/DD. It is useful when using data from different sources, sometimes from different countries, that represent dates in different formats.

For relative dates (next year, past month, etc.), you can define an anchor date used to compute the normalized date by setting the parameters anchorDateYear, anchorDateMonth, and anchorDateDay.

The resulting chunk date will contain a metadata entry indicating whether the normalization was successful or not (True / False).
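
For example, assuming result is the transformed DataFrame produced by a pipeline such as the one below (with output column "date"), this flag can be read directly from the chunk metadata (a minimal sketch):

result.selectExpr("explode(date) as d") \
      .selectExpr("d.result as normalized_date", "d.metadata['normalized'] as was_normalized") \
      .show(truncate=False)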

Input Annotator Types: CHUNK

Output Annotator Type: CHUNK

Python API: DateNormalizer Scala API: DateNormalizer
Show Example

from pyspark.sql.types import StringType

dates = [
    "08/02/2018",
    "11/2018",
    "11/01/2018",
    "12Mar2021",
    "Jan 30, 2018",
    "13.04.1999",
    "3April 2020",
    "next monday",
    "today",
    "next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")

document_assembler = (
    DocumentAssembler().setInputCol("original_date").setOutputCol("document")
)

doc2chunk = Doc2Chunk().setInputCols("document").setOutputCol("date_chunk")

date_normalizer = (
    DateNormalizer()
    .setInputCols("date_chunk")
    .setOutputCol("date")
    .setAnchorDateYear(2000)
    .setAnchorDateMonth(3)
    .setAnchorDateDay(15)
)

pipeline = Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])

result = pipeline.fit(df).transform(df)
result.selectExpr(
    "date.result as normalized_date",
    "original_date",
    "date.metadata[0].normalized as metadata",
).show()

+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
|   [2018/08/02]|   08/02/2018|    true|
|   [2018/11/DD]|      11/2018|    true|
|   [2018/11/01]|   11/01/2018|    true|
|   [2021/03/12]|    12Mar2021|    true|
|   [2018/01/30]| Jan 30, 2018|    true|
|   [1999/04/13]|   13.04.1999|    true|
|   [2020/04/03]|  3April 2020|    true|
|   [2000/03/20]|  next monday|    true|
|   [2000/03/15]|        today|    true|
|   [2000/03/22]|    next week|    true|
+---------------+-------------+--------+

val df = Seq(("08/02/2018"),("11/2018"),("11/01/2018"),("next monday"),("today"),("next week")).toDF("original_date")

val documentAssembler = new DocumentAssembler().setInputCol("original_date").setOutputCol("document")

val chunksDF = documentAssembler
				  .transform(df)
				  .mapAnnotationsCol[Seq[Annotation]]("document",
													  "chunk_date",
													   CHUNK,
												  (aa:Seq[Annotation]) =>
													aa.map( ann => ann.copy(annotatorType = CHUNK)))
val dateNormalizerModel = new DateNormalizer()
        .setInputCols("chunk_date")
        .setOutputCol("date")
        .setAnchorDateDay(15)
        .setAnchorDateMonth(3)
        .setAnchorDateYear(2000)
val dateDf = dateNormalizerModel.transform(chunksDF)

dateDf.select("chunk_date.result","original_date").show()
+-------------+-------------+
|       result|original_date|
+-------------+-------------+
| [08/02/2018]|   08/02/2018|
|    [11/2018]|      11/2018|
| [11/01/2018]|   11/01/2018|
|[next monday]|  next monday|
|      [today]|        today|
|  [next week]|    next week|
+-------------+-------------+

DeIdentification

Deidentifies Input Annotations of types DOCUMENT, TOKEN and CHUNK, by either masking or obfuscating the given CHUNKS.

To create a configured DeIdentificationModel, please see the example of DeIdentification.

Input Annotator Types: DOCUMENT, TOKEN, CHUNK

Output Annotator Type: DOCUMENT

Python API: DeIdentificationModel Scala API: DeIdentificationModel
Show Example

from johnsnowlabs import * 
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert_embeddings")

fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter =  finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"

ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
    .setInputCols(["sentence", "token", "bert_embeddings"]) \
    .setOutputCol("ner_finner") 
    #.setLabelCasing("upper")

ner_converter_finner = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_finner"]) \
    .setOutputCol("ner_finner_chunk") \
    .setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER

chunk_merge =  finance.ChunkMergeApproach()\
    .setInputCols("ner_finner_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

deidentification =  finance.DeIdentification() \
    .setInputCols(["sentence", "token", "deid_merged_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setIgnoreRegex(True)

# Pipeline
data = spark.createDataFrame([
    ["Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"]
]).toDF("text")

nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge,
      deidentification])

result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import * 
documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"})

ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_signers") 
    #.setLabelCasing("upper")

ner_converter_signers = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner_signers"]) \
    .setOutputCol("ner_signer_chunk")

chunk_merge = legal.ChunkMergeApproach()\
    .setInputCols("ner_signer_chunk", "ner_chunk")\
    .setOutputCol("deid_merged_chunk")

deidentification = legal.DeIdentification() \
    .setInputCols(["sentence", "token", "deid_merged_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setIgnoreRegex(True)

# Pipeline
data = spark.createDataFrame([
    ["ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.

2THEMART.COM, INC.:                         I-ESCROW, INC.:

By:Dominic J. Magliarditi                By:Sanjay Bajaj Name: Dominic J. Magliarditi                Name: Sanjay Bajaj Title: President                            Title: VP Business Development Date: 6/21/99                               Date: 6/11/99 "]
]).toDF("text")

nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge,
      deidentification])

result = nlpPipeline.fit(data).transform(data)

from johnsnowlabs import * 
val documentAssembler = new nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new nlp.SentenceDetector()
    .setInputCols(["document"])
    .setOutputCol("sentence")

val tokenizer = new nlp.Tokenizer()
    .setInputCols(["sentence"])
    .setOutputCol("token")

val embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("bert_embeddings")

val fin_ner = finance.NerModel.pretrained("finner_deid", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    //.setLabelCasing("upper")

val ner_converter = new finance.NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels(Map("ORG" -> "PARTY")) // Replace "ORG" entity with "PARTY"

val ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "bert_embeddings"))
    .setOutputCol("ner_finner")
    //.setLabelCasing("upper")

val ner_converter_finner = new nlp.NerConverter()
    .setInputCols(Array("sentence", "token", "ner_finner"))
    .setOutputCol("ner_finner_chunk")
    .setWhiteList(Array("ROLE")) // Only keep "ROLE" entities from this NER

val chunk_merge = new finance.ChunkMergeApproach()
    .setInputCols(Array("ner_finner_chunk", "ner_chunk"))
    .setOutputCol("deid_merged_chunk")

val deidentification = new finance.DeIdentification()
    .setInputCols(Array("sentence", "token", "deid_merged_chunk"))
    .setOutputCol("deidentified")
    .setMode("mask")
    .setIgnoreRegex(true)

// Pipeline
val data = Seq("Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon").toDF("text")

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      bert_embeddings,
      fin_ner,
      ner_converter,
      ner_finner,
      ner_converter_finner,
      chunk_merge,
      deidentification))

val result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import * 
val documentAssembler = new nlp.DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = new nlp.SentenceDetector()
    .setInputCols(["document"])
    .setOutputCol("sentence")

val tokenizer = new nlp.Tokenizer()
    .setInputCols(["sentence"])
    .setOutputCol("token")

val embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner") 
    //.setLabelCasing("upper")

val ner_converter = new legal.NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels(Map("ALIAS" -> "PARTY"))

val ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner_signers")
    //.setLabelCasing("upper")

val ner_converter_signers = new nlp.NerConverter()
    .setInputCols(Array("sentence", "token", "ner_signers"))
    .setOutputCol("ner_signer_chunk")

val chunk_merge = new legal.ChunkMergeApproach()
    .setInputCols(Array("ner_signer_chunk", "ner_chunk"))
    .setOutputCol("deid_merged_chunk")

val deidentification = new legal.DeIdentification()
    .setInputCols(Array("sentence", "token", "deid_merged_chunk"))
    .setOutputCol("deidentified")
    .setMode("mask")
    .setIgnoreRegex(true)

// Pipeline
val data = Seq("""ENTIRE AGREEMENT.  This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.

2THEMART.COM, INC.:                         I-ESCROW, INC.:

By:Dominic J. Magliarditi                By:Sanjay Bajaj Name: Dominic J. Magliarditi                Name: Sanjay Bajaj Title: President                            Title: VP Business Development Date: 6/21/99                               Date: 6/11/99 """).toDF("text")

val nlpPipeline = new Pipeline().setStages(Array(
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter,
      ner_signers,
      ner_converter_signers,
      chunk_merge,
      deidentification))

val result = nlpPipeline.fit(data).transform(data)

Contains all the methods for training a DeIdentificationModel. This module can obfuscate or mask entities that contain personal information. These can be set with a file of regex patterns with setRegexPatternsDictionary, where each line is a mapping of entity to regex.

DATE \d{4}
AID \d{6,7}

Additionally, obfuscation strings can be defined with setObfuscateRefFile, where each line is a mapping of string to entity. The format and separator can be specified with setRefFileFormat and setRefSep.

Dr. Gregory House#DOCTOR
01010101#MEDICALRECORD

Ideally, this annotator works in conjunction with demographic Named Entity Recognizers that can be trained using TextMatchers, RegexMatchers, DateMatchers, NerCRFs or NerDLs.

Input Annotator Types: DOCUMENT, TOKEN, CHUNK

Output Annotator Type: DOCUMENT

Python API: DeIdentification Scala API: DeIdentification
Show Example

from johnsnowlabs import *

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Ner entities
clinical_sensitive_entities = medical.NerModel \
    .pretrained("ner_deid_enriched", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")

nerConverter = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

# Deidentification
# setRegexPatternsDictionary: file with custom regex patterns for custom entities
# setObfuscateRefFile: file with custom obfuscator names for the entities
deIdentification = medical.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
    .setRefFileFormat("csv") \
    .setRefSep("#") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("file")

# Pipeline
data = spark.createDataFrame([
    ["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
]).toDF("text")

pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    clinical_sensitive_entities,
    nerConverter,
    deIdentification
])
result = pipeline.fit(data).transform(data)

# Show Results
result.select("dei.result").show(truncate = False)
+--------------------------------------------------------------------------------------------------+
|result                                                                                            |
+--------------------------------------------------------------------------------------------------+
|[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
+--------------------------------------------------------------------------------------------------+
from johnsnowlabs import *

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Ner entities
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

nerConverter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_con")

# Deidentification
# setRegexPatternsDictionary: file with custom regex patterns for custom entities
# setObfuscateRefFile: file with custom obfuscator names for the entities
deIdentification = finance.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
    .setRefFileFormat("csv") \
    .setRefSep("#") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("file")

# Pipeline
pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    nerConverter,
    deIdentification
])
from johnsnowlabs import *

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)

tokenizer = nlp.Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel \
    .pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Ner entities
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

nerConverter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_con")

# Deidentification
# setRegexPatternsDictionary: file with custom regex patterns for custom entities
# setObfuscateRefFile: file with custom obfuscator names for the entities
deIdentification = legal.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
    .setRefFileFormat("csv") \
    .setRefSep("#") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("file")

# Pipeline
pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    ner_model,
    nerConverter,
    deIdentification
])

from johnsnowlabs import * 
val documentAssembler = new nlp.DocumentAssembler()
     .setInputCol("text")
     .setOutputCol("document")

 val sentenceDetector = new nlp.SentenceDetector()
     .setInputCols(Array("document"))
     .setOutputCol("sentence")
     .setUseAbbreviations(true)

 val tokenizer = new nlp.Tokenizer()
     .setInputCols(Array("sentence"))
     .setOutputCol("token")

 val embeddings = nlp.WordEmbeddingsModel
     .pretrained("embeddings_clinical", "en", "clinical/models")
     .setInputCols(Array("sentence", "token"))
     .setOutputCol("embeddings")

// Ner entities
val clinical_sensitive_entities = medical.NerModel.pretrained("ner_deid_enriched", "en", "clinical/models")
        .setInputCols(Array("sentence", "token", "embeddings")).setOutputCol("ner")

 val nerConverter = new nlp.NerConverter()
     .setInputCols(Array("sentence", "token", "ner"))
     .setOutputCol("ner_con")

// Deidentification
val deIdentification = new medical.DeIdentification()
     .setInputCols(Array("ner_chunk", "token", "sentence"))
     .setOutputCol("dei")
     // file with custom regex patterns for custom entities
     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
     // file with custom obfuscator names for the entities
     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
     .setRefFileFormat("csv")
     .setRefSep("#")
     .setMode("obfuscate")
     .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
     .setObfuscateDate(true)
     .setDateTag("DATE")
     .setDays(5)
     .setObfuscateRefSource("file")

// Pipeline
val data = Seq(
  "# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."
).toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  clinical_sensitive_entities,
  nerConverter,
  deIdentification
))
val result = pipeline.fit(data).transform(data)

result.select("dei.result").show(truncate = false)

// Show Results
//
// result.select("dei.result").show(truncate = false)
// +--------------------------------------------------------------------------------------------------+
// |result                                                                                            |
// +--------------------------------------------------------------------------------------------------+
// |[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
// +--------------------------------------------------------------------------------------------------+
//
from johnsnowlabs import * 
val documentAssembler = new nlp.DocumentAssembler()
     .setInputCol("text")
     .setOutputCol("document")

 val sentenceDetector = new nlp.SentenceDetector()
     .setInputCols(Array("document"))
     .setOutputCol("sentence")
     .setUseAbbreviations(true)

 val tokenizer = new nlp.Tokenizer()
     .setInputCols(Array("sentence"))
     .setOutputCol("token")

 val embeddings = nlp.WordEmbeddingsModel
     .pretrained("embeddings_clinical", "en", "clinical/models")
     .setInputCols(Array("sentence", "token"))
     .setOutputCol("embeddings")

// Ner entities
val ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
     .setInputCols(Array("sentence", "token", "embeddings"))
     .setOutputCol("ner")

 val nerConverter = new nlp.NerConverter()
     .setInputCols(Array("sentence", "token", "ner"))
     .setOutputCol("ner_con")

// Deidentification
val deIdentification = new finance.DeIdentification()
     .setInputCols(Array("ner_chunk", "token", "sentence"))
     .setOutputCol("dei")
     // file with custom regex patterns for custom entities
     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
     // file with custom obfuscator names for the entities
     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
     .setRefFileFormat("csv")
     .setRefSep("#")
     .setMode("obfuscate")
     .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
     .setObfuscateDate(true)
     .setDateTag("DATE")
     .setDays(5)
     .setObfuscateRefSource("file")

// Pipeline

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  nerConverter,
  deIdentification
))
from johnsnowlabs import * 
val documentAssembler = new nlp.DocumentAssembler()
     .setInputCol("text")
     .setOutputCol("document")

 val sentenceDetector = new nlp.SentenceDetector()
     .setInputCols(Array("document"))
     .setOutputCol("sentence")
     .setUseAbbreviations(true)

 val tokenizer = new nlp.Tokenizer()
     .setInputCols(Array("sentence"))
     .setOutputCol("token")

 val embeddings = nlp.WordEmbeddingsModel
     .pretrained("embeddings_clinical", "en", "clinical/models")
     .setInputCols(Array("sentence", "token"))
     .setOutputCol("embeddings")

// Ner entities
val ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

 val nerConverter = new nlp.NerConverter()
     .setInputCols(Array("sentence", "token", "ner"))
     .setOutputCol("ner_con")

// Deidentification
val deIdentification = new legal.DeIdentification()
     .setInputCols(Array("ner_chunk", "token", "sentence"))
     .setOutputCol("dei")
     // file with custom regex patterns for custom entities
     .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
     // file with custom obfuscator names for the entities
     .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
     .setRefFileFormat("csv")
     .setRefSep("#")
     .setMode("obfuscate")
     .setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
     .setObfuscateDate(true)
     .setDateTag("DATE")
     .setDays(5)
     .setObfuscateRefSource("file")

// Pipeline

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  nerConverter,
  deIdentification
))

Doc2ChunkInternal

Model

Converts DOCUMENT and TOKEN type annotations into CHUNK type annotations using the contents of a chunkCol. The chunk text must be contained within the input DOCUMENT. The chunkCol may be either StringType or ArrayType[StringType] (set with setIsArray). Useful for annotators that require a CHUNK type input.

For more extended examples on document pre-processing see the Spark NLP Workshop.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

Python API: Doc2ChunkInternal Scala API: Doc2ChunkInternal
Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
chunkAssembler = (
    Doc2ChunkInternal()
    .setInputCols("document", "token")
    .setChunkCol("target")
    .setOutputCol("chunk")
    .setIsArray(True)
)

data = spark.createDataFrame(
    [
        [
            "Spark NLP is an open-source text processing library for advanced natural language processing.",
            ["Spark NLP", "text processing library", "natural language processing"],
        ]
    ]
).toDF("text", "target")

pipeline = (
    Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler]).fit(data)
)

result = pipeline.transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+

DocumentHashCoder

Model

This annotator can replace dates in a column of DOCUMENT type according to the hash code of another column. It uses the hash of the specified column to create a new document column containing the day-shift information. Downstream, the DeIdentification annotator deidentifies the document using the shifted date information.

If the specified column contains strings that can be parsed to integers, those numbers are used to shift the dates accordingly.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: DocumentHashCoder Scala API: DocumentHashCoder
Show Example

import pandas as pd


data = pd.DataFrame(
    {'patientID' : ['A001', 'A001', 
                    'A003', 'A003'],
     'text' : ['Chris Brown was discharged on 10/02/2022', 
               'Mark White was discharged on 10/04/2022', 
               'John was discharged on 15/03/2022',
               'John Moore was discharged on 15/12/2022'
              ],
     'dateshift' : ['10', '10', 
                    '30', '30']
    }
)

my_input_df = spark.createDataFrame(data)

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

documentHasher = DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setDateShiftColumn("dateshift")

tokenizer = Tokenizer()\
    .setInputCols(["document2"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document2", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel\
    .pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
    .setInputCols(["document2","token", "word_embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["document2", "token", "ner"])\
    .setOutputCol("ner_chunk")

de_identification = DeIdentification() \
    .setInputCols(["ner_chunk", "token", "document2"]) \
    .setOutputCol("deid_text") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setLanguage("en") \
    .setObfuscateRefSource('faker') \
    .setUseShifDays(True)

pipeline_col = Pipeline().setStages([
    documentAssembler,
    documentHasher,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter,
    de_identification
])

empty_data = spark.createDataFrame([["", "", ""]]).toDF("patientID","text", "dateshift")
pipeline_col_model = pipeline_col.fit(empty_data)

output = pipeline_col_model.transform(my_input_df)
output.select('text', 'dateshift', 'deid_text.result').show(truncate = False)

+----------------------------------------+---------+----------------------------------------------+
|text                                    |dateshift|result                                        |
+----------------------------------------+---------+----------------------------------------------+
|Chris Brown was discharged on 10/02/2022|10       |[Ellender Manual was discharged on 20/02/2022]|
|Mark White was discharged on 10/04/2022 |10       |[Errol Bang was discharged on 20/04/2022]     |
|John was discharged on 15/03/2022       |30       |[Ariel Null was discharged on 14/04/2022]     |
|John Moore was discharged on 15/12/2022 |30       |[Jean Cotton was discharged on 14/01/2023]    |
+----------------------------------------+---------+----------------------------------------------+

DocumentLogRegClassifier

Classifies documents with a Logarithmic Regression algorithm. Currently there are no pretrained models available. Please see DocumentLogRegClassifierApproach to train your own model.

Please check out the Models Hub for available models in the future.

Input Annotator Types: TOKEN

Output Annotator Type: CATEGORY

Python API: DocumentLogRegClassifierModel Scala API: DocumentLogRegClassifierModel

Trains a model to classify documents with a Logarithmic Regression algorithm. Training data requires columns for text and their label. The result is a trained DocumentLogRegClassifierModel.

Input Annotator Types: TOKEN

Output Annotator Type: CATEGORY

Python API: DocumentLogRegClassifierApproach Scala API: DocumentLogRegClassifierApproach
Show Example

from johnsnowlabs import *
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = nlp.Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

stopwords_cleaner = nlp.StopWordsCleaner() \
  .setInputCols(["normalized"]) \
  .setOutputCol("cleanTokens") \
  .setCaseSensitive(False)

stemmer = nlp.Stemmer() \
  .setInputCols(["cleanTokens"]) \
  .setOutputCol("stem")

# Define the document classifier and fit training data to it
logreg = medical.DocumentLogRegClassifierApproach() \
  .setInputCols(["stem"]) \
  .setLabelCol("category") \
  .setOutputCol("prediction")

pipeline = Pipeline(stages=[
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
])

model = pipeline.fit(trainingData)
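
Once fitted, the pipeline can be used for inference. A minimal sketch, assuming a hypothetical testData DataFrame with a "text" column; the trained classifier writes its predicted labels to the "prediction" column:

result = model.transform(testData)
result.select("text", "prediction.result").show(truncate=False)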
from johnsnowlabs import *
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = nlp.Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

stopwords_cleaner = nlp.StopWordsCleaner() \
  .setInputCols(["normalized"]) \
  .setOutputCol("cleanTokens") \
  .setCaseSensitive(False)

stemmer = nlp.Stemmer() \
  .setInputCols(["cleanTokens"]) \
  .setOutputCol("stem")

# Define the document classifier and fit training data to it
logreg = finance.DocumentLogRegClassifierApproach() \
  .setInputCols(["stem"]) \
  .setLabelCol("category") \
  .setOutputCol("prediction")

pipeline = Pipeline(stages=[
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
])

model = pipeline.fit(trainingData)
from johnsnowlabs import *
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

normalizer = nlp.Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized")

stopwords_cleaner = nlp.StopWordsCleaner() \
  .setInputCols(["normalized"]) \
  .setOutputCol("cleanTokens") \
  .setCaseSensitive(False)

stemmer = nlp.Stemmer() \
  .setInputCols(["cleanTokens"]) \
  .setOutputCol("stem")

# Define the document classifier and fit training data to it
logreg = legal.DocumentLogRegClassifierApproach() \
  .setInputCols(["stem"]) \
  .setLabelCol("category") \
  .setOutputCol("prediction")

pipeline = Pipeline(stages=[
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
])

model = pipeline.fit(trainingData)

from johnsnowlabs import * 
// Define pipeline stages to prepare the data
val document_assembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new nlp.Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new nlp.Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwords_cleaner = new nlp.StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val stemmer = new nlp.Stemmer()
  .setInputCols("cleanTokens")
  .setOutputCol("stem")

// Define the document classifier and fit training data to it
val logreg = new medical.DocumentLogRegClassifierApproach()
  .setInputCols("stem")
  .setLabelCol("category")
  .setOutputCol("prediction")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
))

val model = pipeline.fit(trainingData)
// Define pipeline stages to prepare the data
val document_assembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new nlp.Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new nlp.Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwords_cleaner = new nlp.StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val stemmer = new nlp.Stemmer()
  .setInputCols("cleanTokens")
  .setOutputCol("stem")

// Define the document classifier and fit training data to it
val logreg = new finance.DocumentLogRegClassifierApproach()
  .setInputCols("stem")
  .setLabelCol("category")
  .setOutputCol("prediction")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
))

val model = pipeline.fit(trainingData)
// Define pipeline stages to prepare the data
val document_assembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new nlp.Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new nlp.Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwords_cleaner = new nlp.StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val stemmer = new nlp.Stemmer()
  .setInputCols("cleanTokens")
  .setOutputCol("stem")

// Define the document classifier and fit training data to it
val logreg = new legal.DocumentLogRegClassifierApproach()
  .setInputCols("stem")
  .setLabelCol("category")
  .setOutputCol("prediction")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  tokenizer,
  normalizer,
  stopwords_cleaner,
  stemmer,
  logreg
))

val model = pipeline.fit(trainingData)

DrugNormalizer

Model

Annotator which normalizes raw text from clinical documents, e.g. scraped web pages or XML documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can remove unwanted characters according to a specific removal policy and can apply lowercase normalization.

See Spark NLP Workshop for more examples of usage.
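A minimal configuration sketch of those options follows; the setPolicy and setLowercase parameters and the chosen values are assumptions for illustration, not recommended settings.

from johnsnowlabs import *
# Minimal sketch (assumed parameters: setPolicy / setLowercase;
# "all" would apply both abbreviation and dosage normalization).
drug_normalizer = medical.DrugNormalizer() \
  .setInputCols(["document"]) \
  .setOutputCol("document_normalized") \
  .setPolicy("all") \
  .setLowercase(True)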

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: DrugNormalizer Scala API: DrugNormalizer
Show Example

from johnsnowlabs import *
data = spark.createDataFrame([
  ["Sodium Chloride/Potassium Chloride 13bag"],
  ["interferon alfa-2b 10 million unit ( 1 ml ) injec"],
  ["aspirin 10 meq/ 5 ml oral sol"]
]).toDF("text")
document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
drugNormalizer = medical.DrugNormalizer().setInputCols(["document"]).setOutputCol("document_normalized")

trainingPipeline = Pipeline(stages=[document, drugNormalizer])
result = trainingPipeline.fit(data).transform(data)

result.selectExpr("explode(document_normalized.result) as normalized_text").show(truncate=False)
+----------------------------------------------------+
|normalized_text                                     |
+----------------------------------------------------+
|Sodium Chloride / Potassium Chloride 13 bag         |
|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|aspirin 2 meq/ml oral solution                      |
+----------------------------------------------------+
from johnsnowlabs import *

document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
drugNormalizer = finance.DrugNormalizer().setInputCols(["document"]).setOutputCol("document_normalized")

trainingPipeline = Pipeline(stages=[document, drugNormalizer])
from johnsnowlabs import *

document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
drugNormalizer = legal.DrugNormalizer().setInputCols(["document"]).setOutputCol("document_normalized")

trainingPipeline = Pipeline(stages=[document, drugNormalizer])

import spark.implicits._
val data = Seq(
  ("Sodium Chloride/Potassium Chloride 13bag"),
  ("interferon alfa-2b 10 million unit ( 1 ml ) injec"),
  ("aspirin 10 meq/ 5 ml oral sol")
).toDF("text")
val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val drugNormalizer = new medical.DrugNormalizer().setInputCols("document").setOutputCol("document_normalized")

val trainingPipeline = new Pipeline().setStages(Array(document, drugNormalizer))
val result = trainingPipeline.fit(data).transform(data)

result.selectExpr("explode(document_normalized.result) as normalized_text").show(false)
+----------------------------------------------------+
|normalized_text                                     |
+----------------------------------------------------+
|Sodium Chloride / Potassium Chloride 13 bag         |
|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|aspirin 2 meq/ml oral solution                      |
+----------------------------------------------------+

val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val drugNormalizer = new finance.DrugNormalizer().setInputCols("document").setOutputCol("document_normalized")

val trainingPipeline = new Pipeline().setStages(Array(document, drugNormalizer))

val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val drugNormalizer = new legal.DrugNormalizer().setInputCols("document").setOutputCol("document_normalized")

val trainingPipeline = new Pipeline().setStages(Array(document, drugNormalizer))

EntityChunkEmbeddings

Model

Weighted average embeddings of multiple named entities chunk annotations.

Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector representation of related entity chunks. The input to the model consists of chunks of recognized named entities. One or more entities are selected as target entities, and for each of them a list of related entities is specified (if empty, all other entities are assumed to be related).

The model looks for chunks of the target entities and then tries to pair each target entity (e.g. DRUG) with other related entities (e.g. DOSAGE, STRENGTH, FORM, etc.). The criterion for pairing a target entity with a related entity is that they appear in the same sentence and that the maximal syntactic distance between them is below a predefined threshold.

The relationship between target and related entities is one-to-many, meaning that if there are multiple instances of the same target entity (e.g. DRUG) within a sentence, the model will map a related entity (e.g. DOSAGE) to at most one of the instances of the target entity. For example, in the sentence “The patient was given 125 mg of paracetamol and metformin”, the model will pair “125 mg” with “paracetamol”, but not with “metformin”.

The output of the model is an average embeddings of the chunks of each of the target entities and their related entities. It is possible to specify a particular weight for each entity type.

An entity can be defined both as a target entity and as a related entity of some other target entity. For example, we may want to compute the embeddings of SYMPTOMs and their related entities, as well as the embeddings of DRUGs and their related entities, one of which is also SYMPTOM. In such cases, it is possible to use the TARGET_ENTITY:RELATED_ENTITY notation to specify the weight of a related entity (e.g. “DRUG:SYMPTOM” sets the weight of SYMPTOM when it appears as a related entity of the target entity DRUG). The relative weights of entities for a particular entity chunk embedding are available in the annotation metadata.

This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model.

The default model is "sbiobert_base_cased_mli" from clinical/models. Other available models can be found at Models Hub.
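As a minimal sketch of the weighting options described above (the entity names, the empty related-entity lists, and the weight values are illustrative assumptions, not recommended settings):

from sparknlp_jsl.annotator import *
# Sketch: two target entities. An empty list means all other entities are
# treated as related; the "DRUG:SYMPTOM" key sets the weight of SYMPTOM only
# when it appears as a related entity of a DRUG target.
entity_chunk_embeddings = EntityChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") \
    .setInputCols(["ner_chunks", "dependencies"]) \
    .setOutputCol("entity_chunk_embeddings") \
    .setMaxSyntacticDistance(3) \
    .setTargetEntities({"DRUG": [], "SYMPTOM": []}) \
    .setEntityWeights({"DRUG": 0.8, "SYMPTOM": 0.7, "DRUG:SYMPTOM": 0.3})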

Input Annotator Types: DEPENDENCY, CHUNK

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: EntityChunkEmbeddingsModel Scala API: EntityChunkEmbeddingsModel
Show Example

import sparknlp
from sparknlp.base import *
from sparknlp_jsl.common import *
from sparknlp.annotator import *
from sparknlp.training import *
import sparknlp_jsl
from sparknlp_jsl.base import *
from sparknlp_jsl.annotator import *
from pyspark.ml import Pipeline

documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("documents")
sentence_detector = SentenceDetector() \
    .setInputCols("documents") \
    .setOutputCol("sentences")
tokenizer = Tokenizer() \
    .setInputCols("sentences") \
    .setOutputCol("tokens")
embeddings = WordEmbeddingsModel() \
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")
ner_model = MedicalNerModel()\
    .pretrained("ner_posology_large", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens", "embeddings"])\
    .setOutputCol("ner")
ner_converter = NerConverterInternal()\
    .setInputCols("sentences", "tokens", "ner")\
    .setOutputCol("ner_chunks")
pos_tager = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens")\
    .setOutputCol("pos_tags")
dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")
drug_chunk_embeddings = EntityChunkEmbeddings()\
    .pretrained("sbiobert_base_cased_mli","en","clinical/models")\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("drug_chunk_embeddings")\
    .setMaxSyntacticDistance(3)\
    .setTargetEntities({"DRUG": []})\
    .setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "DOSAGE": 0.2, "FORM": 0.5})
sampleData = "The patient was given metformin 125 mg, 250 mg of coumadin and then one pill paracetamol"
data = spark.createDataFrame([[sampleData]]).toDF("text")
pipeline = Pipeline().setStages([
    documenter,
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter,
    pos_tager,
    dependency_parser,
    drug_chunk_embeddings])
results = pipeline.fit(data).transform(data)
results = results \
    .selectExpr("explode(drug_chunk_embeddings) AS drug_chunk") \
    .selectExpr("drug_chunk.result", "slice(drug_chunk.embeddings, 1, 5) AS drug_embedding") \
    .cache()
results.show(truncate=False)
+-----------------------------+-----------------------------------------------------------------+
|                       result|                                                    drug_embedding|
+-----------------------------+-----------------------------------------------------------------+
|metformin 125 mg             |[-0.267413, 0.07614058, -0.5620966, 0.83838946, 0.8911504]       |
|250 mg coumadin              |[0.22319649, -0.07094894, -0.6885556, 0.79176235, 0.82672405]    |
|one pill paracetamol         |[-0.10939768, -0.29242, -0.3574444, 0.3981813, 0.79609615]       |
+-----------------------------+-----------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverterInternal}
import com.johnsnowlabs.nlp.annotators.embeddings.EntityChunkEmbeddings
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
   .setInputCol("text")
   .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
   .setInputCols("document")
   .setOutputCol("sentence")

val tokenizer = new Tokenizer()
   .setInputCols("sentence")
   .setOutputCol("tokens")

val wordEmbeddings = WordEmbeddingsModel
   .pretrained("embeddings_clinical", "en", "clinical/models")
   .setInputCols(Array("sentences", "tokens"))
   .setOutputCol("word_embeddings")

val nerModel = MedicalNerModel
   .pretrained("ner_posology_large", "en", "clinical/models")
   .setInputCols(Array("sentence", "tokens", "word_embeddings"))
   .setOutputCol("ner")

val nerConverter = new NerConverterInternal()
   .setInputCols("sentence", "tokens", "ner")
   .setOutputCol("ner_chunk")

val posTager = PerceptronModel
   .pretrained("pos_clinical", "en", "clinical/models")
   .setInputCols("sentences", "tokens")
   .setOutputCol("pos_tags")

val dependencyParser = DependencyParserModel
   .pretrained("dependency_conllu", "en")
   .setInputCols(Array("sentences", "pos_tags", "tokens"))
   .setOutputCol("dependencies")

val drugChunkEmbeddings = EntityChunkEmbeddings
   .pretrained("sbiobert_base_cased_mli","en","clinical/models")
   .setInputCols(Array("ner_chunks", "dependencies"))
   .setOutputCol("drug_chunk_embeddings")
   .setMaxSyntacticDistance(3)
   .setTargetEntities(Map("DRUG" -> List()))
   .setEntityWeights(Map[String, Float]("DRUG" -> 0.8f, "STRENGTH" -> 0.2f, "DOSAGE" -> 0.2f, "FORM" -> 0.5f))

val pipeline = new Pipeline()
     .setStages(Array(
         documentAssembler,
         sentenceDetector,
         tokenizer,
         wordEmbeddings,
         nerModel,
         nerConverter,
         posTager,
         dependencyParser,
         drugChunkEmbeddings))

val sampleText = "The patient was given metformin 125 mg, 250 mg of coumadin and then one pill paracetamol."

val testDataset = Seq("").toDS.toDF("text")
val result = pipeline.fit(emptyDataset).transform(testDataset)

result
   .selectExpr("explode(drug_chunk_embeddings) AS drug_chunk")
   .selectExpr("drug_chunk.result", "slice(drug_chunk.embeddings, 1, 5) AS drugEmbedding")
   .show(truncate=false)

+-----------------------------+-----------------------------------------------------------------+
|                       result|                                                    drugEmbedding|
+-----------------------------+-----------------------------------------------------------------+
|metformin 125 mg             |[-0.267413, 0.07614058, -0.5620966, 0.83838946, 0.8911504]       |
|250 mg coumadin              |[0.22319649, -0.07094894, -0.6885556, 0.79176235, 0.82672405]    |
|one pill paracetamol         |[-0.10939768, -0.29242, -0.3574444, 0.3981813, 0.79609615]       |
+-----------------------------+-----------------------------------------------------------------+

FeaturesAssembler

Approach

The FeaturesAssembler is used to collect features from different columns. It can collect features from single-value columns (anything that can be cast to a float; if the cast fails, the value is set to 0), array columns, or Spark NLP annotations (if the annotation is an embedding, it takes the embedding, otherwise it tries to cast the result field). The output of the transformer is a FEATURE_VECTOR annotation (the numeric vector is in the embeddings field).
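For instance, a plain numeric column can be combined with sentence embeddings produced upstream; a minimal sketch (the age and sentence_embeddings column names are assumptions about the input DataFrame):

from johnsnowlabs import *
# Sketch: mix a castable numeric column with an annotation column.
# "age" and "sentence_embeddings" are assumed to exist in the input data.
features_asm = medical.FeaturesAssembler() \
  .setInputCols(["age", "sentence_embeddings"]) \
  .setOutputCol("features")

# The numeric vector of the resulting FEATURE_VECTOR annotation is stored
# in its embeddings field, e.g.:
# result.selectExpr("features[0].embeddings").show(truncate=False)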

Input Annotator Types: NONE

Output Annotator Type: "feature_vector"

Python API: FeaturesAssembler Scala API: FeaturesAssembler
Show Example

from johnsnowlabs import * 
features_asm = medical.FeaturesAssembler() \
  .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
  .setOutputCol("features")

gen_clf = medical.GenericClassifierApproach() \
  .setLabelColumn("target") \
  .setInputCols(["features"]) \
  .setOutputCol("prediction") \
  .setModelFile("/path/to/graph_file.pb") \
  .setEpochsNumber(50) \
  .setBatchSize(100) \
  .setFeatureScaling("zscore") \
  .setLearningRate(0.001) \
  .setFixImbalance(True) \
  .setOutputLogsPath("logs") \
  .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = Pipeline(stages=[
  features_asm,
  gen_clf
])

clf_model = pipeline.fit(data)
from johnsnowlabs import * 
features_asm = finance.FeaturesAssembler() \
  .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
  .setOutputCol("features")

gen_clf = finance.GenericClassifierApproach() \
  .setLabelColumn("target") \
  .setInputCols(["features"]) \
  .setOutputCol("prediction") \
  .setModelFile("/path/to/graph_file.pb") \
  .setEpochsNumber(50) \
  .setBatchSize(100) \
  .setFeatureScaling("zscore") \
  .setLearningRate(0.001) \
  .setFixImbalance(True) \
  .setOutputLogsPath("logs") \
  .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = Pipeline(stages=[
  features_asm,
  gen_clf
])

clf_model = pipeline.fit(data)
from johnsnowlabs import * 
features_asm = legal.FeaturesAssembler() \
  .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
  .setOutputCol("features")

gen_clf = legal.GenericClassifierApproach() \
  .setLabelColumn("target") \
  .setInputCols(["features"]) \
  .setOutputCol("prediction") \
  .setModelFile("/path/to/graph_file.pb") \
  .setEpochsNumber(50) \
  .setBatchSize(100) \
  .setFeatureScaling("zscore") \
  .setLearningRate(0.001) \
  .setFixImbalance(True) \
  .setOutputLogsPath("logs") \
  .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = Pipeline(stages=[
  features_asm,
  gen_clf
])

clf_model = pipeline.fit(data)

val features_asm = new medical.FeaturesAssembler()
  .setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
  .setOutputCol("features")

val gen_clf = new medical.GenericClassifierApproach()
  .setLabelColumn("target")
  .setInputCols("features")
  .setOutputCol("prediction")
  .setModelFile("/path/to/graph_file.pb")
  .setEpochsNumber(50)
  .setBatchSize(100)
  .setFeatureScaling("zscore")
  .setLearningRate(0.001f)
  .setFixImbalance(true)
  .setOutputLogsPath("logs")
  .setValidationSplit(0.2f) // keep 20% of the data for validation purposes

val pipeline = new Pipeline().setStages(Array(
  features_asm,
  gen_clf
))

val clf_model = pipeline.fit(data)
val features_asm = new finance.FeaturesAssembler()
  .setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
  .setOutputCol("features")

val gen_clf = new finance.GenericClassifierApproach()
  .setLabelColumn("target")
  .setInputCols("features")
  .setOutputCol("prediction")
  .setModelFile("/path/to/graph_file.pb")
  .setEpochsNumber(50)
  .setBatchSize(100)
  .setFeatureScaling("zscore")
  .setLearningRate(0.001f)
  .setFixImbalance(true)
  .setOutputLogsPath("logs")
  .setValidationSplit(0.2f) // keep 20% of the data for validation purposes

val pipeline = new Pipeline().setStages(Array(
  features_asm,
  gen_clf
))

val clf_model = pipeline.fit(data)
val features_asm = new legal.FeaturesAssembler()
  .setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
  .setOutputCol("features")

val gen_clf = new legal.GenericClassifierApproach()
  .setLabelColumn("target")
  .setInputCols("features")
  .setOutputCol("prediction")
  .setModelFile("/path/to/graph_file.pb")
  .setEpochsNumber(50)
  .setBatchSize(100)
  .setFeatureScaling("zscore")
  .setLearningRate(0.001f)
  .setFixImbalance(true)
  .setOutputLogsPath("logs")
  .setValidationSplit(0.2f) // keep 20% of the data for validation purposes

val pipeline = new Pipeline().setStages(Array(
  features_asm,
  gen_clf
))

val clf_model = pipeline.fit(data)

GenericClassifier

Model

Creates a generic single-label classifier which uses pre-generated TensorFlow graphs. The model operates on FEATURE_VECTOR annotations, which can be produced using the FeaturesAssembler.
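At inference time a trained model can be placed right after the FeaturesAssembler; a minimal sketch, assuming a classifier previously trained with GenericClassifierApproach and saved to a hypothetical path:

from johnsnowlabs import *
# Sketch: load a previously trained model (the path is hypothetical) and
# apply it to FEATURE_VECTOR annotations produced by the FeaturesAssembler.
features_asm = medical.FeaturesAssembler() \
  .setInputCols(["feature_1", "feature_2"]) \
  .setOutputCol("features")

gen_clf_model = medical.GenericClassifierModel.load("/path/to/saved_generic_classifier") \
  .setInputCols(["features"]) \
  .setOutputCol("prediction")

pipeline = Pipeline(stages=[features_asm, gen_clf_model])
predictions = pipeline.fit(data).transform(data)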

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericClassifierModel Scala API: GenericClassifierModel

Approach

Trains a TensorFlow model for generic classification of feature vectors. It takes FEATURE_VECTOR annotations from the FeaturesAssembler as input, classifies them, and outputs CATEGORY annotations. Please see the Parameters section for the required training parameters.

For a more extensive example please see the Spark NLP Workshop.

Input Annotator Types: FEATURE_VECTOR

Output Annotator Type: CATEGORY

Python API: GenericClassifierApproach Scala API: GenericClassifierApproach
Show Example

from johnsnowlabs import *
features_asm = medical.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
    .setOutputCol("features")

gen_clf = medical.GenericClassifierApproach() \
    .setLabelColumn("target") \
    .setInputCols(["features"]) \
    .setOutputCol("prediction") \
    .setModelFile("/path/to/graph_file.pb") \
    .setEpochsNumber(50) \
    .setBatchSize(100) \
    .setFeatureScaling("zscore") \
    .setLearningRate(0.001) \
    .setFixImbalance(True) \
    .setOutputLogsPath("logs") \
    .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = Pipeline().setStages([
    features_asm,
    gen_clf
])

clf_model = pipeline.fit(data)
from johnsnowlabs import *
features_asm = finance.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
    .setOutputCol("features")

gen_clf = finance.GenericClassifierApproach() \
    .setLabelColumn("target") \
    .setInputCols(["features"]) \
    .setOutputCol("prediction") \
    .setModelFile("/path/to/graph_file.pb") \
    .setEpochsNumber(50) \
    .setBatchSize(100) \
    .setFeatureScaling("zscore") \
    .setLearningRate(0.001) \
    .setFixImbalance(True) \
    .setOutputLogsPath("logs") \
    .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = Pipeline().setStages([
    features_asm,
    gen_clf
])

clf_model = pipeline.fit(data)
from johnsnowlabs import *
features_asm = legal.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
    .setOutputCol("features")

gen_clf = legal.GenericClassifierApproach() \
    .setLabelColumn("target") \
    .setInputCols(["features"]) \
    .setOutputCol("prediction") \
    .setModelFile("/path/to/graph_file.pb") \
    .setEpochsNumber(50) \
    .setBatchSize(100) \
    .setFeatureScaling("zscore") \
    .setLearningRate(0.001) \
    .setFixImbalance(True) \
    .setOutputLogsPath("logs") \
    .setValidationSplit(0.2) # keep 20% of the data for validation purposes

pipeline = Pipeline().setStages([
    features_asm,
    gen_clf
])

clf_model = pipeline.fit(data)

val features_asm = new medical.FeaturesAssembler()
  .setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
  .setOutputCol("features")

val gen_clf = new medical.GenericClassifierApproach()
  .setLabelColumn("target")
  .setInputCols("features")
  .setOutputCol("prediction")
  .setModelFile("/path/to/graph_file.pb")
  .setEpochsNumber(50)
  .setBatchSize(100)
  .setFeatureScaling("zscore")
  .setLearningRate(0.001f)
  .setFixImbalance(true)
  .setOutputLogsPath("logs")
  .setValidationSplit(0.2f) // keep 20% of the data for validation purposes

val pipeline = new Pipeline().setStages(Array(
  features_asm,
  gen_clf
))

val clf_model = pipeline.fit(data)
val features_asm = new finance.FeaturesAssembler()
  .setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
  .setOutputCol("features")

val gen_clf = new finance.GenericClassifierApproach()
  .setLabelColumn("target")
  .setInputCols("features")
  .setOutputCol("prediction")
  .setModelFile("/path/to/graph_file.pb")
  .setEpochsNumber(50)
  .setBatchSize(100)
  .setFeatureScaling("zscore")
  .setLearningRate(0.001f)
  .setFixImbalance(true)
  .setOutputLogsPath("logs")
  .setValidationSplit(0.2f) // keep 20% of the data for validation purposes

val pipeline = new Pipeline().setStages(Array(
  features_asm,
  gen_clf
))

val clf_model = pipeline.fit(data)
val features_asm = new legal.FeaturesAssembler()
  .setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
  .setOutputCol("features")

val gen_clf = new legal.GenericClassifierApproach()
  .setLabelColumn("target")
  .setInputCols("features")
  .setOutputCol("prediction")
  .setModelFile("/path/to/graph_file.pb")
  .setEpochsNumber(50)
  .setBatchSize(100)
  .setFeatureScaling("zscore")
  .setLearningRate(0.001f)
  .setFixImbalance(true)
  .setOutputLogsPath("logs")
  .setValidationSplit(0.2f) // keep 20% of the data for validation purposes

val pipeline = new Pipeline().setStages(Array(
  features_asm,
  gen_clf
))

val clf_model = pipeline.fit(data)

IOBTagger

Model

Merges token tags and NER labels from chunks in the specified format. For example, the output columns of NerConverter and Tokenizer can be used as inputs to the merge.

Input Annotator Types: TOKEN, CHUNK

Output Annotator Type: NAMED_ENTITY

Python API: IOBTagger Scala API: IOBTagger
Show Example

from johnsnowlabs import * 
# Pipeline stages are defined where NER is done. NER is converted to chunks.
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(["sentence", "token"]).setOutputCol("embs")
nerModel = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models").setInputCols(["sentence", "token", "embs"]).setOutputCol("ner")
nerConverter = nlp.NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = medical.IOBTagger().setInputCols(["token", "ner_chunk"]).setOutputCol("ner_label")
pipeline = Pipeline(stages=[docAssembler, sentenceDetector, tokenizer, embeddings, nerModel, nerConverter, iobTagger])

result.selectExpr("explode(ner_label) as a") \
  .selectExpr("a.begin","a.end","a.result as chunk","a.metadata.word as word") \
  .where("chunk!='O'").show(5, False)

+-----+---+-----------+-----------+
|begin|end|chunk      |word       |
+-----+---+-----------+-----------+
|5    |15 |B-Age      |63-year-old|
|17   |19 |B-Gender   |man        |
|64   |72 |B-Modifier |recurrent  |
|98   |107|B-Diagnosis|cellulitis |
|110  |119|B-Diagnosis|pneumonias |
+-----+---+-----------+-----------+
from johnsnowlabs import * 
# Pipeline stages are defined where NER is done. NER is converted to chunks.

docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(["sentence", "token"]).setOutputCol("embs")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models").setInputCols(["sentence", "token", "embs"]).setOutputCol("ner")
nerConverter = nlp.NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = finance.IOBTagger().setInputCols(["token", "ner_chunk"]).setOutputCol("ner_label")
pipeline = Pipeline(stages=[docAssembler, sentenceDetector, tokenizer, embeddings, ner_model, nerConverter, iobTagger])
from johnsnowlabs import * 
# Pipeline stages are defined where NER is done. NER is converted to chunks.

docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(["sentence", "token"]).setOutputCol("embs")
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models").setInputCols(["sentence", "token", "embs"]).setOutputCol("ner")
nerConverter = nlp.NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = legal.IOBTagger().setInputCols(["token", "ner_chunk"]).setOutputCol("ner_label")
pipeline = Pipeline(stages=[docAssembler, sentenceDetector, tokenizer, embeddings, ner_model, nerConverter, iobTagger])

import spark.implicits._
// Pipeline stages are defined where NER is done. NER is converted to chunks.
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence", "token")).setOutputCol("embs")
val nerModel = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models").setInputCols(Array("sentence", "token", "embs")).setOutputCol("ner")
val nerConverter = new nlp.NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")

// Define the IOB tagger, which needs tokens and chunks as input. Show results.
val iobTagger = new medical.IOBTagger().setInputCols(Array("token", "ner_chunk")).setOutputCol("ner_label")
val pipeline = new Pipeline().setStages(Array(docAssembler, sentenceDetector, tokenizer, embeddings, nerModel, nerConverter, iobTagger))

result.selectExpr("explode(ner_label) as a")
  .selectExpr("a.begin","a.end","a.result as chunk","a.metadata.word as word")
  .where("chunk!='O'").show(5, false)

+-----+---+-----------+-----------+
|begin|end|chunk      |word       |
+-----+---+-----------+-----------+
|5    |15 |B-Age      |63-year-old|
|17   |19 |B-Gender   |man        |
|64   |72 |B-Modifier |recurrent  |
|98   |107|B-Diagnosis|cellulitis |
|110  |119|B-Diagnosis|pneumonias |
+-----+---+-----------+-----------+
// Pipeline stages are defined where NER is done. NER is converted to chunks.
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence", "token")).setOutputCol("embs")
val ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models").setInputCols(Array("sentence", "token", "embs")).setOutputCol("ner")
val nerConverter = new nlp.NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")

// Define the IOB tagger, which needs tokens and chunks as input. Show results.
val iobTagger = new finance.IOBTagger().setInputCols(Array("token", "ner_chunk")).setOutputCol("ner_label")
val pipeline = new Pipeline().setStages(Array(docAssembler, sentenceDetector, tokenizer, embeddings, ner_model, nerConverter, iobTagger))
// Pipeline stages are defined where NER is done. NER is converted to chunks.
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence", "token")).setOutputCol("embs")
val ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models").setInputCols(Array("sentence", "token", "embs")).setOutputCol("ner")
val nerConverter = new nlp.NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")

// Define the IOB tagger, which needs tokens and chunks as input. Show results.
val iobTagger = new legal.IOBTagger().setInputCols(Array("token", "ner_chunk")).setOutputCol("ner_label")
val pipeline = new Pipeline().setStages(Array(docAssembler, sentenceDetector, tokenizer, embeddings, ner_model, nerConverter, iobTagger))

NerChunker

Model

Extracts phrases that fit into a known pattern using the NER tags. Useful for entity groups with neighboring tokens when there is no pretrained NER model to address certain issues. A regex needs to be provided to extract the tokens between entities.
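The core of the configuration is the regex over NER labels; a minimal sketch (the entity names in the pattern are placeholders for labels produced by your NER model):

from johnsnowlabs import *
# Sketch: build one chunk spanning from an <ENTITY_A> tag to an <ENTITY_B> tag
# within the same sentence (the entity names are placeholders).
chunker = medical.NerChunker() \
  .setInputCols(["sentence", "ner"]) \
  .setOutputCol("ner_chunk") \
  .setRegexParsers(["<ENTITY_A>.*<ENTITY_B>"])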

Input Annotator Types: DOCUMENT, NAMED_ENTITY

Output Annotator Type: CHUNK

Python API: NerChunker Scala API: NerChunker
Show Example

from johnsnowlabs import * 
# Defining pipeline stages for NER
data= spark.createDataFrame([["She has cystic cyst on her kidney."]]).toDF("text")

documentAssembler= nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector= nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence") \
  .setUseAbbreviations(False)

tokenizer= nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence","token"]) \
  .setOutputCol("embeddings") \
  .setCaseSensitive(False)

ner = medical.NerModel.pretrained("ner_radiology", "en", "clinical/models") \
  .setInputCols(["sentence","token","embeddings"]) \
  .setOutputCol("ner") \
  .setIncludeConfidence(True)

# Define the NerChunker to combine to chunks
chunker = medical.NerChunker() \
  .setInputCols(["sentence","ner"]) \
  .setOutputCol("ner_chunk") \
  .setRegexParsers(["<ImagingFindings>.*<BodyPart>"])

pipeline= Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner,
  chunker
])

result = pipeline.fit(data).transform(data)

# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")
  .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)
+------+-----------------+
|word  |ner              |
+------+-----------------+
|She   |O                |
|has   |O                |
|cystic|B-ImagingFindings|
|cyst  |I-ImagingFindings|
|on    |O                |
|her   |O                |
|kidney|B-BodyPart       |
|.     |O                |
+------+-----------------+

result.select("ner_chunk.result").show(truncate=False)
+---------------------------+
|result                     |
+---------------------------+
|[cystic cyst on her kidney]|
+---------------------------+
from johnsnowlabs import * 
# Defining pipeline stages for NER


documentAssembler= nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector= nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence") \
  .setUseAbbreviations(False)

tokenizer= nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence","token"]) \
  .setOutputCol("embeddings") \
  .setCaseSensitive(False)

ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")

# Define the NerChunker to combine to chunks
chunker = finance.NerChunker() \
  .setInputCols(["sentence","ner"]) \
  .setOutputCol("ner_chunk") \
  .setRegexParsers(["<ImagingFindings>.*<BodyPart>"])

pipeline= Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
])
from johnsnowlabs import * 
# Defining pipeline stages for NER


documentAssembler= nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

sentenceDetector= nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence") \
  .setUseAbbreviations(False)

tokenizer= nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
  .setInputCols(["sentence","token"]) \
  .setOutputCol("embeddings") \
  .setCaseSensitive(False)

ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")

# Define the NerChunker to combine to chunks
chunker = legal.NerChunker() \
  .setInputCols(["sentence","ner"]) \
  .setOutputCol("ner_chunk") \
  .setRegexParsers(["<ImagingFindings>.*<BodyPart>"])

pipeline= Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
])

import spark.implicits._
// Defining pipeline stages for NER
val data= Seq("She has cystic cyst on her kidney.").toDF("text")

val documentAssembler=new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector=new nlp.SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setUseAbbreviations(false)

val tokenizer=new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence","token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val ner = medical.NerModel.pretrained("ner_radiology", "en", "clinical/models")
  .setInputCols(Array("sentence","token","embeddings"))
  .setOutputCol("ner")
  .setIncludeConfidence(true)

// Define the NerChunker to combine to chunks
val chunker = new medical.NerChunker()
  .setInputCols(Array("sentence","ner"))
  .setOutputCol("ner_chunk")
  .setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))

val pipeline=new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner,
  chunker
))

val result = pipeline.fit(data).transform(data)

// Show results:
//
// result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")
//   .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=false)
// +------+-----------------+
// |word  |ner              |
// +------+-----------------+
// |She   |O                |
// |has   |O                |
// |cystic|B-ImagingFindings|
// |cyst  |I-ImagingFindings|
// |on    |O                |
// |her   |O                |
// |kidney|B-BodyPart       |
// |.     |O                |
// +------+-----------------+
// result.select("ner_chunk.result").show(truncate=false)
// +---------------------------+
// |result                     |
// +---------------------------+
// |[cystic cyst on her kidney]|
// +---------------------------+
//
// Defining pipeline stages for NER
val documentAssembler=new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector=new nlp.SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setUseAbbreviations(false)

val tokenizer=new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence","token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

// Define the NerChunker to combine to chunks
val chunker = new finance.NerChunker()
  .setInputCols(Array("sentence","ner"))
  .setOutputCol("ner_chunk")
  .setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))

val pipeline=new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
))
// Defining pipeline stages for NER
val documentAssembler=new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector=new nlp.SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setUseAbbreviations(false)

val tokenizer=new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence","token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

// Define the NerChunker to combine to chunks
val chunker = new legal.NerChunker()
  .setInputCols(Array("sentence","ner"))
  .setOutputCol("ner_chunk")
  .setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))

val pipeline=new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  ner_model,
  chunker
))

NerConverterInternal

Model

Converts an IOB or IOB2 representation of NER to a user-friendly one by associating the tokens of recognized entities with their labels. Chunks with no associated entity (tagged “O”) are filtered out.

This licensed annotator adds extra functionality to the open-source version with the following parameters: blackList, greedyMode, threshold, and ignoreStopWords, which are not available in the NerConverter annotator.

See also Inside–outside–beginning (tagging) for more information.
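A minimal sketch of those licensed-only options (the blackList entries and the threshold value are illustrative assumptions):

from johnsnowlabs import *
# Sketch: drop Gender chunks and any chunk whose confidence is below 0.7
# (label name and threshold are illustrative).
ner_converter = medical.NerConverterInternal() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk") \
  .setBlackList(["Gender"]) \
  .setThreshold(0.7)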

Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotator Type: CHUNK

Python API: NerConverterInternal Scala API: NerConverterInternal
Show Example

from johnsnowlabs import * 

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") 

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")

jsl_ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("jsl_ner_chunk")

jsl_ner_converter_internal = medical.NerConverterInternal()\
    .setInputCols(["sentence","token","jsl_ner"])\
    .setOutputCol("replaced_ner_chunk")\
    .setReplaceDictResource("replace_dict.csv","text", {"delimiter":","})
      
nlpPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    jsl_ner_converter,
    jsl_ner_converter_internal
    ])

result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import * 

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector =  nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
    #.setCustomBounds(["\n\n"])

tokenizer =  nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings =  nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

fin_ner = finance.NerModel.pretrained("finner_deid", "en", "finance/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = finance.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"

nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      fin_ner,
      ner_converter])

result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import * 

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
    #.setCustomBounds(["\n\n"])

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner") 
    #.setLabelCasing("upper")

ner_converter = legal.NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")\
    .setReplaceLabels({"ALIAS": "PARTY"}) # "ALIAS" are secondary names of companies, so let's extract them also as PARTY

nlpPipeline = Pipeline(stages=[
      documentAssembler, 
      sentenceDetector,
      tokenizer,
      embeddings,
      legal_ner,
      ner_converter])

result = nlpPipeline.fit(data).transform(data)

val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = nlp.SentenceDetectorDLModel
    .pretrained("sentence_detector_dl_healthcare","en","clinical/models") 
    .setInputCols("document") 
    .setOutputCol("sentence") 


val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

 
val word_embeddings = nlp.WordEmbeddingsModel
   .pretrained("embeddings_clinical", "en","clinical/models")
   .setInputCols(Array("sentence", "token"))
   .setOutputCol("embeddings")


val jsl_ner = medical.NerModel
    .pretrained("ner_jsl", "en", "clinical/models") 
    .setInputCols(Array("sentence", "token","embeddings")) 
    .setOutputCol("jsl_ner")


val jsl_ner_converter = new nlp.NerConverter() 
    .setInputCols(Array("sentence", "token", "jsl_ner")) 
    .setOutputCol("jsl_ner_chunk")

val jsl_ner_converter_internal = new medical.NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "jsl_ner")) 
    .setOutputCol("replaced_ner_chunk")
    .setReplaceDictResource("replace_dict.csv", "text", Map("delimiter" -> ","))


val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  word_embeddings,
  jsl_ner,
  jsl_ner_converter,
  jsl_ner_converter_internal

))

val result = pipeline.fit(data).transform(data)
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = nlp.SentenceDetectorDLModel
    .pretrained("sentence_detector_dl","xx") 
    .setInputCols("document")
    .setOutputCol("sentence") 


val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

 
val embeddings = nlp.RoBertaEmbeddings
   .pretrained("roberta_embeddings_legal_roberta_base", "en")
   .setInputCols(Array("sentence", "token"))
   .setOutputCol("embeddings")


val fin_ner = finance.NerModel
    .pretrained("finner_deid", "en", "finance/models") 
    .setInputCols(Array("sentence", "token","embeddings")) 
    .setOutputCol("ner")


val ner_converter = new finance.NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")
    .setReplaceLabels({"ORG": "PARTY"}) 


val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  fin_ner,
  ner_converter
))

val result = pipeline.fit(data).transform(data)
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = nlp.SentenceDetectorDLModel
    .pretrained("sentence_detector_dl","xx") 
    .setInputCols("document") 
    .setOutputCol("sentence") 


val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

 
val embeddings = nlp.RoBertaEmbeddings
   .pretrained("roberta_embeddings_legal_roberta_base", "en")
   .setInputCols(Array("sentence", "token"))
   .setOutputCol("embeddings")


val legal_ner = legal.NerModel
    .pretrained("legner_contract_doc_parties", "en", "legal/models") 
    .setInputCols(Array("sentence", "token","embeddings")) 
    .setOutputCol("ner")


val ner_converter = new legal.NerConverterInternal() 
    .setInputCols(Array("sentence", "token", "ner")) 
    .setOutputCol("ner_chunk")
    .setReplaceLabels({"ALIAS": "PARTY"})


val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  legal_ner,
  ner_converter
))

val result = pipeline.fit(data).transform(data)

NerDisambiguator

Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. This is the instantiated / pretrained model of the NerDisambiguator.

Input Annotator Types: CHUNK, SENTENCE_EMBEDDINGS

Output Annotator Type: DISAMBIGUATION

Python API: NerDisambiguatorModel Scala API: NerDisambiguatorModel

Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. The model needs extracted CHUNKS and SENTENCE_EMBEDDINGS type input from e.g. SentenceEmbeddings and NerConverter.

Input Annotator Types: CHUNK, SENTENCE_EMBEDDINGS

Output Annotator Type: DISAMBIGUATION

Python API: NerDisambiguator Scala API: NerDisambiguator
Show Example

from johnsnowlabs import *
# Extracting Person identities
# First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.
data = spark.createDataFrame([["The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."]]) \
  .toDF("text")
documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
  .setInputCols(["sentence","embeddings"]) \
  .setOutputCol("sentence_embeddings")
ner_model = nlp.NerDLModel.pretrained() \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")
ner_converter = nlp.NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk") \
  .setWhiteList(["PER"])

# Then the extracted entities can be disambiguated.
disambiguator = medical.NerDisambiguator() \
  .setS3KnowledgeBaseName("i-per") \
  .setInputCols(["ner_chunk", "sentence_embeddings"]) \
  .setOutputCol("disambiguation") \
  .setNumFirstChars(5)

nlpPipeline = Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  word_embeddings,
  sentence_embeddings,
  ner_model,
  ner_converter,
  disambiguator])

model = nlpPipeline.fit(data)
result = model.transform(data)

# Show results
result.selectExpr("explode(disambiguation)")
  .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, False)
+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk             |result                                                                                                                  |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump      |https://en.wikipedia.org/?curid=4848272, https://en.wikipedia.org/?curid=31698421, https://en.wikipedia.org/?curid=55907961|
|Christina Aguilera|https://en.wikipedia.org/?curid=144171, https://en.wikipedia.org/?curid=6636454                                            |
+------------------+------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import *
# Extracting Person identities
# First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.

documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
  .setInputCols(["sentence","embeddings"]) \
  .setOutputCol("sentence_embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")
ner_converter = nlp.NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk") \
  .setWhiteList(["PER"])

# Then the extracted entities can be disambiguated.
# .setS3KnowledgeBaseName("i-per") can optionally be added to the chain below to use a pretrained knowledge base
disambiguator = finance.NerDisambiguator() \
  .setInputCols(["ner_chunk", "sentence_embeddings"]) \
  .setOutputCol("disambiguation") \
  .setNumFirstChars(5)

nlpPipeline = Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  word_embeddings,
  sentence_embeddings,
  ner_model,
  ner_converter,
  disambiguator])
from johnsnowlabs import *
# Extracting Person identities
# First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.

documentAssembler = nlp.DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
  .setInputCols(["document"]) \
  .setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained() \
  .setInputCols(["sentence", "token"]) \
  .setOutputCol("embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
  .setInputCols(["sentence","embeddings"]) \
  .setOutputCol("sentence_embeddings")
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setOutputCol("ner")
ner_converter = nlp.NerConverter() \
  .setInputCols(["sentence", "token", "ner"]) \
  .setOutputCol("ner_chunk") \
  .setWhiteList(["PER"])

# Then the extracted entities can be disambiguated.
# .setS3KnowledgeBaseName("i-per") can optionally be added to the chain below to use a pretrained knowledge base
disambiguator = legal.NerDisambiguator() \
  .setInputCols(["ner_chunk", "sentence_embeddings"]) \
  .setOutputCol("disambiguation") \
  .setNumFirstChars(5)

nlpPipeline = Pipeline(stages=[
  documentAssembler,
  sentenceDetector,
  tokenizer,
  word_embeddings,
  sentence_embeddings,
  ner_model,
  ner_converter,
  disambiguator])

from johnsnowlabs import * 
// Extracting Person identities
// First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.
val data = Seq("The show also had a contestant named Donald Trump who later defeated Christina Aguilera ...")
  .toDF("text")
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
val word_embeddings = nlp.WordEmbeddingsModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
val sentence_embeddings = new nlp.SentenceEmbeddings()
  .setInputCols(Array("sentence","embeddings"))
  .setOutputCol("sentence_embeddings")
val ner_model = nlp.NerDLModel.pretrained()
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
val ner_converter = new nlp.NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setWhiteList("PER")

// Then the extracted entities can be disambiguated.
val disambiguator = new medical.NerDisambiguator()
  .setS3KnowledgeBaseName("i-per")
  .setInputCols(Array("ner_chunk", "sentence_embeddings"))
  .setOutputCol("disambiguation")
  .setNumFirstChars(5)

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  word_embeddings,
  sentence_embeddings,
  ner_model,
  ner_converter,
  disambiguator))

val model = nlpPipeline.fit(data)
val result = model.transform(data)

// Show results
//
// result.selectExpr("explode(disambiguation)")
//   .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, false)
// +------------------+------------------------------------------------------------------------------------------------------------------------+
// |chunk             |result                                                                                                                  |
// +------------------+------------------------------------------------------------------------------------------------------------------------+
// |Donald Trump      |https://en.wikipedia.org/?curid=4848272, https://en.wikipedia.org/?curid=31698421, https://en.wikipedia.org/?curid=55907961|
// |Christina Aguilera|https://en.wikipedia.org/?curid=144171, https://en.wikipedia.org/?curid=6636454                                           |
// +------------------+------------------------------------------------------------------------------------------------------------------------+
//
from johnsnowlabs import * 
// Extracting Person identities
// First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.
val data = Seq("The show also had a contestant named Donald Trump who later defeated Christina Aguilera ...")
  .toDF("text")
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
val word_embeddings = nlp.WordEmbeddingsModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
val sentence_embeddings = new nlp.SentenceEmbeddings()
  .setInputCols(Array("sentence","embeddings"))
  .setOutputCol("sentence_embeddings")
val ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
val ner_converter = new nlp.NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setWhiteList("PER")

// Then the extracted entities can be disambiguated.
val disambiguator = new finance.NerDisambiguator()
  // Optionally: .setS3KnowledgeBaseName("i-per")
  .setInputCols(Array("ner_chunk", "sentence_embeddings"))
  .setOutputCol("disambiguation")
  .setNumFirstChars(5)

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  word_embeddings,
  sentence_embeddings,
  ner_model,
  ner_converter,
  disambiguator))
from johnsnowlabs import * 
// Extracting Person identities
// First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.
val data = Seq("The show also had a contestant named Donald Trump who later defeated Christina Aguilera ...")
  .toDF("text")
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
val word_embeddings = nlp.WordEmbeddingsModel.pretrained()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
val sentence_embeddings = new nlp.SentenceEmbeddings()
  .setInputCols(Array("sentence","embeddings"))
  .setOutputCol("sentence_embeddings")
val ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
val ner_converter = new nlp.NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")
  .setWhiteList("PER")

// Then the extracted entities can be disambiguated.
val disambiguator = new legal.NerDisambiguator()
  // Optionally: .setS3KnowledgeBaseName("i-per")
  .setInputCols(Array("ner_chunk", "sentence_embeddings"))
  .setOutputCol("disambiguation")
  .setNumFirstChars(5)

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  word_embeddings,
  sentence_embeddings,
  ner_model,
  ner_converter,
  disambiguator))

NerModel

This Named Entity Recognition annotator is a generic NER model based on neural networks.

Pretrained models can be loaded with the pretrained method of the companion object:

val nerModel = medical.NerModel.pretrained()
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")

The default model is "ner_clinical", if no name is provided.

For available pretrained models please see the Models Hub. Additionally, pretrained pipelines are available for this module, see Pipelines.

Note that some pretrained models require specific types of embeddings, depending on which they were trained on. For example, the default model "ner_clinical" requires the WordEmbeddings "embeddings_clinical".
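For instance, a minimal sketch of that pairing (assuming an active session and the clinical models named above) looks like this:

# "ner_clinical" was trained on "embeddings_clinical", so both stages must be used together.
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")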

For extended examples of usage, see the Spark NLP Workshop (sections starting with Training a Clinical NER).

Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotator Type: NAMED_ENTITY

Python API: MedicalNerModel Scala API: MedicalNerModel
Show Example

from johnsnowlabs import * 

documentAssembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") 

tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")


jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")
    
jsl_ner_converter = nlp.NerConverter() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("jsl_ner_chunk")

jsl_ner_pipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    jsl_ner,
    jsl_ner_converter])
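# Sample input for illustration (assumed; any DataFrame with a "text" column and an
# active SparkSession named `spark` will do):
data = spark.createDataFrame(
    [["The patient was admitted with chest pain and a history of type 2 diabetes mellitus."]]
).toDF("text")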

result = jsl_ner_pipeline.fit(data).transform(data)
from johnsnowlabs import * 

documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = finance.NerModel.pretrained("finner_headers", "en", "finance/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter])
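# Sample input for illustration (assumed; a "text" column containing filing text is expected):
data = spark.createDataFrame(
    [["ITEM 1. BUSINESS. The following discussion summarizes the Company's principal operations."]]
).toDF("text")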



result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import * 

documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")
        
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = legal.NerModel.pretrained("legner_headers", "en", "legal/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

ner_converter = nlp.NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        ner_model,
        ner_converter])
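# Sample input for illustration (assumed; a "text" column containing agreement text is expected):
data = spark.createDataFrame(
    [["2. DEFINITIONS. In this Agreement, the following terms shall have the meanings set forth below."]]
).toDF("text")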



result = nlpPipeline.fit(data).transform(data)

from johnsnowlabs import * 
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = nlp.SentenceDetectorDLModel
    .pretrained("sentence_detector_dl_healthcare","en","clinical/models") 
    .setInputCols("document") 
    .setOutputCol("sentence") 

val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val word_embeddings = nlp.WordEmbeddingsModel
   .pretrained("embeddings_clinical", "en", "clinical/models")
   .setInputCols(Array("sentence", "token"))
   .setOutputCol("embeddings")

val jsl_ner = medical.NerModel
    .pretrained("ner_jsl", "en", "clinical/models") 
    .setInputCols(Array("sentence", "token","embeddings")) 
    .setOutputCol("jsl_ner")

val jsl_ner_converter = new nlp.NerConverter() 
    .setInputCols(Array("sentence", "token", "jsl_ner")) 
    .setOutputCol("jsl_ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  word_embeddings,
  jsl_ner,
  jsl_ner_converter
))

val result = pipeline.fit(data).transform(data)
from johnsnowlabs import * 
val documentAssembler = new nlp.DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = nlp.SentenceDetectorDLModel
    .pretrained("sentence_detector_dl","xx") 
    .setInputCols("document") 
    .setOutputCol("sentence") 


val tokenizer = new nlp.Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

 
val embeddings = nlp.BertEmbeddings
   .pretrained("bert_embeddings_sec_bert_base", "en")
   .setInputCols(Array("sentence", "token"))
   .setOutputCol("embeddings")


val ner_model = finance.NerModel
    .pretrained