A Spark NLP Enterprise license includes access to unique annotators. At the Spark NLP Workshop you can see different types of annotators in action.
By clicking on any annotator, you will see different sections:
- The Approach, or class to train models.
- The Model, to infer using pretrained models.
Also, for most of the annotators, you will find examples for the different enterprise libraries:
- Healthcare NLP
- Finance NLP
- Legal NLP
Check out the Spark NLP Annotators page for more information on how to read this page.
Available Annotators
Annotators | Description |
---|---|
AnnotationMerger | Merge annotations from different pipeline steps that have the same annotation type into a unified annotation. |
AssertionChunkConverter | The AssertionChunkConverter annotator uses both the begin and end indices of the tokens as input to add more robust metadata to the chunk column in a way that improves the reliability of the indices and avoids loss of data. |
AssertionDL | AssertionDL is a deep learning-based approach used to extract Assertion Status from extracted entities and text. |
AssertionFilterer | Filters entities coming from ASSERTION type annotations and returns the CHUNKS. |
AssertionLogReg | Logistic Regression is used to extract Assertion Status from extracted entities and text. |
AssertionMerger | Merges assertion columns coming from a variety of Assertion annotators. |
AverageEmbeddings | Computes the mean of vector embeddings for two sentences of equal size, producing a unified representation |
BertForSequenceClassification | Can load Bert Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. |
BertForTokenClassifier | Can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) for Named-Entity-Recognition (NER) tasks. |
BertSentenceChunkEmbeddings | This annotator combines sentence and NER chunk embeddings to enhance resolution codes, leveraging contextual information in the embeddings for more precise results. It takes sentence context and NER chunks as input and produces embeddings for each chunk, facilitating input for the resolution model. |
Chunk2Token | A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN). |
ChunkConverter | This annotator merges NER-detected entities with RegexMatcher-based rules for unified processing in the pipeline. |
ChunkEntityResolver | Returns a normalized entity for a particular trained ontology / curated dataset (e.g. clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc). |
ChunkFilterer | Filters entities coming from CHUNK annotations. |
ChunkKeyPhraseExtraction | Uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text. |
ChunkMapper | Maps entities to their associated code/reference based on pre-defined dictionaries. |
ChunkMapperFilterer | Annotator to be used after ChunkMapper that allows filtering chunks based on the results of the mapping, whether it was successful or failed. |
ChunkMerge | Merges entities coming from different CHUNK annotations. |
ChunkSentenceSplitter | Splits documents into chunks according to separators given as CHUNK columns. It is useful when you need to apply different models or analyses to different sections of your document. |
ContextualAssertion | This model identifies contextual cues within text data, such as negation, uncertainty, etc. It annotates text chunks with assertions based on configurable rules, prefix and suffix patterns, and exception patterns. |
ContextualEntityFilterer | ContextualEntityFilterer can filter chunks coming from CHUNK annotations based on entity (identifier, field) info in metadata and contextual cues. |
ContextualEntityRuler | ContextualEntityRuler is an annotator that updates chunks based on contextual rules. |
ContextualParser | Extracts entities from a document based on user-defined rules. |
ContextSplitAssembler | Converts and assembles VECTOR_SIMILARITY_RANKINGS type annotations into DOCUMENT type. |
DateNormalizer | This annotator transforms date mentions to a common standard format: YYYY/MM/DD. It is useful when using data from different sources, sometimes from different countries, that have different formats for representing dates. |
DeIdentification | Deidentifies Input Annotations of types DOCUMENT, TOKEN and CHUNK, by either masking or obfuscating the given CHUNKS. |
DistilBertForSequenceClassification | Can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. |
Doc2ChunkInternal | Converts DOCUMENT, TOKEN typed annotations into CHUNK type with the contents of a chunkCol. |
DocMapper | Uses the text representation of document annotations to map clinical codes to other codes or relevant information. |
DocumentFiltererByClassifier | This annotator filters documents based on classifier results. It uses white and black lists, allowing or blocking specific outcomes, and can be case-sensitive or case-insensitive for broader matching. |
DocumentFiltererByNER | This annotator returns sentences containing the entity chunks you have filtered, allowing you to see only the sentences with the entities you want. |
DocumentHashCoder | This annotator swaps dates in a document column with hash codes from another column, creating a new column with shifted day information. The subsequent DeIdentification annotator anonymizes the document, incorporating the altered dates. |
DocumentLogRegClassifier | Classifies documents with a Logistic Regression algorithm. |
DocumentMLClassifier | Classifies documents with a Logistic Regression algorithm. |
DrugNormalizer | Annotator which normalizes raw text from documents, e.g. scraped web pages or xml documents. |
EntityChunkEmbeddings | Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector representation of related entity chunks. |
EntityRulerInternal | This annotator matches exact strings or regex patterns provided in a file against a Document and assigns them a named entity. |
FeaturesAssembler | Collects features from different columns. |
FewShotClassifier | This annotator specifically targets few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data. |
FewShotAssertionClassifierModel | Performs assertion classification using large (LLM-based) few-shot classifiers based on the SetFit approach. |
FewShotAssertionSentenceClassifier | A utility annotator that is used in some datasets to train a new FewShotAssertionClassifierModel. |
Flattener | Flattener annotator in Spark NLP converts annotation results into a simplified DataFrame format for easier analysis and interpretation. |
GenericClassifier | Creates a generic single-label classifier which uses pre-generated Tensorflow graphs. |
GenericLogRegClassifier | A derivative of GenericClassifier which implements a multinomial logistic regression. |
GenericSVMClassifier | Creates a generic single-label classifier which uses pre-generated Tensorflow graphs. |
InternalDocumentSplitter | This annotator splits large documents into small documents. |
IOBTagger | Merges token tags and NER labels from chunks in the specified format. |
LargeFewShotClassifier | This annotator is designed to work effectively with minimal labeled data, offering flexibility and adaptability to new, unseen classes. |
LightDeIdentification | Light version of DeIdentification. |
LLMLoader | LLMLoader is designed to interact with LLMs that are converted into GGUF format. This module allows using John Snow Labs’ licensed LLMs at various sizes that are fine-tuned on medical context for certain tasks. |
Mapper2Chunk | This annotator converts ‘LABELED_DEPENDENCY’ type annotations coming from ChunkMapper into ‘CHUNK’ type to create a new chunk-type column. |
MedicalLLM | MedicalLLM was designed to load and run large language models (LLMs) in GGUF format with scalable performance. |
MultiChunk2Doc | Merges given chunks to create a document. |
NameChunkObfuscator | This annotator allows transforming a dataset with an input annotation of type CHUNK into its obfuscated version by obfuscating the given CHUNKS. |
NerChunker | Extracts phrases that fit into a known pattern using the NER tags. |
NerConverterInternal | Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label. |
NerDisambiguator | Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). |
NerModel | This Named Entity recognition annotator is a generic NER model based on Neural Networks. |
NerQuestionGenerator | This annotator takes an NER chunk (obtained by, e.g., NerConverterInternal) and generates a question based on two entity types, a pronoun and a strategy. |
QuestionAnswering | GPT-based model for answering questions given a context. |
PretrainedZeroShotNER | Makes it easy to identify specific entities in text without needing pre-labeled datasets. |
RegexMatcherInternal | RegexMatcherInternal matches predefined regex patterns with entities in text, allowing for flexible entity recognition based on user-defined rules. |
ReIdentification | Reidentifies obfuscated entities by DeIdentification. |
RelationExtraction | Extracts and classifies instances of relations between named entities. |
RelationExtractionDL | Extracts and classifies instances of relations between named entities. |
REChunkMerger | Merges relation chunks to create a new chunk. |
RENerChunksFilter | Filters and outputs combinations of relations between extracted entities, for further processing. |
Replacer | This annotator allows replacing entities in the original text with the ones extracted by the annotators NameChunkObfuscatorApproach or DateNormalizer. |
Resolution2Chunk | This annotator is responsible for converting the annotations generated by entity resolver models (typically labeled as ENTITY) into a format compatible with subsequent stages of the pipeline, such as the ChunkMapperModel. |
ResolverMerger | This annotator provides the ability to merge sentence entity resolver and chunk mapper model output columns. |
Router | This annotator provides the ability to split the output of an annotator by a selected metadata field and the value for that field. |
SentenceEntityResolver | Returns the normalized entity for a particular trained ontology / curated dataset (e.g. clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc) based on sentence embeddings. |
StructuredJsonConverter | StructuredJsonConverter is a transformer that converts the output of the pipeline into a structured JSON format. |
Summarizer | Helps to quickly summarize complex medical information. |
TextGenerator | Uses the basic BioGPT model to perform various tasks related to medical text abstraction. |
TextMatcherInternal | This annotator matches exact phrases provided in a file against a Document. |
TFGraphBuilder | Creates Tensorflow graphs. |
VectorDBPostProcessor | VectorDBPostProcessor is used to filter and sort the annotations from the sparknlp_jsl.annotator.resolution.VectorDBModel. |
WindowedSentenceModel | This annotator helps you merge the previous and following sentences of a given piece of text, so that you add the context surrounding them. |
ZeroShotNerModel | This is a zero-shot named entity recognition using RoBertaForQuestionAnswering . It identifies entities across diverse data without domain-specific fine-tuning. |
ZeroShotRelationExtractionModel | This annotator implements zero-shot binary relations extraction by utilizing BERT transformer models trained on the NLI (Natural Language Inference) task. |
AnnotationMerger
Merge annotations from different pipeline steps that have the same annotation type into a unified annotation. Possible annotations that can be merged include:
- document (e.g., output of DocumentAssembler annotator)
- token (e.g., output of Tokenizer annotator)
- word_embeddings (e.g., output of WordEmbeddingsModel annotator)
- sentence_embeddings (e.g., output of BertSentenceEmbeddings annotator)
- category (e.g., output of RelationExtractionModel annotator)
- date (e.g., output of DateMatcher annotator)
- sentiment (e.g., output of SentimentDLModel annotator)
- pos (e.g., output of PerceptronModel annotator)
- chunk (e.g., output of NerConverter annotator)
- named_entity (e.g., output of NerDLModel annotator)
- regex (e.g., output of RegexTokenizer annotator)
- dependency (e.g., output of DependencyParserModel annotator)
- language (e.g., output of LanguageDetectorDL annotator)
- keyword (e.g., output of YakeModel annotator)
Parameters:
- inputType: The type of the annotations that you want to merge. Possible values are listed above.
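For instance, a minimal sketch (separate from the full example below) of merging two chunk-type columns; the input column names here are hypothetical and would come from two upstream NER converter stages in your own pipeline:
from johnsnowlabs import medical
# Minimal sketch: merge two CHUNK-type columns produced by separate NER converters.
# "ner_chunk_1" and "ner_chunk_2" are hypothetical column names from upstream stages.
chunk_annotation_merger = medical.AnnotationMerger()\
    .setInputCols(["ner_chunk_1", "ner_chunk_2"])\
    .setInputType("chunk")\
    .setOutputCol("merged_chunks")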
Input Annotator Types: ANY
Output Annotator Type: ANY
Python API: AnnotationMerger | Scala API: AnnotationMerger | Notebook: AnnotationMerger |
Show Example
from johnsnowlabs import nlp, medical
# Create the pipeline with two RE models
documenter = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = nlp.WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = nlp.PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
pos_ner_tagger = medical.NerModel()\
.pretrained("ner_posology", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_pos")
pos_ner_chunker = medical.NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_pos"])\
.setOutputCol("pos_ner_chunks")
dependency_parser = nlp.DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
pos_reModel = medical.RelationExtractionModel()\
.pretrained("posology_re")\
.setInputCols(["embeddings", "pos_tags", "pos_ner_chunks", "dependencies"])\
.setOutputCol("pos_relations")\
.setMaxSyntacticDistance(4)
ade_ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ade_ner_tags")
ade_ner_chunker = medical.NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ade_ner_tags"])\
.setOutputCol("ade_ner_chunks")
ade_reModel = medical.RelationExtractionModel()\
.pretrained("re_ade_clinical", "en", 'clinical/models')\
.setInputCols(["embeddings", "pos_tags", "ade_ner_chunks", "dependencies"])\
.setOutputCol("ade_relations")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["drug-ade, ade-drug"])
annotation_merger = medical.AnnotationMerger()\
.setInputCols("ade_relations", "pos_relations")\
.setInputType("category")\
.setOutputCol("all_relations")
merger_pipeline = nlp.Pipeline(stages=[
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
pos_ner_tagger,
pos_ner_chunker,
dependency_parser,
pos_reModel,
ade_ner_tagger,
ade_ner_chunker,
ade_reModel,
annotation_merger
])
# Show example result
text = """
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands..
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = merger_pipeline.fit(data).transform(data)
result.selectExpr("pos_relations.result as PosologyRelation",
"ade_relations.result as AdeRelation",
"all_relations.result as MergedRelation").show(truncate=False)
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|PosologyRelation |AdeRelation|MergedRelation |
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|[DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|[1, 1] |[1, 1, DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
from johnsnowlabs import nlp, finance
# Create the pipeline with two RE models
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
text_splitter = finance.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner_dates")
ner_converter_date = nlp.NerConverter()\
.setInputCols(["sentence","token","ner_dates"])\
.setOutputCol("ner_chunk_date")
ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner_orgs")
ner_converter_org = nlp.NerConverter()\
.setInputCols(["sentence","token","ner_orgs"])\
.setOutputCol("ner_chunk_org")\
chunk_merger = finance.ChunkMergeApproach()\
.setInputCols('ner_chunk_org', "ner_chunk_date")\
.setOutputCol('ner_chunk')
pos = nlp.PerceptronModel.pretrained()\
.setInputCols(["sentence", "token"])\
.setOutputCol("pos")
dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
.setInputCols(["sentence", "pos", "token"])\
.setOutputCol("dependencies")
re_filter = finance.RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setRelationPairs(["ORG-ORG", "ORG-DATE"])\
.setMaxSyntacticDistance(10)
reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relations_acq")\
.setPredictionThreshold(0.1)
reDL_alias = finance.RelationExtractionDLModel().pretrained('finre_org_prod_alias', 'en', 'finance/models')\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relations_alias")\
.setPredictionThreshold(0.1)
annotation_merger = finance.AnnotationMerger()\
.setInputCols("relations_acq", "relations_alias")\
.setOutputCol("relations")\
.setInputType("category")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model_date,
ner_converter_date,
ner_model_org,
ner_converter_org,
chunk_merger,
pos,
dependency_parser,
re_filter,
reDL,
reDL_alias,
annotation_merger])
# Show example result
text ="""Definite-lived intangible assets acquired with Cadence’s fiscal 2021 acquisitions were as follows:
Acquisition Date Fair Value
Weighted Average Amortization Period
(In thousands)
(in years)
Existing technology
$
59,100
13.7 years
Agreements and relationships
28,900
13.7 years
Tradenames, trademarks and patents
4,600
14.3 years
Total acquired intangibles with definite lives
$
92,600
13.7 years
2020 Acquisitions
In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation ("AWR") and Integrand Software, Inc. ("Integrand"). These acquisitions enhanced Cadence’s technology portfolio to address growing radio frequency design activity, driven by expanding use of 5G communications.
The aggregate cash consideration for these acquisitions was $195.6 million, after taking into account cash acquired of $1.5 million. The total purchase consideration was allocated to the assets acquired and liabilities assumed based on their respective estimated fair values on the acquisition dates. Cadence will also make payments to certain employees, subject to continued employment and other performance-based conditions, through the first quarter of fiscal 2023.
With its acquisitions of AWR and Integrand, Cadence recorded $101.3 million of definite-lived intangible assets with a weighted average amortization period of approximately nine years. The definite-lived intangible assets related primarily to existing technology and customer agreements and relationships. Cadence also recorded $119.4 million of goodwill and $25.1 million of net liabilities, consisting primarily of deferred tax liabilities, assumed deferred revenue and trade accounts receivable. The recorded goodwill was primarily related to the acquired assembled workforce and expected synergies from combining operations of the acquired companies with Cadence. None of the goodwill related to the acquisitions of AWR and Integrand is deductible for tax purposes.
Cadence completed one additional acquisition during fiscal 2020 that was not material to the consolidated financial statements.
Pro Forma Financial Information
Cadence has not presented pro forma financial information for any of the businesses it acquired during fiscal 2021 and fiscal 2020 because the results of operations for these businesses are not material to Cadence’s consolidated financial statements.
Acquisition-Related Transaction Costs
Transaction costs associated with acquisitions, which consist of professional fees and administrative costs, were not material during fiscal 2021, 2020 or 2019 and were expensed as incurred in Cadence’s consolidated income statements.
NOTE 7. GOODWILL AND ACQUIRED INTANGIBLES
Goodwill
The changes in the carrying amount of goodwill during fiscal 2021 and 2020 were as follows:
Gross CarryingAmount
(In thousands)
Balance as of December 28, 2019
$
661,856
Goodwill resulting from acquisitions
120,564
Effect of foreign currency translation
(333)
Balance as of January 2, 2021
782,087
Goodwill resulting from acquisitions
154,362
Effect of foreign currency translation
(8,091)
Balance as of January 1, 2022
$
928,358
Cadence completed its annual goodwill impairment test during the third quarter of fiscal 2021 and determined that the fair value of Cadence’s single reporting unit exceeded the carrying amount of its net assets and that no impairment existed.
65"""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
# Show the results
result.selectExpr("relations_acq.result as AcqRelation",
"relations_alias.result as AliasRelation",
"relations.result as MergedRelation").show(truncate=False)
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|AcqRelation |AliasRelation |MergedRelation |
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other]|[has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
# Create the pipeline with two RE models
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
text_splitter = legal.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings =nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model_date = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner_parties")
ner_converter_date = nlp.NerConverter()\
.setInputCols(["sentence","token","ner_parties"])\
.setOutputCol("ner_chunk_parties")
ner_model_org= legal.NerModel.pretrained("legner_whereas_md", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner_whereas")
ner_converter_org = nlp.NerConverter()\
.setInputCols(["sentence","token","ner_whereas"])\
.setOutputCol("ner_chunk_whereas")\
chunk_merger = legal.ChunkMergeApproach()\
.setInputCols('ner_chunk_whereas', "ner_chunk_parties")\
.setOutputCol('ner_chunk')
pos = nlp.PerceptronModel.pretrained()\
.setInputCols(["sentence", "token"])\
.setOutputCol("pos")
dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
.setInputCols(["sentence", "pos", "token"])\
.setOutputCol("dependencies")
re_filter = legal.RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setMaxSyntacticDistance(10)
reDL = legal.RelationExtractionDLModel().pretrained("legre_contract_doc_parties_md", "en", "legal/models")\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relations_parties")\
.setPredictionThreshold(0.1)
reDL_alias = legal.RelationExtractionDLModel().pretrained("legre_whereas", "en", "legal/models")\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relations_whereas")\
.setPredictionThreshold(0.1)
annotation_merger = legal.AnnotationMerger()\
.setInputCols("relations_parties", "relations_whereas")\
.setOutputCol("relations")\
.setInputType("category")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model_date,
ner_converter_date,
ner_model_org,
ner_converter_org,
chunk_merger,
pos,
dependency_parser,
re_filter,
reDL,
reDL_alias,
annotation_merger])
# Show example result
text = """
WHEREAS, the Company Entities own certain Copyrights and Know-How which may be used in the Arizona Field, and in connection with the transactions contemplated by the Stock Purchase Agreement, Arizona desires to obtain a license from the Company Entities to use such Intellectual Property on the terms and subject to the conditions set forth herein.
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
# Show the results
result.selectExpr("relations_parties.result as PartiesRelation",
"relations_whereas.result as WhereasRelation",
"relations.result as MergedRelation").show(truncate=False)
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|PartiesRelation |WhereasRelation |MergedRelation |
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|[signed_by, other, signed_by]|[has_subject, has_subject, has_object]|[signed_by, other, signed_by, has_subject, has_subject, has_object]|
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
import spark.implicits._
// Create the pipeline with two RE models
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentences"))
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val pos_ner_tagger = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_pos")
val pos_ner_chunker = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_pos"))
.setOutputCol("pos_ner_chunks")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
val pos_reModel = RelationExtractionModel.pretrained("posology_re")
.setInputCols(Array("embeddings", "pos_tags", "pos_ner_chunks", "dependencies"))
.setOutputCol("pos_relations")
.setMaxSyntacticDistance(4)
val ade_ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ade_ner_tags")
val ade_ner_chunker = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ade_ner_tags"))
.setOutputCol("ade_ner_chunks")
val ade_reModel = RelationExtractionModel.pretrained("re_ade_clinical", "en", "clinical/models")
.setInputCols(Array("embeddings", "pos_tags", "ade_ner_chunks", "dependencies"))
.setOutputCol("ade_relations")
.setMaxSyntacticDistance(10)
.setRelationPairs(Array("drug-ade", "ade-drug"))
val annotation_merger = new AnnotationMerger()
.setInputCols(Array("ade_relations", "pos_relations"))
.setInputType("category")
.setOutputCol("all_relations")
val merger_pipeline = new Pipeline().setStages(Array(
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
pos_ner_tagger,
pos_ner_chunker,
dependency_parser,
pos_reModel,
ade_ner_tagger,
ade_ner_chunker,
ade_reModel,
annotation_merger
))
// Show example result
val text =
"""
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands..
"""
val data = Seq(text).toDF("text")
val result = merger_pipeline.fit(data).transform(data)
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|PosologyRelation |AdeRelation|MergedRelation |
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
|[DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|[1, 1] |[1, 1, DOSAGE-DRUG, DRUG-DURATION, DOSAGE-DRUG, DRUG-FREQUENCY]|
+---------------------------------------------------------+-----------+---------------------------------------------------------------+
import spark.implicits._
// Create the pipeline with two RE models
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val text_splitter = new TextSplitter()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model_date = FinanceNerModel.pretrained("finner_sec_dates", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_dates")
val ner_converter_date = new NerConverter()
.setInputCols(Array("sentence", "token", "ner_dates"))
.setOutputCol("ner_chunk_date")
val ner_model_org = FinanceNerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_orgs")
val ner_converter_org = new NerConverter()
.setInputCols(Array("sentence", "token", "ner_orgs"))
.setOutputCol("ner_chunk_org")
val chunk_merger = new ChunkMergeApproach()
.setInputCols(Array("ner_chunk_org", "ner_chunk_date"))
.setOutputCol("ner_chunk")
val pos = PerceptronModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos", "token"))
.setOutputCol("dependencies")
val re_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunk", "dependencies"))
.setOutputCol("re_ner_chunk")
.setRelationPairs(Array("ORG-ORG", "ORG-DATE"))
.setMaxSyntacticDistance(10)
val reDL = RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")
.setInputCols(Array("re_ner_chunk", "sentence"))
.setOutputCol("relations_acq")
.setPredictionThreshold(0.1)
val reDL_alias = RelationExtractionDLModel.pretrained("finre_org_prod_alias", "en", "finance/models")
.setInputCols(Array("re_ner_chunk", "sentence"))
.setOutputCol("relations_alias")
.setPredictionThreshold(0.1)
val annotation_merger = new AnnotationMerger()
.setInputCols("relations_acq", "relations_alias")
.setOutputCol("relations")
.setInputType("category")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model_date,
ner_converter_date,
ner_model_org,
ner_converter_org,
chunk_merger,
pos,
dependency_parser,
re_filter,
reDL,
reDL_alias,
annotation_merger
))
// Show example result
val text = """
Definite-lived intangible assets acquired with Cadence’s fiscal 2021 acquisitions were as follows:
Acquisition Date Fair Value
Weighted Average Amortization Period
(In thousands)
(in years)
Existing technology
$
59,100
13.7 years
Agreements and relationships
28,900
13.7 years
Tradenames, trademarks and patents
4,600
14.3 years
Total acquired intangibles with definite lives
$
92,600
13.7 years
2020 Acquisitions
In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation ("AWR") and Integrand Software, Inc. ("Integrand"). These acquisitions enhanced Cadence’s technology portfolio to address growing radio frequency design activity, driven by expanding use of 5G communications.
The aggregate cash consideration for these acquisitions was $195.6 million, after taking into account cash acquired of $1.5 million. The total purchase consideration was allocated to the assets acquired and liabilities assumed based on their respective estimated fair values on the acquisition dates. Cadence will also make payments to certain employees, subject to continued employment and other performance-based conditions, through the first quarter of fiscal 2023.
With its acquisitions of AWR and Integrand, Cadence recorded $101.3 million of definite-lived intangible assets with a weighted average amortization period of approximately nine years. The definite-lived intangible assets related primarily to existing technology and customer agreements and relationships. Cadence also recorded $119.4 million of goodwill and $25.1 million of net liabilities, consisting primarily of deferred tax liabilities, assumed deferred revenue and trade accounts receivable. The recorded goodwill was primarily related to the acquired assembled workforce and expected synergies from combining operations of the acquired companies with Cadence. None of the goodwill related to the acquisitions of AWR and Integrand is deductible for tax purposes.
Cadence completed one additional acquisition during fiscal 2020 that was not material to the consolidated financial statements.
Pro Forma Financial Information
Cadence has not presented pro forma financial information for any of the businesses it acquired during fiscal 2021 and fiscal 2020 because the results of operations for these businesses are not material to Cadence’s consolidated financial statements.
Acquisition-Related Transaction Costs
Transaction costs associated with acquisitions, which consist of professional fees and administrative costs, were not material during fiscal 2021, 2020 or 2019 and were expensed as incurred in Cadence’s consolidated income statements.
NOTE 7. GOODWILL AND ACQUIRED INTANGIBLES
Goodwill
The changes in the carrying amount of goodwill during fiscal 2021 and 2020 were as follows:
Gross CarryingAmount
(In thousands)
Balance as of December 28, 2019
$
661,856
Goodwill resulting from acquisitions
120,564
Effect of foreign currency translation
(333)
Balance as of January 2, 2021
782,087
Goodwill resulting from acquisitions
154,362
Effect of foreign currency translation
(8,091)
Balance as of January 1, 2022
$
928,358
Cadence completed its annual goodwill impairment test during the third quarter of fiscal 2021 and determined that the fair value of Cadence’s single reporting unit exceeded the carrying amount of its net assets and that no impairment existed.
65
"""
val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|AcqRelation |AliasRelation |MergedRelation |
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other]|[has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|[has_acquisition_date, was_acquired_by, other, other, other, has_acquisition_date, other, other, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias, has_alias]|
+------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
// Create the pipeline with two RE models
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val text_splitter = new TextSplitter()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model_date = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_parties")
val ner_converter_date = new NerConverter()
.setInputCols(Array("sentence", "token", "ner_parties"))
.setOutputCol("ner_chunk_parties")
val ner_model_org = LegalNerModel.pretrained("legner_whereas_md", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_whereas")
val ner_converter_org = new NerConverter()
.setInputCols(Array("sentence", "token", "ner_whereas"))
.setOutputCol("ner_chunk_whereas")
val chunk_merger = new ChunkMergeApproach()
.setInputCols(Array("ner_chunk_whereas", "ner_chunk_parties"))
.setOutputCol("ner_chunk")
val pos = PerceptronModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos", "token"))
.setOutputCol("dependencies")
val re_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunk", "dependencies"))
.setOutputCol("re_ner_chunk")
.setMaxSyntacticDistance(10)
val reDL = RelationExtractionDLModel.pretrained("legre_contract_doc_parties_md", "en", "legal/models")
.setInputCols(Array("re_ner_chunk", "sentence"))
.setOutputCol("relations_parties")
.setPredictionThreshold(0.1)
val reDL_alias = RelationExtractionDLModel.pretrained("legre_whereas", "en", "legal/models")
.setInputCols(Array("re_ner_chunk", "sentence"))
.setOutputCol("relations_whereas")
.setPredictionThreshold(0.1)
val annotation_merger = new AnnotationMerger()
.setInputCols("relations_parties", "relations_whereas")
.setOutputCol("relations")
.setInputType("category")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model_date,
ner_converter_date,
ner_model_org,
ner_converter_org,
chunk_merger,
pos,
dependency_parser,
re_filter,
reDL,
reDL_alias,
annotation_merger
))
// Show example result
val text = """WHEREAS, the Company Entities own certain Copyrights and Know-How which may be used in the Arizona Field, and in connection with the transactions contemplated by the Stock Purchase Agreement, Arizona desires to obtain a license from the Company Entities to use such Intellectual Property on the terms and subject to the conditions set forth herein.
"""
val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|PartiesRelation |WhereasRelation |MergedRelation |
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
|[signed_by, other, signed_by]|[has_subject, has_subject, has_object]|[signed_by, other, signed_by, has_subject, has_subject, has_object]|
+-----------------------------+--------------------------------------+-------------------------------------------------------------------+
AssertionChunkConverter
This annotator creates a CHUNK column with metadata useful for training an Assertion Status Detection model (see AssertionDL).
In some cases, there may be issues while creating the chunk column when using token indices that can lead to loss of data to train assertion status models.
The AssertionChunkConverter annotator uses both the begin and end indices of the tokens as input to add more robust metadata to the chunk column in a way that improves the reliability of the indices and avoids loss of data.
Parameters:
- chunkBeginCol: (Str) The column containing the start index of the chunk.
- chunkEndCol: (Str) The column containing the end index of the chunk.
- chunkTextCol: (Str) The column containing the text chunk.
- outputTokenBeginCol: (Str) The column containing the selected token start index.
- outputTokenEndCol: (Str) The column containing the selected token end index.
- metadataFields: (Dict) The dictionary of metadata fields to be added to the chunk column.
NOTE: Chunk begin and end indices in the assertion status model training dataframe can be populated using the new version of the ALAB module.
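For illustration, a minimal sketch (not the ALAB module) of preparing such a DataFrame by deriving the character offsets with Python's str.find; it assumes each target string occurs exactly once in its text and uses an exclusive end offset, with the converter resolving the final token boundaries itself:
# Hypothetical (text, target chunk) pairs; replace with your own data.
rows = [("Patient denies any chest pain.", "chest pain")]
prepared = []
for text, target in rows:
    start = text.find(target)               # character offset where the chunk begins
    prepared.append((text, target, start, start + len(target)))
data = spark.createDataFrame(prepared).toDF("text", "target", "char_begin", "char_end")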
Input Annotator Types: TOKEN
Output Annotator Type: CHUNK
Python API: AssertionChunkConverter | Scala API: AssertionChunkConverter | Notebook: AssertionChunkConverterNootebook |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("tokens")
converter = medical.AssertionChunkConverter()\
.setInputCols("tokens")\
.setChunkTextCol("target")\
.setChunkBeginCol("char_begin")\
.setChunkEndCol("char_end")\
.setOutputTokenBeginCol("token_begin")\
.setOutputTokenEndCol("token_end")\
.setOutputCol("chunk")
pipeline = nlp.Pipeline().setStages(
[document_assembler,
sentenceDetector,
tokenizer,
converter]
)
data = spark.createDataFrame([
["An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.","Minnie", 57, 64,],
["After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation","PCP",31,34,],
]).toDF("text", "target", "char_begin", "char_end")
results = pipeline.fit(data).transform(data)
results.selectExpr(
"target",
"char_begin",
"char_end",
"token_begin",
"token_end",
"tokens[token_begin].result",
"tokens[token_end].result",
"target",
"chunk",
).show(truncate=False)
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57 |64 |10 |10 |Minnie |Minnie |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP |31 |34 |5 |5 |PCP |PCP |PCP |[{chunk, 31, 33, PCP, {sentence -> 0}, []}] |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
from johnsnowlabs import nlp, finance
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("tokens")
converter = finance.AssertionChunkConverter()\
.setInputCols("tokens")\
.setChunkTextCol("target")\
.setChunkBeginCol("char_begin")\
.setChunkEndCol("char_end")\
.setOutputTokenBeginCol("token_begin")\
.setOutputTokenEndCol("token_end")\
.setOutputCol("chunk")
pipeline = nlp.Pipeline().setStages(
[document_assembler,
sentenceDetector,
tokenizer,
converter]
)
data = spark.createDataFrame([
["Tom Martin worked as Cadence's CTO until 2010","Cadence's CTO",21,33],
["Mrs. Charles was before Managing Director at a big consultancy company","Managing Director",24,40],
]).toDF("text", "target", "char_begin", "char_end")
results = pipeline.fit(data).transform(data)
results.selectExpr(
"target",
"char_begin",
"char_end",
"token_begin",
"token_end",
"tokens[token_begin].result",
"tokens[token_end].result",
"target",
"chunk",
).show(truncate=False)
+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
|target |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target |chunk |
+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
|Cadence's CTO |21 |33 |4 |4 |Cadence's |Cadence's |Cadence's CTO |[{chunk, 21, 29, Cadence's CTO, {sentence -> 0}, []}] |
|Managing Director|24 |40 |5 |5 |Managing |Managing |Managing Director|[{chunk, 24, 31, Managing Director, {sentence -> 0}, []}]|
+-----------------+----------+--------+-----------+---------+--------------------------+------------------------+-----------------+---------------------------------------------------------+
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("tokens")
converter = legal.AssertionChunkConverter()\
.setInputCols("tokens")\
.setChunkTextCol("target")\
.setChunkBeginCol("char_begin")\
.setChunkEndCol("char_end")\
.setOutputTokenBeginCol("token_begin")\
.setOutputTokenEndCol("token_end")\
.setOutputCol("chunk")
pipeline = nlp.Pipeline().setStages(
[document_assembler,
sentenceDetector,
tokenizer,
converter]
)
data = spark.createDataFrame([
["This Agreement may be executed by different parties hereto","parties",44,50,],
["The Administrative Agent will determine the Dollar Equivalent amount","Agent",19,23,],
]).toDF("text", "target", "char_begin", "char_end")
results = pipeline.fit(data).transform(data)
results.selectExpr(
"target",
"char_begin",
"char_end",
"token_begin",
"token_end",
"tokens[token_begin].result",
"tokens[token_end].result",
"target",
"chunk",
).show(truncate=False)
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|target |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target |chunk |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|parties|44 |50 |7 |6 |parties |different |parties|[{chunk, 44, 42, parties, {sentence -> 0}, []}]|
|Agent |19 |23 |2 |1 |Agent |Administrative |Agent |[{chunk, 19, 17, Agent, {sentence -> 0}, []}] |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("tokens")
val converter = new AssertionChunkConverter()
.setInputCols("tokens")
.setOutputCol("chunk")
.setChunkTextCol("target")
.setChunkBeginCol("char_begin")
.setChunkEndCol("char_end")
.setOutputTokenBeginCol("token_begin")
.setOutputTokenEndCol("token_end")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentenceDetector,
tokenizer,
converter))
val data = Seq(
  ("An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.", "Minnie", 57, 64),
  ("After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation", "PCP", 31, 34)
).toDF("text", "target", "char_begin", "char_end")
val results = pipeline.fit(data).transform(data)
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57 |64 |10 |10 |Minnie |Minnie |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP |31 |34 |5 |5 |PCP |PCP |PCP |[{chunk, 31, 33, PCP, {sentence -> 0}, []}] |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("tokens")
val converter = new AssertionChunkConverter()
.setInputCols("tokens")
.setOutputCol("chunk")
.setChunkTextCol("target")
.setChunkBeginCol("char_begin")
.setChunkEndCol("char_end")
.setOutputTokenBeginCol("token_begin")
.setOutputTokenEndCol("token_end")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentenceDetector,
tokenizer,
converter))
val data = Seq(
  ("Tom Martin worked as Cadence's CTO until 2010", "Cadence's CTO", 21, 33),
  ("Mrs. Charles was before Managing Director at a big consultancy company", "Managing Director", 24, 40)
).toDF("text", "target", "char_begin", "char_end")
val results = pipeline.fit(data).transform(data)
+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
|target |char_begin|char_end|token_begin|token_end|begin|end |begin_result|end_result|result |
+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
|Cadence's CTO |21 |33 |4 |4 |[21] |[29]|Cadence's |Cadence's |[Cadence's CTO] |
|Managing Director|24 |40 |5 |5 |[24] |[31]|Managing |Managing |[Managing Director]|
+-----------------+----------+--------+-----------+---------+-----+----+------------+----------+-------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("tokens")
val converter = new AssertionChunkConverter()
.setInputCols("tokens")
.setOutputCol("chunk")
.setChunkTextCol("target")
.setChunkBeginCol("char_begin")
.setChunkEndCol("char_end")
.setOutputTokenBeginCol("token_begin")
.setOutputTokenEndCol("token_end")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentenceDetector,
tokenizer,
converter))
val data = Seq(
  ("This Agreement may be executed by different parties hereto", "parties", 44, 50),
  ("The Administrative Agent will determine the Dollar Equivalent amount", "Agent", 19, 23)
).toDF("text", "target", "char_begin", "char_end")
val results = pipeline.fit(data).transform(data)
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|target |char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target |chunk |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
|parties|44 |50 |7 |6 |parties |different |parties|[{chunk, 44, 42, parties, {sentence -> 0}, []}]|
|Agent |19 |23 |2 |1 |Agent |Administrative |Agent |[{chunk, 19, 17, Agent, {sentence -> 0}, []}] |
+-------+----------+--------+-----------+---------+--------------------------+------------------------+-------+-----------------------------------------------+
AssertionDL
AssertionDL is a deep learning-based approach used to extract Assertion Status from extracted entities and text. AssertionDLModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained from e.g. a DocumentAssembler, NerConverter and WordEmbeddingsModel.
The result is an assertion status annotation for each recognized entity.
Possible values include “present”, “absent”, “hypothetical”, “conditional”, “associated_with_other_person”, etc.
Parameters:
- inputCols: Gets current column names of input annotations.
- outputCol: Gets output column name of annotations.
- ScopeWindow: Sets the scope of the window of the assertion expression.
- EntityAssertionCaseSensitive: Sets the case sensitivity of entities and assertion labels.
- DoExceptionHandling: If set to True, the annotator tries to process as usual and, if exception-causing data (e.g. a corrupted record/document) is passed to the annotator, an exception warning is emitted which has the exception message.
- datasetInfo (Str): Descriptive information about the dataset being used.
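As a minimal sketch, the optional parameters above can be set on a pretrained model before adding it to a pipeline; the scope window and case-sensitivity values below are illustrative only, not recommendations:
from johnsnowlabs import medical
# Minimal sketch: configure a pretrained AssertionDLModel with the parameters listed above.
# The [before, after] window and case-sensitivity flag are illustrative values.
clinical_assertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"])\
    .setOutputCol("assertion")\
    .setEntityAssertionCaseSensitive(False)\
    .setScopeWindow([9, 15])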
For pretrained models please see the Models Hub for available models.
Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS
Output Annotator Type: ASSERTION
Python API: AssertionDLModel | Scala API: AssertionDLModel | Notebook: AssertionDLModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Define pipeline stages to extract NER chunks first
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
nerModel = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
nerConverter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
# Then a pretrained AssertionDLModel is used to extract the assertion status
clinicalAssertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
assertionPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
clinicalAssertion
])
data = spark.createDataFrame([
["Patient with severe fever and sore throat"],
["Patient shows no stomach pain"],
["She was maintained on an epidural and PCA for pain control."]]).toDF("text")
# Show results
result = assertionPipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as chunk_result", "assertion.result as assertion_result").show(3, truncate=False)
+--------------------------------+--------------------------------+
|chunk_result |assertion_result |
+--------------------------------+--------------------------------+
|[severe fever, sore throat] |[present, present] |
|[stomach pain] |[absent] |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")\
ner_converter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
pipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter,
assertion
])
data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")
# Show results
result = pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata, result.assertion.result)).alias("cols"))\
.select(F.expr("cols['1']['sentence']").alias("sent_id"),
F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label"),
F.expr("cols['2']").alias("assertion")).show(truncate=False)
+-------+------------+---------+----------+
|sent_id|chunk |ner_label|assertion |
+-------+------------+---------+----------+
|0 |McAfee LLC |ORG |COMPETITOR|
|0 |Broadcom Inc|ORG |COMPETITOR|
+-------+------------+---------+----------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings_ner")\
ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
.setInputCols(["sentence", "token", "embeddings_ner"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["DOC", "EFFDATE", "PARTY"])
embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings_ass")
assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings_ass"]) \
.setOutputCol("assertion")
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings_ner,
ner_model,
ner_converter,
embeddings_ass,
assertion
])
data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")
# Show results
result = nlpPipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
result.ner_chunk.begin,
result.ner_chunk.end,
result.ner_chunk.metadata,
result.assertion.result)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']").alias("begin"),
F.expr("cols['2']").alias("end"),
F.expr("cols['3']['entity']").alias("ner_label"),
F.expr("cols['4']").alias("assertion")).show(truncate=False)
+-------------------------------+-----+---+---------+---------+
|chunk |begin|end|ner_label|assertion|
+-------------------------------+-----+---+---------+---------+
|Intellectual Property Agreement|11 |41 |DOC |PRESENT |
|Amazon Inc |51 |60 |PARTY |PRESENT |
|Atlantic Inc |67 |78 |PARTY |PRESENT |
+-------------------------------+-----+---+---------+---------+
import spark.implicits._
// Define pipeline stages to extract NER chunks first
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val nerModel = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
// Then a pretrained AssertionDLModel is used to extract the assertion status
val clinicalAssertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
clinicalAssertion
))
val data = Seq(
"Patient with severe fever and sore throat",
"Patient shows no stomach pain",
"She was maintained on an epidural and PCA for pain control.").toDF("text")
// Show results
val result = assertionPipeline.fit(data).transform(data)
+--------------------------------+--------------------------------+
|chunk_result |assertion_result |
+--------------------------------+--------------------------------+
|[severe fever, sore throat] |[present, present] |
|[stomach pain] |[absent] |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val assertion = AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter,
assertion
))
val data = Seq("Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.").toDF("text")
// Show results
val result = pipeline.fit(data).transform(data)
+-------+------------+---------+----------+
|sent_id|chunk |ner_label|assertion |
+-------+------------+---------+----------+
|0 |McAfee LLC |ORG |COMPETITOR|
|0 |Broadcom Inc|ORG |COMPETITOR|
+-------+------------+---------+----------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_ner = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings_ner")
val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings_ner"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("DOC", "EFFDATE", "PARTY"))
val embeddings_ass = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings_ass")
val assertion = AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings_ass"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings_ner,
ner_model,
ner_converter,
embeddings_ass,
assertion
))
val data = Seq("This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc.").toDF("text")
// Show results
val result = pipeline.fit(data).transform(data)
+-------------------------------+-----+---+---------+---------+
|chunk |begin|end|ner_label|assertion|
+-------------------------------+-----+---+---------+---------+
|Intellectual Property Agreement|11 |41 |DOC |PRESENT |
|Amazon Inc |51 |60 |PARTY |PRESENT |
|Atlantic Inc |67 |78 |PARTY |PRESENT |
+-------------------------------+-----+---+---------+---------+
Trains AssertionDL, a deep learning-based approach used to extract assertion status from extracted entities and text. Contains all the methods for training an AssertionDLModel. For pretrained models, please use AssertionDLModel and see the Models Hub for available models.
Parameters:
- inputCols: Gets current column names of input annotations.
- outputCol: Gets output column name of annotations.
- scopeWindow: Sets the scope of the window of the assertion expression.
- startCol: Set a column that contains the token number for the start of the target.
- datasetInfo (Str): Descriptive information about the dataset being used.
Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS
Output Annotator Type: ASSERTION
Python API: AssertionDLApproach | Scala API: AssertionDLApproach | Notebook: AssertionDLApproachNotebook |
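The training examples below call fit() on a DataFrame named data that is not shown. A purely illustrative sketch of the shape such a training set typically has, with the chunk text, token-level start/end indices and the assertion label in columns matching the setChunkCol, setStartCol, setEndCol and setLabelCol settings used in the pipelines:
# Hypothetical training rows; column names must match the corresponding setters below.
data = spark.createDataFrame(
    [
        ("Patient denies any chest pain.", "chest pain", 3, 4, "absent"),
        ("Patient reports a severe headache.", "severe headache", 3, 4, "present"),
    ],
    ["text", "target", "start", "end", "label"],
)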
Show Example
from johnsnowlabs import nlp, medical
# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
chunk = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setOutputCol("chunk") \
.setChunkCol("target")\
.setStartCol("start")\
.setStartColByTokenIndex(True)\
.setFailOnMissing(False)\
.setLowerCase(True)
token = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
# Define AssertionDLApproach with parameters and start training
assertionStatus = medical.AssertionDLApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "embeddings"]) \
.setOutputCol("assertion") \
.setBatchSize(128) \
.setDropout(0.012) \
.setLearningRate(0.015) \
.setEpochs(1) \
.setStartCol("start") \
.setEndCol("end") \
.setMaxSentLen(250)
trainingPipeline = nlp.Pipeline().setStages([
document,
chunk,
token,
embeddings,
assertionStatus
])
assertionResults = trainingPipeline.fit(data).transform(data).cache()
from johnsnowlabs import nlp, finance
# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
chunk = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setOutputCol("chunk")
token = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
# Define AssertionDLApproach with parameters and start training
assertionStatus = finance.AssertionDLApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "embeddings"]) \
.setOutputCol("assertion") \
.setBatchSize(128) \
.setDropout(0.012) \
.setLearningRate(0.015) \
.setEpochs(1) \
.setStartCol("start") \
.setEndCol("end") \
.setMaxSentLen(250)
trainingPipeline = nlp.Pipeline().setStages([
document,
chunk,
token,
embeddings,
assertionStatus
])
assertionResults = trainingPipeline.fit(data).transform(data).cache()
from johnsnowlabs import nlp, legal
# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler()\
.setInputCol("sentence")\
.setOutputCol("document")
chunk = nlp.Doc2Chunk()\
.setInputCols("document")\
.setOutputCol("doc_chunk")
token = nlp.Tokenizer()\
.setInputCols(['document'])\
.setOutputCol('token')
roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setMaxSentenceLength(512)
# Define AssertionDLApproach with parameters and start training
assertionStatus = legal.AssertionDLApproach()\
.setLabelCol("assertion_label")\
.setInputCols(["document", "doc_chunk", "embeddings"])\
.setOutputCol("assertion")\
.setBatchSize(128)\
.setLearningRate(0.001)\
.setEpochs(2)\
.setStartCol("tkn_start")\
.setEndCol("tkn_end")\
.setMaxSentLen(1200)\
.setEnableOutputLogs(True)\
.setOutputLogsPath('training_logs/')\
.setGraphFolder(graph_folder)\
.setGraphFile(f"{graph_folder}/assertion_graph.pb")\
.setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
.setScopeWindow(scope_window)
#.setValidationSplit(0.2)\
#.setDropout(0.1)\
trainingPipeline = nlp.Pipeline().setStages([
document,
chunk,
token,
roberta_embeddings,
assertionStatus
])
assertionResults = trainingPipeline.fit(data).transform(data).cache()
import spark.implicits._
// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val chunk = new Doc2Chunk()
.setInputCols(Array("document"))
.setOutputCol("chunk")
val token = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
.setLabelCol("label")
.setInputCols(Array("document", "chunk", "embeddings"))
.setOutputCol("assertion")
.setBatchSize(128)
.setDropout(0.012)
.setLearningRate(0.015)
.setEpochs(1)
.setStartCol("start")
.setEndCol("end")
.setMaxSentLen(250)
val trainingPipeline = new Pipeline().setStages(Array(
document,
chunk,
token,
embeddings,
assertionStatus
))
val assertionResults = trainingPipeline.fit(data).transform(data).cache()
import spark.implicits._
// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val chunk = new Doc2Chunk()
.setInputCols(Array("document"))
.setOutputCol("chunk")
val token = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
.setLabelCol("label")
.setInputCols(Array("document", "chunk", "embeddings"))
.setOutputCol("assertion")
.setBatchSize(128)
.setDropout(0.012)
.setLearningRate(0.015)
.setEpochs(1)
.setStartCol("start")
.setEndCol("end")
.setMaxSentLen(250)
val trainingPipeline = new Pipeline().setStages(Array(
document,
chunk,
token,
embeddings,
assertionStatus
))
val assertionResults = trainingPipeline.fit(data).transform(data).cache()
import spark.implicits._
val document = new DocumentAssembler()
.setInputCol("sentence")
.setOutputCol("document")
val chunk = new Doc2Chunk()
.setInputCols(Array("document"))
.setOutputCol("doc_chunk")
.setChunkCol("chunk")
.setStartCol("tkn_start")
.setStartColByTokenIndex(true)
.setFailOnMissing(false)
.setLowerCase(false)
val token = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setMaxSentenceLength(512)
// Define AssertionDLApproach with parameters and start training
val assertionStatus = new AssertionDLApproach()
.setLabelCol("assertion_label")
.setInputCols(Array("document", "doc_chunk", "embeddings"))
.setOutputCol("assertion")
.setBatchSize(128)
.setLearningRate(0.001)
.setEpochs(2)
.setStartCol("tkn_start")
.setEndCol("tkn_end")
.setMaxSentLen(1200)
.setEnableOutputLogs(true)
.setOutputLogsPath("training_logs/")
.setGraphFolder(graph_folder)
.setGraphFile(s"${graph_folder}/assertion_graph.pb")
.setTestDataset("test_data.parquet")
.setScopeWindow(scope_window)
//.setValidationSplit(0.2f)
//.setDropout(0.1f)
val trainingPipeline = new Pipeline().setStages(Array(
document,
chunk,
token,
roberta_embeddings,
assertionStatus
))
val assertionResults = trainingPipeline.fit(data).transform(data).cache()
AssertionFilterer
Filters entities coming from ASSERTION type annotations and returns the CHUNKS.
Filters can be set via a white list on the extracted chunk, the assertion, or a regular expression.
The white list for assertion is enabled by default. To use a chunk white list, criteria has to be set to "isin".
For regex, criteria has to be set to "regex".
Parameters:
- whiteList: (list) If defined, list of entities to process. The rest will be ignored.
- caseSensitive: (bool) Determines whether the definitions of the white listed entities are case sensitive.
- regex: (list) List of dash-separated pairs of named entities.
- criteria: Tag representing the criteria used to filter the chunks. Possible values: assertion (filter by the assertion), isin (filter by the chunk), regex (filter using a regex).
- entitiesConfidence: (Str) Entity pairs to remove based on the confidence level.
Input Annotator Types: DOCUMENT, CHUNK, ASSERTION
Output Annotator Type: CHUNK
Python API: AssertionFilterer | Scala API: AssertionFilterer | Notebook: AssertionFiltererNotebook |
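The examples below filter on the assertion labels (the default behaviour). A minimal sketch of the other two modes described above, assuming the setters follow the parameter names listed (the chunk texts and the regex are illustrative):
from johnsnowlabs import medical

# Keep only chunks whose text appears in the white list (criteria "isin").
chunk_filter = medical.AssertionFilterer() \
    .setInputCols(["sentence", "ner_chunk", "assertion"]) \
    .setOutputCol("assertion_filtered") \
    .setCriteria("isin") \
    .setWhiteList(["headache", "alopecia"])

# Keep only chunks whose text matches a regular expression (criteria "regex").
regex_filter = medical.AssertionFilterer() \
    .setInputCols(["sentence", "ner_chunk", "assertion"]) \
    .setOutputCol("assertion_filtered") \
    .setCriteria("regex") \
    .setRegex(["(?i)head\\w+"])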
Show Example
from johnsnowlabs import nlp, medical
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")\
#.setIncludeAllConfidenceScores(False)
ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["PROBLEM", "TEST","TREATMENT"])
clinical_assertion = medical.AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
assertion_filterer = medical.AssertionFilterer()\
.setInputCols("sentence","ner_chunk","assertion")\
.setOutputCol("assertion_filtered")\
.setCaseSensitive(False)\
.setWhiteList(["Present"])
#or .setBlackList(["absent"])
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
clinical_assertion,
assertion_filterer
])
data = spark.createDataFrame([["Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain."]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
# Show results:
result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+------------------------------------------------+--------------------------------------------------+
|ner_chunk |assertion |
+------------------------------------------------+--------------------------------------------------+
|[a headache, a head CT, anxious, Alopecia, pain]|[Present, Hypothetical, Possible, Present, Absent]|
+------------------------------------------------+--------------------------------------------------+
result.select("filtered.result").show(3, truncate=False)
+----------------------+
|result |
+----------------------+
|[a headache, Alopecia]|
+----------------------+
from johnsnowlabs import nlp, finance
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")\
ner_converter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
assertion_filterer = finance.AssertionFilterer()\
.setInputCols("sentence","ner_chunk","assertion")\
.setOutputCol("assertion_filtered")\
.setCaseSensitive(False)\
.setWhiteList(["Competitor"])
pipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter,
assertion,
assertion_filterer
])
data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")
# Show results
result = pipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+--------------------------+------------------------+
|ner_chunk |assertion |
+--------------------------+------------------------+
|[McAfee LLC, Broadcom Inc]|[COMPETITOR, COMPETITOR]|
+--------------------------+------------------------+
result.select("assertion_filtered.result").show(3, truncate=False)
+--------------------------+
|result |
+--------------------------+
|[McAfee LLC, Broadcom Inc]|
+--------------------------+
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings_ner")\
ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
.setInputCols(["sentence", "token", "embeddings_ner"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["DOC", "EFFDATE", "PARTY"])
embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings_ass")
assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings_ass"]) \
.setOutputCol("assertion")
assertion_filterer = legal.AssertionFilterer()\
.setInputCols("sentence","ner_chunk","assertion")\
.setOutputCol("assertion_filtered")\
.setCaseSensitive(False)\
.setWhiteList(["Present"])
nlpPipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings_ner,
ner_model,
ner_converter,
embeddings_ass,
assertion,
assertion_filterer
])
data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")
# Show results
result = nlpPipeline.fit(data).transform(data)
result.selectExpr("ner_chunk.result as ner_chunk", "assertion.result as assertion").show(3, truncate=False)
+-----------------------------------------------------------+---------------------------+
|ner_chunk |assertion |
+-----------------------------------------------------------+---------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|[PRESENT, PRESENT, PRESENT]|
+-----------------------------------------------------------+---------------------------+
result.select("assertion_filtered.result").show(3, truncate=False)
+-----------------------------------------------------------+
|result |
+-----------------------------------------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|
+-----------------------------------------------------------+
import spark.implicits._
// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
// Sentence Detector annotator, processes various sentences per line
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
// Clinical word embeddings trained on PubMED dataset
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
//.setIncludeAllConfidenceScores(false)
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("PROBLEM", "TEST","TREATMENT"))
val clinical_assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val assertion_filterer = new AssertionFilterer()
.setInputCols("sentence","ner_chunk","assertion")
.setOutputCol("assertion_filtered")
.setCaseSensitive(false)
.setWhiteList(Array("Present"))
//or .setBlackList(Array("absent"))
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
clinical_assertion,
assertion_filterer
))
val text ="""Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. Alopecia noted. She denies pain."""
val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
// Show results:
+------------------------------------------------+--------------------------------------------------+
|ner_chunk |assertion |
+------------------------------------------------+--------------------------------------------------+
|[a headache, a head CT, anxious, Alopecia, pain]|[Present, Hypothetical, Possible, Present, Absent]|
+------------------------------------------------+--------------------------------------------------+
+----------------------+
|result |
+----------------------+
|[a headache, Alopecia]|
+----------------------+
import spark.implicits._
// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
// Sentence Detector annotator, processes various sentences per line
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
// SEC BERT embeddings (financial domain)
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val assertion = AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val assertion_filterer = new AssertionFilterer()
.setInputCols("sentence","ner_chunk","assertion")
.setOutputCol("assertion_filtered")
.setCaseSensitive(false)
.setWhiteList(Array("Competitor"))
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter,
assertion,
assertion_filterer
))
val text ="""Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."""
val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
// Show results:
+--------------------------+------------------------+
|ner_chunk |assertion |
+--------------------------+------------------------+
|[McAfee LLC, Broadcom Inc]|[COMPETITOR, COMPETITOR]|
+--------------------------+------------------------+
+--------------------------+
|result |
+--------------------------+
|[McAfee LLC, Broadcom Inc]|
+--------------------------+
import spark.implicits._
// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
// Sentence Detector annotator, processes various sentences per line
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
// Legal RoBERTa embeddings
val embeddings_ner = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings_ner")
val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings_ner"))
.setOutputCol("ner")
//.setIncludeAllConfidenceScores(false)
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("DOC", "EFFDATE", "PARTY"))
val embeddings_ass = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings_ass")
val assertion = AssertionDLModel.pretrained("assertion_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings_ass"))
.setOutputCol("assertion")
val assertion_filterer = new AssertionFilterer()
.setInputCols("sentence","ner_chunk","assertion")
.setOutputCol("assertion_filtered")
.setCaseSensitive(false)
.setWhiteList(Array("Present"))
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings_ner,
ner_model,
ner_converter,
embeddings_ass,
assertion,
assertion_filterer
))
val text ="""This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."""
val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
// Show results:
+-----------------------------------------------------------+---------------------------+
|ner_chunk |assertion |
+-----------------------------------------------------------+---------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|[PRESENT, PRESENT, PRESENT]|
+-----------------------------------------------------------+---------------------------+
+-----------------------------------------------------------+
|result |
+-----------------------------------------------------------+
|[Intellectual Property Agreement, Amazon Inc, Atlantic Inc]|
+-----------------------------------------------------------+
AssertionLogReg
This is the main class in the AssertionLogReg family. Logistic Regression is used to extract Assertion Status from extracted entities and text. AssertionLogRegModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained by, e.g., a DocumentAssembler, NerConverter and WordEmbeddingsModel. The result is an assertion status annotation for each recognized entity.
Possible values are "Negated", "Affirmed" and "Historical".
Unlike the DL model, this class does not extend AnnotatorModel. Instead it extends RawAnnotator, which is why the main point of interest is the method transform().
At the moment there are no pretrained models available for this class. Please refer to AssertionLogRegApproach to train your own model.
Parameters:
- setAfter(Int): Length of the context after the target (Default: 13)
- setBefore(Int): Length of the context before the target (Default: 11)
- setEndCol(String): Column that contains the token number for the end of the target
- setStartCol(String): Column that contains the token number for the start of the target
Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS
Output Annotator Type: ASSERTION
Python API: AssertionLogRegModel | Scala API: AssertionLogRegModel | Notebook: AssertionLogRegModelNotebook |
Trains a classification method which uses the Logistic Regression algorithm. It is used to extract Assertion Status from extracted entities and text. Contains all the methods for training an AssertionLogRegModel, together with trainWithChunk and trainWithStartEnd.
Parameters:
- label: Column with label per each token
- maxIter: This specifies the maximum number of iterations to be performed in the model's training, default: 26
- regParam: This specifies the regularization parameter. Regularization helps to control the complexity of the model, aiding in preventing the issue of overfitting.
- eNetParam: Elastic net parameter
- beforeParam: Length of the context before the target
- afterParam: Length of the context after the target
- startCol: Column that contains the token number for the start of the target
- endCol: Column that contains the token number for the end of the target
Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS
Output Annotator Type: ASSERTION
Python API: AssertionLogRegApproach | Scala API: AssertionLogRegApproach | Notebook: AssertionLogRegApproachNotebook |
Show Example
from johnsnowlabs import nlp, medical
# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("word_embeddings") \
.setCaseSensitive(False)
chunkAssembler = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setChunkCol("target") \
.setOutputCol("chunk")
# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = medical.AssertionLogRegApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "word_embeddings"]) \
.setOutputCol("assertion") \
.setReg(0.01) \
.setBefore(11) \
.setAfter(13) \
.setStartCol("start") \
.setEndCol("end")
assertionPipeline = nlp.Pipeline(stages=[
documentAssembler,
tokenizer,
glove,
chunkAssembler,
assertion
])
assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import nlp, finance
# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("word_embeddings") \
.setCaseSensitive(False)
chunkAssembler = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setChunkCol("target") \
.setOutputCol("chunk")
# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = finance.AssertionLogRegApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "word_embeddings"]) \
.setOutputCol("assertion") \
.setReg(0.01) \
.setBefore(11) \
.setAfter(13) \
.setStartCol("start") \
.setEndCol("end")
assertionPipeline = nlp.Pipeline(stages=[
documentAssembler,
tokenizer,
glove,
chunkAssembler,
assertion
])
assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import nlp, legal
# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("word_embeddings") \
.setCaseSensitive(False)
chunkAssembler = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setChunkCol("target") \
.setOutputCol("chunk")
# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = legal.AssertionLogRegApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "word_embeddings"]) \
.setOutputCol("assertion") \
.setReg(0.01) \
.setBefore(11) \
.setAfter(13) \
.setStartCol("start") \
.setEndCol("end")
assertionPipeline = nlp.Pipeline(stages=[
documentAssembler,
tokenizer,
glove,
chunkAssembler,
assertion
])
assertionModel = assertionPipeline.fit(dataset)
import spark.implicits._
// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val glove = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
.setCaseSensitive(false)
val chunkAssembler = new Doc2Chunk()
.setInputCols("document")
.setChunkCol("target")
.setOutputCol("chunk")
// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new AssertionLogRegApproach()
.setLabelCol("label")
.setInputCols(Array("document", "chunk", "word_embeddings"))
.setOutputCol("assertion")
.setReg(0.01)
.setBefore(11)
.setAfter(13)
.setStartCol("start")
.setEndCol("end")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
glove,
chunkAssembler,
assertion
))
val assertionModel = assertionPipeline.fit(dataset)
import spark.implicits._
// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val glove = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
.setCaseSensitive(false)
val chunkAssembler = new Doc2Chunk()
.setInputCols("document")
.setChunkCol("target")
.setOutputCol("chunk")
// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new AssertionLogRegApproach()
.setLabelCol("label")
.setInputCols(Array("document", "chunk", "word_embeddings"))
.setOutputCol("assertion")
.setReg(0.01)
.setBefore(11)
.setAfter(13)
.setStartCol("start")
.setEndCol("end")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
glove,
chunkAssembler,
assertion
))
val assertionModel = assertionPipeline.fit(dataset)
import spark.implicits._
// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val glove = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
.setCaseSensitive(false)
val chunkAssembler = new Doc2Chunk()
.setInputCols("document")
.setChunkCol("target")
.setOutputCol("chunk")
// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new AssertionLogRegApproach()
.setLabelCol("label")
.setInputCols(Array("document", "chunk", "word_embeddings"))
.setOutputCol("assertion")
.setReg(0.01)
.setBefore(11)
.setAfter(13)
.setStartCol("start")
.setEndCol("end")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
glove,
chunkAssembler,
assertion
))
val assertionModel = assertionPipeline.fit(dataset)
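Once fitted, the pipeline above returns a standard Spark ML PipelineModel, so the trained assertion model can be persisted and reloaded with the usual writer/reader API. A minimal sketch for the Python version (the path is illustrative):
from pyspark.ml import PipelineModel

# Persist the fitted pipeline (including the trained AssertionLogRegModel) and reload it later.
assertionModel.write().overwrite().save("/tmp/assertion_logreg_pipeline")
reloaded = PipelineModel.load("/tmp/assertion_logreg_pipeline")
predictions = reloaded.transform(dataset)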
AssertionMerger
Merges a variety of assertion columns coming from Assertion annotators like sparknlp_jsl.annotator.assertion.AssertionDLModel.
AssertionMerger can filter, prioritize and merge assertion annotations by using proper parameters.
See also: sparknlp_jsl.annotator.WhiteBlackListParams for filtering options.
Parameters:
- mergeOverlapping (Bool): Whether to merge overlapping matched assertions.
- applyFilterBeforeMerge (Bool): Whether to apply filtering before merging.
- assertionsConfidence (dict[str, float]): Pairs (assertion, confidenceThreshold) to filter assertions which have confidence lower than the confidence threshold.
- orderingFeatures (list[str]): Specifies the ordering features to use for overlapping entities. Possible values include: 'begin', 'end', 'length', 'source', 'confidence'. Default: ['begin', 'length', 'source']
- selectionStrategy (str): Determines the strategy for selecting annotations. Annotations can be selected either sequentially based on their order (Sequential) or using a more diverse strategy (DiverseLonger). Currently, only Sequential and DiverseLonger options are available. Default: Sequential.
- defaultConfidence (float): When the confidence value is included in the orderingFeatures and a given annotation does not have any confidence, this parameter determines the value to be used. The default value is 0.
- assertionSourcePrecedence (str): Specifies the assertion sources to use for prioritizing overlapping annotations when the 'source' ordering feature is utilized. This parameter contains a comma-separated list of assertion sources that drive the prioritization. Annotations will be prioritized based on the order of the given string.
- sortByBegin (Bool): Whether to sort the annotations by begin at the end of the merge and filter process. Default: False.
- blackList (list[str]): If defined, list of entities to ignore. The rest will be processed.
- whiteList (list[str]): If defined, list of entities to process. The rest will be ignored. Do not include IOB prefix on labels.
- caseSensitive (Bool): Determines whether the definitions of the white listed and black listed entities are case sensitive. Default: True.
- majorityVoting (Bool): Whether to use majority voting to resolve conflicts. Default: False.
Input Annotator Types: ASSERTION
Output Annotator Type: ASSERTION
Python API: AssertionMerger | Scala API: AssertionMerger |
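A minimal sketch of an alternative configuration, assuming the setters follow the parameter names listed above (the label values are illustrative):
from johnsnowlabs import medical

# Illustrative: resolve overlapping assertions by majority vote, drop hypothetical labels
# and sort the merged output by begin index.
assertion_merger = medical.AssertionMerger() \
    .setInputCols(["assertion_jsl", "assertion_dl"]) \
    .setOutputCol("assertion_merger") \
    .setMergeOverlapping(True) \
    .setMajorityVoting(True) \
    .setBlackList(["hypothetical"]) \
    .setCaseSensitive(False) \
    .setSortByBegin(True)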
Show Example
from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_jsl = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner_jsl")\
#.setIncludeAllConfidenceScores(False)
ner_jsl_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner_jsl"]) \
.setOutputCol("ner_jsl_chunk")\
.setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])
assertion_jsl = medical.AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "ner_jsl_chunk", "embeddings"]) \
.setOutputCol("assertion_jsl")\
.setEntityAssertionCaseSensitive(False)
ner_clinical = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner_clinical")\
#.setIncludeAllConfidenceScores(False)
ner_clinical_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner_clinical"]) \
.setOutputCol("ner_clinical_chunk")\
assertion_dl = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
.setInputCols(["sentence", "ner_clinical_chunk", "embeddings"]) \
.setOutputCol("assertion_dl")
assertion_merger = medical.AssertionMerger() \
.setInputCols("assertion_jsl", "assertion_dl") \
.setOutputCol("assertion_merger") \
.setMergeOverlapping(True) \
.setSelectionStrategy("sequential") \
.setAssertionSourcePrecedence("assertion_dl, assertion_jsl") \
.setCaseSensitive(False) \
.setAssertionsConfidence({"past": 0.70}) \
.setOrderingFeatures(["length", "source", "confidence"]) \
.setDefaultConfidence(0.50)
#.setBlackList(["HYPothetical"])
pipeline = nlp.Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_jsl,
ner_jsl_converter,
assertion_jsl,
ner_clinical,
ner_clinical_converter,
assertion_dl,
assertion_merger])
data = spark.createDataFrame([
"""Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She got antidepressant. We prescribed sleeping pills for her current insomnia."""], StringType()).toDF("text")
data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())
results = pipeline.fit(data).transform(data)
## Result
+---+--------------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk |begin|end|ner_label|assertion|assertion_source|confidence|
+---+--------------+-----+---+---------+---------+----------------+----------+
|0 |headache |14 |21 |Symptom |Past |assertion_jsl |0.9999 |
|0 |anxious |57 |63 |PROBLEM |present |assertion_dl |0.9392 |
|0 |alopecia |89 |96 |PROBLEM |absent |assertion_dl |0.9992 |
|0 |pain |116 |119|PROBLEM |absent |assertion_dl |0.9884 |
|0 |paralyzed |136 |144|Symptom |Family |assertion_jsl |0.9995 |
|0 |stressor |158 |165|Symptom |Family |assertion_jsl |1.0 |
|0 |antidepressant|184 |197|TREATMENT|present |assertion_dl |0.9628 |
|0 |sleeping pills|214 |227|TREATMENT|present |assertion_dl |0.998 |
|0 |insomnia |245 |252|Symptom |Past |assertion_jsl |0.9862 |
+---+--------------+-----+---+---------+---------+----------------+----------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_jsl = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_jsl")
//.setIncludeAllConfidenceScores(false)
val ner_jsl_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_jsl"))
.setOutputCol("ner_jsl_chunk")
.setWhiteList(Array("SYMPTOM", "VS_FINDING", "DISEASE_SYNDROME_DISORDER", "ADMISSION_DISCHARGE", "PROCEDURE"))
val assertion_jsl = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_jsl_chunk", "embeddings"))
.setOutputCol("assertion_jsl")
.setEntityAssertionCaseSensitive(false)
val ner_clinical = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_clinical")
//.setIncludeAllConfidenceScores(false)
val ner_clinical_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_clinical"))
.setOutputCol("ner_clinical_chunk")
val assertion_dl = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_clinical_chunk", "embeddings"))
.setOutputCol("assertion_dl")
val assertion_merger = new AssertionMerger()
.setInputCols("assertion_jsl", "assertion_dl")
.setOutputCol("assertion_merger")
.setMergeOverlapping(true)
.setSelectionStrategy("sequential")
.setAssertionSourcePrecedence("assertion_dl, assertion_jsl")
.setCaseSensitive(false)
.setAssertionsConfidence(Map("past"-> 0.70f))
.setOrderingFeatures(Array("length", "source", "confidence"))
.setDefaultConfidence(0.50f)
// .setBlackList(("HYPothetical"))
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_jsl,
ner_jsl_converter,
assertion_jsl,
ner_clinical,
ner_clinical_converter,
assertion_dl,
assertion_merger))
val text = "Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She got antidepressant. We prescribed sleeping pills for her current insomnia."
val data = Seq(text).toDF("text")
//val data = data.coalesce(1).withColumn("idx", F.monotonically_increasing_id())
val results = pipeline.fit(data).transform(data)
// Result
+---+--------------+-----+---+---------+---------+----------------+----------+
|idx|ner_chunk |begin|end|ner_label|assertion|assertion_source|confidence|
+---+--------------+-----+---+---------+---------+----------------+----------+
|0 |headache |14 |21 |Symptom |Past |assertion_jsl |0.9999 |
|0 |anxious |57 |63 |PROBLEM |present |assertion_dl |0.9392 |
|0 |alopecia |89 |96 |PROBLEM |absent |assertion_dl |0.9992 |
|0 |pain |116 |119|PROBLEM |absent |assertion_dl |0.9884 |
|0 |paralyzed |136 |144|Symptom |Family |assertion_jsl |0.9995 |
|0 |stressor |158 |165|Symptom |Family |assertion_jsl |1.0 |
|0 |antidepressant|184 |197|TREATMENT|present |assertion_dl |0.9628 |
|0 |sleeping pills|214 |227|TREATMENT|present |assertion_dl |0.998 |
|0 |insomnia |245 |252|Symptom |Past |assertion_jsl |0.9862 |
+---+--------------+-----+---+---------+---------+----------------+----------+
AverageEmbeddings
AverageEmbeddings computes the mean of vector embeddings for two sentences of equal size, producing a unified representation.
Parameters:
- inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
- outputCol: The name of the column in Document type that is generated. We can specify only one column here.
All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
Input Annotator Types: SENTENCE_EMBEDDINGS, SENTENCE_EMBEDDINGS, CHUNK
Output Annotator Type: EMBEDDINGS
Python API: AverageEmbeddings | Scala API: AverageEmbeddings | Notebook: AverageEmbeddingsNotebook |
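Conceptually the annotator takes the element-wise mean of the two sentence vectors; a toy illustration of the arithmetic (the values are made up):
# Element-wise mean of two equally sized embeddings.
v1 = [0.2, 0.4, -0.6]
v2 = [0.6, 0.0, 0.2]
avg = [(a + b) / 2 for a, b in zip(v1, v2)]  # [0.4, 0.2, -0.2]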
Show Example
from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")\
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
doc2Chunk = nlp.Doc2Chunk() \
.setInputCols("sentence") \
.setOutputCol("chunk") \
.setIsArray(True)
sbiobert_base_cased_mli = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols("sentence")\
.setOutputCol("sbiobert_base_cased_mli")
sent_biobert_clinical_base_cased = nlp.BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en") \
.setInputCols("sentence") \
.setOutputCol("sent_biobert_clinical_base_cased")
avg_embeddings = medical.AverageEmbeddings()\
.setInputCols(["sent_biobert_clinical_base_cased","sbiobert_base_cased_mli","chunk"])\
.setOutputCol("embeddings")
pipeline = nlp.Pipeline(
stages=[
document_assembler,
sentence_detector,
doc2Chunk,
sbiobert_base_cased_mli,
sent_biobert_clinical_base_cased,
avg_embeddings
])
data = spark.createDataFrame([[" The patient was prescribed 1 capsule of Advil for 5 days "]]).toDF("text")
result = pipeline.fit(data).transform(data)
result_df = result.select(F.explode(F.arrays_zip(result.chunk.result,
result.chunk.metadata,
result.sentence.result,
result.embeddings.embeddings,
result.sent_biobert_clinical_base_cased.embeddings,
result.sbiobert_base_cased_mli.embeddings,)).alias("cols"))\
.select(F.expr("cols['0']").alias("sentence"),
F.expr("cols['1']").alias("sentence_metadata"),
F.expr("cols['2']").alias("chunk"),
F.expr("cols['3']").alias("embeddings"),
F.expr("cols['4']").alias("sent_biobert_clinical_base_cased"),
F.expr("cols['5']").alias("sbiobert_base_cased_mli"))
result_df.show(50, truncate=1000)
## Result
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| sentence| sentence_metadata| chunk| embeddings| sent_biobert_clinical_base_cased| sbiobert_base_cased_mli|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|{sentence -> 0, chunk -> 0}|The patient was prescribed 1 capsule of Advil f...|[0.32466835, 0.12497781, -0.20237188, 0.3716198...|[-0.07857181, -0.061015874, -0.020198729, 0.177...|[0.7279085, 0.3109715, -0.38454503, 0.5657965, ...|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val doc2Chunk = new Doc2Chunk()
.setInputCols("sentence")
.setOutputCol("chunk")
.setIsArray(true)
val sbiobert_base_cased_mli = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("sentence")
.setOutputCol("sbiobert_base_cased_mli")
val sent_biobert_clinical_base_cased = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased","en")
.setInputCols("sentence")
.setOutputCol("sent_biobert_clinical_base_cased")
val avg_embeddings = new AverageEmbeddings()
.setInputCols(Array("sent_biobert_clinical_base_cased","sbiobert_base_cased_mli","chunk"))
.setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
doc2Chunk,
sbiobert_base_cased_mli,
sent_biobert_clinical_base_cased,
avg_embeddings))
val data = Seq(" The patient was prescribed 1 capsule of Advil for 5 days").toDF("text")
val result = pipeline.fit(data).transform(data)
// Show results
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| sentence| sentence_metadata| chunk| embeddings| sent_biobert_clinical_base_cased| sbiobert_base_cased_mli|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|{sentence -> 0, chunk -> 0}|The patient was prescribed 1 capsule of Advil f...|[0.32466835, 0.12497781, -0.20237188, 0.3716198...|[-0.07857181, -0.061015874, -0.020198729, 0.177...|[0.7279085, 0.3109715, -0.38454503, 0.5657965, ...|
+--------------------------------------------------+---------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
BertForSequenceClassification
BertForSequenceClassification
can load Bert Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.
Parameters:
- batchSize: Size of every batch (default: 8).
- coalesceSentences: Instead of one class per sentence (when inputCols is "sentence"), output one class per document by averaging the probabilities of all sentences (default: False).
- maxSentenceLength: Max sentence length to process (default: 128).
- caseSensitive: Whether to ignore case in tokens for embeddings matching (default: True).
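A minimal sketch of how these parameters could be set on the pretrained model, using the usual camel-case setters; the values below are illustrative only and not part of the original example:
sequenceClassifier = medical.BertForSequenceClassification\
    .pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("classes")\
    .setBatchSize(8)\
    .setCoalesceSentences(True)\
    .setMaxSentenceLength(128)\
    .setCaseSensitive(True)  # illustrative values; see the parameter descriptions above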
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CATEGORY
Python API: BertForSequenceClassification | Scala API: BertForSequenceClassification |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier = medical.BertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("classes")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
text =[["Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH."],
["Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."]]
data = spark.createDataFrame(text).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("text", "classes.result").show(2,truncate=100)
| text | result |
|------------------------------------------------------------------------------------------------|-------|
| Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair... | [False] |
| Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep... | [False] |
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("classes")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
sequenceClassifier))
val text = Seq(
  "Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair 2137. Retired schoolteacher, now substitutes. Lives with wife in location 1439. Has a 27 yo son and a 25 yo daughter. Name (NI) past or present smoking hx, no EtOH.",
  "Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep Apnea. Denies tobacco and ETOH. Works as cafeteria worker."
)
val data = text.toDF("text")
val result = pipeline.fit(data).transform(data)
| text | result |
|------------------------------------------------------------------------------------------------|-------|
| Right inguinal hernia repair in childhood Cervical discectomy 3 years ago Umbilical hernia repair... | [False] |
| Atrial Septal Defect with Right Atrial Thrombus Pulmonary Hypertension Obesity, Obstructive Sleep... | [False] |
BertForTokenClassifier
BertForTokenClassifier
can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) for Named-Entity-Recognition (NER) tasks.
Parameters:
- caseSensitive (Boolean): Whether to lowercase tokens or not (Default: False).
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: NAMED_ENTITY
Python API: BertForTokenClassifier | Scala API: BertForTokenClassifier |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
tokenClassifier = medical.BertForTokenClassification.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")\
.setInputCols("token", "sentence")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = medical.NerConverterInternal()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter
])
text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation .
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""
data = spark.createDataFrame([[text]]).toDF("text")
res = pipeline.fit(data).transform(data)
res.select(F.explode(F.arrays_zip(res.ner_chunk.result, res.ner_chunk.begin, res.ner_chunk.end, res.ner_chunk.metadata)).alias("cols")) \
.select(F.expr("cols['3']['sentence']").alias("sentence_id"),
F.expr("cols['0']").alias("chunk"),
F.expr("cols['2']").alias("end"),
F.expr("cols['3']['entity']").alias("ner_label"))\
.filter("ner_label!='O'")\
.show(truncate=False)
## Result
+-----------+-----------------------------+---+---------+
|sentence_id|chunk |end|ner_label|
+-----------+-----------------------------+---+---------+
|0 |gestational diabetes mellitus|67 |PROBLEM |
|0 |type two diabetes mellitus |153|PROBLEM |
|0 |T2DM |160|PROBLEM |
|0 |HTG-induced pancreatitis |209|PROBLEM |
|0 |an acute hepatitis |280|PROBLEM |
|0 |obesity |294|PROBLEM |
|0 |a body mass index |317|TEST |
|0 |BMI |323|TEST |
|0 |polyuria |387|PROBLEM |
|0 |polydipsia |400|PROBLEM |
|0 |poor appetite |416|PROBLEM |
|0 |vomiting |431|PROBLEM |
|1 |amoxicillin |521|TREATMENT|
|1 |a respiratory tract infection|555|PROBLEM |
|2 |metformin |578|TREATMENT|
|2 |glipizide |590|TREATMENT|
|2 |dapagliflozin |610|TREATMENT|
|2 |T2DM |619|PROBLEM |
|2 |atorvastatin |636|TREATMENT|
|2 |gemfibrozil |652|TREATMENT|
+-----------+-----------------------------+---+---------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_clinical", "en", "clinical/models")
.setInputCols(Array("token", "sentence"))
.setOutputCol("ner")
  .setCaseSensitive(true)
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
tokenClassifier,
ner_converter))
val text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting .
Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation .
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely . She had close follow-up with endocrinology post discharge ."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
// Result
+-----------+-----------------------------+---+---------+
|sentence_id|chunk |end|ner_label|
+-----------+-----------------------------+---+---------+
|0 |gestational diabetes mellitus|67 |PROBLEM |
|0 |type two diabetes mellitus |153|PROBLEM |
|0 |T2DM |160|PROBLEM |
|0 |HTG-induced pancreatitis |209|PROBLEM |
|0 |an acute hepatitis |280|PROBLEM |
|0 |obesity |294|PROBLEM |
|0 |a body mass index |317|TEST |
|0 |BMI |323|TEST |
|0 |polyuria |387|PROBLEM |
|0 |polydipsia |400|PROBLEM |
|0 |poor appetite |416|PROBLEM |
|0 |vomiting |431|PROBLEM |
|1 |amoxicillin |521|TREATMENT|
|1 |a respiratory tract infection|555|PROBLEM |
|2 |metformin |578|TREATMENT|
|2 |glipizide |590|TREATMENT|
|2 |dapagliflozin |610|TREATMENT|
|2 |T2DM |619|PROBLEM |
|2 |atorvastatin |636|TREATMENT|
|2 |gemfibrozil |652|TREATMENT|
+-----------+-----------------------------+---+---------+
BertSentenceChunkEmbeddings
This annotator allows aggregating sentence embeddings with NER chunk embeddings to get specific and more accurate resolution codes. It works by averaging sentence and chunk embeddings to add contextual information to the embedding value. The input to this annotator is the context (sentence) and NER chunks, while the output is an embedding for each chunk that can be fed to the resolver model.
Parameters:
- inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
- outputCol: The name of the column in Document type that is generated. We can specify only one column here.
- chunkWeight: Relative weight of chunk embeddings in comparison to sentence embeddings. The value should be between 0 and 1. The default is 0.5, which means the chunk and sentence embeddings are given equal weight.
- maxSentenceLength: Max sentence length to process, by default 128.
- caseSensitive: Determines whether the definitions of the whitelisted entities are case sensitive.
All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
For more information and examples of the BertSentenceChunkEmbeddings annotator, you can check the Spark NLP Workshop, in particular the notebook 24.1.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb.
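As a minimal configuration sketch (the values are illustrative and not from the original example): a higher chunkWeight biases the averaged embedding toward the NER chunk, and maxSentenceLength bounds the amount of context that is encoded.
sentence_chunk_embeddings = medical.BertSentenceChunkEmbeddings\
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["document", "ner_chunk"])\
    .setOutputCol("sentence_embeddings")\
    .setChunkWeight(0.8)\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(False)  # illustrative settings; tune chunkWeight between 0 and 1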
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: SENTENCE_EMBEDDINGS
Python API: BertSentenceChunkEmbeddings | Scala API: BertSentenceChunkEmbeddings | Notebook: BertSentenceChunkEmbeddingsNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Define the pipeline
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["document", "token"])\
.setOutputCol("word_embeddings")
clinical_ner = medical.NerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \
.setInputCols(["document", "token", "word_embeddings"]) \
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal() \
.setInputCols(["document", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(['ABBR'])
sentence_chunk_embeddings = medical.BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["document", "ner_chunk"])\
.setOutputCol("sentence_embeddings")\
.setChunkWeight(0.5)\
.setCaseSensitive(True)
resolver_pipeline = nlp.Pipeline(
stages = [
document_assembler,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
sentence_chunk_embeddings
])
sample_text = [
"""The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991.""",
"""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal.
Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]
from pyspark.sql.types import StringType, IntegerType
df = spark.createDataFrame(sample_text, StringType()).toDF('text')
result = resolver_pipeline.fit(df).transform(df)
result.selectExpr("explode(sentence_embeddings) AS s")\
.selectExpr("s.result", "slice(s.embeddings, 1, 5) AS averageEmbedding")\
.show(truncate=False)
+------+--------------------------------------------------------------+
|result|averageEmbedding |
+------+--------------------------------------------------------------+
|IR |[0.11792798, 0.36022937, -1.0620842, 0.87576616, 0.5389829] |
|CBC |[-0.07262431, -0.671684, 0.009878114, 0.76053196, 0.4687413] |
|AB |[-0.2781681, -0.43619046, -0.20924012, 0.84943366, 0.40831584]|
|VDRL |[-0.07109344, -0.20644212, 0.0367461, 0.43459156, 0.3684616] |
|HIV |[-0.1740405, -0.4599509, -0.041505605, 0.61368394, 0.66777927]|
+------+--------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("tokens")
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "tokens"))
.setOutputCol("word_embeddings")
val nerModel = MedicalNerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models")
.setInputCols(Array("document", "tokens", "word_embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverterInternal()
.setInputCols("document", "tokens", "ner")
.setOutputCol("ner_chunk")
  .setWhiteList(Array("ABBR"))
val sentenceChunkEmbeddings = BertSentenceChunkEmbeddings.pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")
.setInputCols(Array("document", "ner_chunk"))
.setOutputCol("sentence_embeddings")
.setChunkWeight(0.5)
  .setCaseSensitive(true)
val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  wordEmbeddings,
  nerModel,
  nerConverter,
  sentenceChunkEmbeddings))
val sampleText = "The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991." +
  " Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal." +
  " Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."
val data = Seq(sampleText).toDF("text")
val result = pipeline.fit(data).transform(data)
+------+--------------------------------------------------------------+
|result|averageEmbedding |
+------+--------------------------------------------------------------+
|IR |[0.11792798, 0.36022937, -1.0620842, 0.87576616, 0.5389829] |
|CBC |[-0.07262431, -0.671684, 0.009878114, 0.76053196, 0.4687413] |
|AB |[-0.2781681, -0.43619046, -0.20924012, 0.84943366, 0.40831584]|
|VDRL |[-0.07109344, -0.20644212, 0.0367461, 0.43459156, 0.3684616] |
|HIV |[-0.1740405, -0.4599509, -0.041505605, 0.61368394, 0.66777927]|
+------+--------------------------------------------------------------+
Chunk2Token
A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).
When the input is empty, an empty array is returned.
This annotator is especially convenient when using NGramGenerator annotations as inputs to WordEmbeddingsModels (see the short sketch after the parameter list below).
Parameters:
- inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
- outputCol: The name of the column in Document type that is generated. We can specify only one column here.
All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
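For instance, the TOKEN output of Chunk2Token can be passed to a word-embeddings stage alongside the sentence column. This is a minimal sketch, assuming the sentence and ngram_tokens columns produced in the examples below:
ngram_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "ngram_tokens"])\
    .setOutputCol("ngram_embeddings")  # embeddings computed over the chunk-based tokens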
Input Annotator Types: CHUNK
Output Annotator Type: TOKEN
Python API: Chunk2Token | Scala API: Chunk2Token | Notebook: Chunk2TokenNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
token = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
ngrammer = nlp.NGramGenerator() \
.setN(2) \
.setEnableCumulative(False) \
.setInputCols(["token"]) \
.setOutputCol("ngrams") \
.setDelimiter("_")
# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = medical.Chunk2Token()\
.setInputCols(["ngrams"])\
.setOutputCol("ngram_tokens")
trainingPipeline = nlp.Pipeline(stages=[
document,
sentenceDetector,
token,
ngrammer,
chunk2Token])
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
result = trainingPipeline.fit(data).transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, False)
+----------------------------------------------------------------+
|col |
+----------------------------------------------------------------+
|{token, 0, 12, A_63-year-old, {sentence -> 0, chunk -> 0}, []} |
|{token, 2, 16, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 14, 25, man_presents, {sentence -> 0, chunk -> 2}, []} |
|{token, 18, 28, presents_to, {sentence -> 0, chunk -> 3}, []} |
|{token, 27, 32, to_the, {sentence -> 0, chunk -> 4}, []} |
+----------------------------------------------------------------+
from johnsnowlabs import nlp, finance
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
token = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
ngrammer = nlp.NGramGenerator() \
.setN(2) \
.setEnableCumulative(False) \
.setInputCols(["token"]) \
.setOutputCol("ngrams")
# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = finance.Chunk2Token()\
.setInputCols(["ngrams"])\
.setOutputCol("ngram_tokens")
trainingPipeline = nlp.Pipeline(stages=[
document,
sentenceDetector,
token,
ngrammer,
chunk2Token])
data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")
result = trainingPipeline.fit(data).transform(data)
result.selectExpr("explode(ngram_tokens)").show(5, False)
+--------------------------------------------------------------------+
|col |
+--------------------------------------------------------------------+
|{token, 0, 14, Our competitors, {sentence -> 0, chunk -> 0}, []} |
|{token, 4, 22, competitors include, {sentence -> 0, chunk -> 1}, []}|
|{token, 16, 26, include the, {sentence -> 0, chunk -> 2}, []} |
|{token, 24, 36, the following, {sentence -> 0, chunk -> 3}, []} |
|{token, 28, 39, following by, {sentence -> 0, chunk -> 4}, []} |
+--------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
token = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
ngrammer = nlp.NGramGenerator() \
.setN(2) \
.setEnableCumulative(False) \
.setInputCols(["token"]) \
.setOutputCol("ngrams")
# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = legal.Chunk2Token()\
.setInputCols(["ngrams"])\
.setOutputCol("ngram_tokens")
trainingPipeline = nlp.Pipeline(stages=[
document,
sentenceDetector,
token,
ngrammer,
chunk2Token])
data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")
result = trainingPipeline.fit(data).transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, False)
+-----------------------------------------------------------------------+
|col |
+-----------------------------------------------------------------------+
|{token, 0, 6, This is, {sentence -> 0, chunk -> 0}, []} |
|{token, 5, 9, is an, {sentence -> 0, chunk -> 1}, []} |
|{token, 8, 22, an Intellectual, {sentence -> 0, chunk -> 2}, []} |
|{token, 11, 31, Intellectual Property, {sentence -> 0, chunk -> 3}, []}|
|{token, 24, 41, Property Agreement, {sentence -> 0, chunk -> 4}, []} |
+-----------------------------------------------------------------------+
import spark.implicits._
// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val token = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val ngrammer = new NGramGenerator()
.setN(2)
.setEnableCumulative(false)
.setInputCols("token")
.setOutputCol("ngrams")
.setDelimiter("_")
// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
.setInputCols("ngrams")
.setOutputCol("ngram_tokens")
val trainingPipeline = new Pipeline().setStages(Array(
document,
sentenceDetector,
token,
ngrammer,
chunk2Token))
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val result = trainingPipeline.fit(data).transform(data)
+----------------------------------------------------------------+
|col |
+----------------------------------------------------------------+
|{token, 3, 15, A_63-year-old, {sentence -> 0, chunk -> 0}, []} |
|{token, 5, 19, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 17, 28, man_presents, {sentence -> 0, chunk -> 2}, []} |
|{token, 21, 31, presents_to, {sentence -> 0, chunk -> 3}, []} |
|{token, 30, 35, to_the, {sentence -> 0, chunk -> 4}, []} |
+----------------------------------------------------------------+
import spark.implicits._
// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val token = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val ngrammer = new NGramGenerator()
.setN(2)
.setEnableCumulative(false)
.setInputCols("token")
.setOutputCol("ngrams")
// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
.setInputCols("ngrams")
.setOutputCol("ngram_tokens")
val trainingPipeline = new Pipeline().setStages(Array(
document,
sentenceDetector,
token,
ngrammer,
chunk2Token))
val data = Seq(("Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.")).toDF("text")
val result = trainingPipeline.fit(data).transform(data)
+--------------------------------------------------------------------+
|col |
+--------------------------------------------------------------------+
|{token, 0, 14, Our competitors, {sentence -> 0, chunk -> 0}, []} |
|{token, 4, 22, competitors include, {sentence -> 0, chunk -> 1}, []}|
|{token, 16, 26, include the, {sentence -> 0, chunk -> 2}, []} |
|{token, 24, 36, the following, {sentence -> 0, chunk -> 3}, []} |
|{token, 28, 39, following by, {sentence -> 0, chunk -> 4}, []} |
+--------------------------------------------------------------------+
import spark.implicits._
// Define a pipeline for generating n-grams
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val token = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val ngrammer = new NGramGenerator()
.setN(2)
.setEnableCumulative(false)
.setInputCols("token")
.setOutputCol("ngrams")
// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token()
.setInputCols("ngrams")
.setOutputCol("ngram_tokens")
val trainingPipeline = new Pipeline().setStages(Array(
document,
sentenceDetector,
token,
ngrammer,
chunk2Token))
val data = Seq(("This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc.")).toDF("text")
val result = trainingPipeline.fit(data).transform(data)
+-----------------------------------------------------------------------+
|col |
+-----------------------------------------------------------------------+
|{token, 0, 6, This is, {sentence -> 0, chunk -> 0}, []} |
|{token, 5, 9, is an, {sentence -> 0, chunk -> 1}, []} |
|{token, 8, 22, an Intellectual, {sentence -> 0, chunk -> 2}, []} |
|{token, 11, 31, Intellectual Property, {sentence -> 0, chunk -> 3}, []}|
|{token, 24, 41, Property Agreement, {sentence -> 0, chunk -> 4}, []} |
+-----------------------------------------------------------------------+
ChunkConverter
Converts chunks from RegexMatcher into chunks with an entity in the metadata.
This annotator is important when the user wants to merge entities identified by NER models together with rule-based matching used by the RegexMatcher annotator. In the following steps of the pipeline, all the identified entities can be treated in a unified field.
Parameters:
- inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
- outputCol: The name of the column in Document type that is generated. We can specify only one column here.
- resetSentenceIndices: Whether to reset sentence indices to treat the entire output as if it originates from a single document. Default: False.
All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
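A minimal sketch of the converter with the optional flag set; the setResetSentenceIndices setter name is assumed here from the camel-case convention above, so verify it against the API reference:
chunkConverter = medical.ChunkConverter()\
    .setInputCols(["regex_matches"])\
    .setOutputCol("regex_chunk")\
    .setResetSentenceIndices(True)  # assumed setter for resetSentenceIndices; illustrative only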
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: CHUNK
Python API: ChunkConverter | Scala API: ChunkConverter | Notebook: ChunkConverterNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Creating the pipeline
rules = '''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''
with open('regex_rules.txt', 'w') as f:
f.write(rules)
sample_text = """
POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy.
PROCEDURE: Excisional biopsy of right cervical lymph node.
ANESTHESIA: General endotracheal anesthesia.
Specimen: Right cervical lymph node.
EBL: 10 cc.
COMPLICATIONS: None.
FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination.
FLUIDS: Please see anesthesia report.
URINE OUTPUT: None recorded during the case.
INDICATIONS FOR PROCEDURE: This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient, an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node.
PROCEDURE IN DETAIL: The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again, noted on palpation there was an enlarged level 2 cervical lymph node.A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified, and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision, and sterile bandage was applied. The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed.
"""
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = medical.NerModel.pretrained("ner_clinical_large","en","clinical/models") \
.setInputCols("sentence","token","embeddings") \
.setOutputCol("ner")
ner_converter= medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
regex_matcher = nlp.RegexMatcher()\
.setInputCols('document')\
.setStrategy("MATCH_ALL")\
.setOutputCol("regex_matches")\
.setExternalRules(path='/content/regex_rules.txt', delimiter=',')
chunkConverter = medical.ChunkConverter()\
.setInputCols("regex_matches")\
.setOutputCol("regex_chunk")
merger= medical.ChunkMergeApproach()\
.setInputCols(["regex_chunk", "ner_chunk"])\
.setOutputCol("merged_chunks")\
.setMergeOverlapping(True)\
.setChunkPrecedence("field")
pipeline= nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
regex_matcher,
chunkConverter,
merger
])
data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result,
result.merged_chunks.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)
+----------------------------------------------+--------------+
| chunk| merged_entity|
+----------------------------------------------+--------------+
| POSTOPERATIVE DIAGNOSIS:|SECTION_HEADER|
| Cervical lymphadenopathy| PROBLEM|
| PROCEDURE:|SECTION_HEADER|
|Excisional biopsy of right cervical lymph node| TEST|
| ANESTHESIA:|SECTION_HEADER|
| General endotracheal anesthesia| TREATMENT|
| Right cervical lymph node| PROBLEM|
| EBL:|SECTION_HEADER|
| COMPLICATIONS:|SECTION_HEADER|
| FINDINGS:|SECTION_HEADER|
| Enlarged level 2 lymph node| PROBLEM|
| pathologic examination| TEST|
| FLUIDS:|SECTION_HEADER|
| URINE OUTPUT:|SECTION_HEADER|
| INDICATIONS FOR PROCEDURE:|SECTION_HEADER|
+----------------------------------------------+--------------+
from johnsnowlabs import nlp, finance
# Creating the pipeline
rules = '''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''
with open('regex_rules.txt', 'w') as f:
f.write(rules)
sample_text="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. """
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter= nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
# .setWhiteList(["ORG"]) # Return only ORG entities
regex_matcher = nlp.RegexMatcher()\
.setInputCols('document')\
.setStrategy("MATCH_ALL")\
.setOutputCol("regex_matches")\
.setExternalRules(path='/content/regex_rules.txt', delimiter=',')
chunkConverter = finance.ChunkConverter()\
.setInputCols("regex_matches")\
.setOutputCol("regex_chunk")
merger= finance.ChunkMergeApproach()\
.setInputCols(["regex_chunk", "ner_chunk"])\
.setOutputCol("merged_chunks")\
.setMergeOverlapping(True)\
.setChunkPrecedence("field")
pipeline= nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
regex_matcher,
chunkConverter,
merger
])
data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result,
result.merged_chunks.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)
+--------+-------------+
| chunk|merged_entity|
+--------+-------------+
|Group LP| ORG|
+--------+-------------+
from johnsnowlabs import nlp, legal
# Creating the pipeline
rules = '''
\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER
'''
with open('regex_rules.txt', 'w') as f:
f.write(rules)
sample_text="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. """
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter= nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
# .setWhiteList(["ORG"]) # Return only ORG entities
regex_matcher = nlp.RegexMatcher()\
.setInputCols('document')\
.setStrategy("MATCH_ALL")\
.setOutputCol("regex_matches")\
.setExternalRules(path='/content/regex_rules.txt', delimiter=',')
chunkConverter = legal.ChunkConverter()\
.setInputCols("regex_matches")\
.setOutputCol("regex_chunk")
merger= legal.ChunkMergeApproach()\
.setInputCols(["regex_chunk", "ner_chunk"])\
.setOutputCol("merged_chunks")\
.setMergeOverlapping(True)\
.setChunkPrecedence("field")
pipeline= nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
regex_matcher,
chunkConverter,
merger
])
data= spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Results
result.select(F.explode(F.arrays_zip(result.merged_chunks.result,
result.merged_chunks.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("merged_entity")).show(15, truncate=100)
+--------+-------------+
| chunk|merged_entity|
+--------+-------------+
|Group LP| ORG|
+--------+-------------+
// val rules = """\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER"""
// with open("regex_rules.txt","w") as f:
// f.write(rules)
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models")
.setInputCols("sentence","token","embeddings")
.setOutputCol("ner")
val ner_converter= new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val regex_matcher = new RegexMatcher()
.setInputCols("document")
.setStrategy("MATCH_ALL")
.setOutputCol("regex_matches")
.setExternalRules(path="/content/regex_rules.txt",delimiter=",")
val chunkConverter = new ChunkConverter()
.setInputCols("regex_matches")
.setOutputCol("regex_chunk")
val merger= new ChunkMergeApproach()
.setInputCols(Array("regex_chunk","ner_chunk"))
.setOutputCol("merged_chunks")
.setMergeOverlapping(true)
.setChunkPrecedence("field")
val pipeline= new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
regex_matcher,
chunkConverter,
merger ))
val data = Seq(("POSTOPERATIVE DIAGNOSIS: Cervical lymphadenopathy. PROCEDURE: Excisional biopsy of right cervical lymph node. ANESTHESIA: General endotracheal anesthesia. Specimen: Right cervical lymph node. EBL: 10 cc. COMPLICATIONS: None. FINDINGS: Enlarged level 2 lymph node was identified and removed and sent for pathologic examination. FLUIDS: Please see anesthesia report. URINE OUTPUT: None recorded during the case. INDICATIONS FOR PROCEDURE: This is a 43-year-old female with a several-year history of persistent cervical lymphadenopathy. She reports that it is painful to palpation on the right and has had multiple CT scans as well as an FNA which were all nondiagnostic. After risks and benefits of surgery were discussed with the patient,an informed consent was obtained. She was scheduled for an excisional biopsy of the right cervical lymph node. PROCEDURE IN DETAIL: The patient was taken to the operating room and placed in the supine position. She was anesthetized with general endotracheal anesthesia. The neck was then prepped and draped in the sterile fashion. Again,noted on palpation there was an enlarged level 2 cervical lymph node.A 3-cm horizontal incision was made over this lymph node. Dissection was carried down until the sternocleidomastoid muscle was identified. The enlarged lymph node that measured approximately 2 cm in diameter was identified and was removed and sent to Pathology for touch prep evaluation. The area was then explored for any other enlarged lymph nodes. None were identified,and hemostasis was achieved with electrocautery. A quarter-inch Penrose drain was placed in the wound.The wound was then irrigated and closed with 3-0 interrupted Vicryl sutures for a deep closure followed by a running 4-0 Prolene subcuticular suture. Mastisol and Steri-Strip were placed over the incision,and sterile bandage was applied. The patient tolerated this procedure well and was extubated without complications and transported to the recovery room in stable condition. She will return to the office tomorrow in followup to have the Penrose drain removed.")).toDF("text")
val result = pipeline.fit(data).transform(data)
+----------------------------------------------+--------------+
| chunk| merged_entity|
+----------------------------------------------+--------------+
| POSTOPERATIVE DIAGNOSIS:|SECTION_HEADER|
| Cervical lymphadenopathy| PROBLEM|
| PROCEDURE:|SECTION_HEADER|
|Excisional biopsy of right cervical lymph node| TEST|
| ANESTHESIA:|SECTION_HEADER|
| General endotracheal anesthesia| TREATMENT|
| Right cervical lymph node| PROBLEM|
| EBL:|SECTION_HEADER|
| COMPLICATIONS:|SECTION_HEADER|
| FINDINGS:|SECTION_HEADER|
| Enlarged level 2 lymph node| PROBLEM|
| pathologic examination| TEST|
| FLUIDS:|SECTION_HEADER|
| URINE OUTPUT:|SECTION_HEADER|
| INDICATIONS FOR PROCEDURE:|SECTION_HEADER|
+----------------------------------------------+--------------+
// val rules = """\b[A-Z]+(\s+[A-Z]+)*:\b, SECTION_HEADER"""
// with open("regex_rules.txt","w") as f:
// f.write(rules)
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter= new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
// .setWhiteList(Array("ORG"))
// Return only ORG entities
val regex_matcher = new RegexMatcher()
.setInputCols("document")
.setStrategy("MATCH_ALL")
.setOutputCol("regex_matches")
.setExternalRules(path="/content/regex_rules.txt",delimiter=",")
val chunkConverter = new ChunkConverter()
.setInputCols("regex_matches")
.setOutputCol("regex_chunk")
val merger= new ChunkMergeApproach()
.setInputCols(Array("regex_chunk","ner_chunk"))
.setOutputCol("merged_chunks")
.setMergeOverlapping(true)
.setChunkPrecedence("field")
val pipeline= new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
regex_matcher,
chunkConverter,
merger ))
val data = Seq(("AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price.")).toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+
| chunk|merged_entity|
+--------+-------------+
|Group LP| ORG|
+--------+-------------+
// val rules = """[A-Z]+[\s+[A-Z]+]*,SECTION_HEADER """
// with open("regex_rules.txt","w") as f:
// f.write(rules)
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = LegalNerModel.pretrained("legner_org_per_role_date", "en", "legal/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter= new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
// .setWhiteList(Array("ORG"))
// Return only ORG entities
val regex_matcher = new RegexMatcher()
.setInputCols("document")
.setStrategy("MATCH_ALL")
.setOutputCol("regex_matches")
.setExternalRules(path="/content/regex_rules.txt",delimiter=",")
val chunkConverter = new ChunkConverter()
.setInputCols("regex_matches")
.setOutputCol("regex_chunk")
val merger= new ChunkMergeApproach()
.setInputCols(Array("regex_chunk","ner_chunk"))
.setOutputCol("merged_chunks")
.setMergeOverlapping(true)
.setChunkPrecedence("field")
val pipeline= new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
regex_matcher,
chunkConverter,
merger ))
val data = Seq(("AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price.")).toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+
| chunk|merged_entity|
+--------+-------------+
|Group LP| ORG|
+--------+-------------+
ChunkEntityResolver
The ChunkEntityResolverModel encompasses the functionality to produce a normalized entity from a specialized ontology or curated dataset (such as ICD-10, RxNorm, SNOMED, etc.). This model includes comprehensive parameters and methods essential for its training. It operates by transforming a dataset that incorporates two Input Annotations: TOKEN and WORD_EMBEDDINGS, sourced from tools like ChunkTokenizer and ChunkEmbeddings Annotators. Ultimately, it generates the normalized entity relevant to the specified trained ontology or curated dataset, ensuring accurate entity resolution within the given context.
For available pretrained models please see the Models Hub.
Important Note
: This annotator has been deprecated.
Input Annotator Types: TOKEN, WORD_EMBEDDINGS
Output Annotator Type: ENTITY
Scala API: ChunkEntityResolverModel |
Show Example
from johnsnowlabs import nlp, medical
# Using pretrained models for SNOMED
# First the prior steps of the pipeline are defined.
# Output of types TOKEN and WORD_EMBEDDINGS are needed.
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
docAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("word_embeddings")
icdo_ner = medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models") \
.setInputCols(["sentence", "token", "word_embeddings"]) \
.setOutputCol("icdo_ner")
icdo_chunk = nlp.NerConverter()\
.setInputCols(["sentence","token","icdo_ner"])\
.setOutputCol("icdo_chunk")\
.setWhiteList(["Cancer"])
icdo_chunk_embeddings = nlp.ChunkEmbeddings() \
.setInputCols(["icdo_chunk", "word_embeddings"]) \
.setOutputCol("icdo_chunk_embeddings")
icdo_chunk_resolver = medical.ChunkEntityResolverModel\
.pretrained("chunkresolve_icdo_clinical", "en", "clinical/models") \
.setInputCols(["token","icdo_chunk_embeddings"]) \
.setOutputCol("tm_icdo_code")
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "word_embeddings"]) \
.setOutputCol("ner")
ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
ner_chunk_tokenizer = nlp.ChunkTokenizer() \
.setInputCols(["ner_chunk"]) \
.setOutputCol("ner_token")
ner_chunk_embeddings = nlp.ChunkEmbeddings() \
.setInputCols(["ner_chunk", "word_embeddings"]) \
.setOutputCol("ner_chunk_embeddings")
# Definition of the SNOMED Resolution
ner_snomed_resolver = medical.ChunkEntityResolverModel\
.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models") \
.setInputCols(["ner_token","ner_chunk_embeddings"])\
.setOutputCol("snomed_result")
pipelineFull = nlp.Pipeline().setStages([
docAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
ner_chunk_embeddings,
ner_chunk_tokenizer,
ner_snomed_resolver,
icdo_ner,
icdo_chunk,
icdo_chunk_embeddings,
icdo_chunk_resolver
])
pipelineModelFull = pipelineFull.fit(data)
result = pipelineModelFull.transform(data).cache()
# Show results
result.selectExpr("explode(snomed_result)")
.selectExpr(
"col.metadata.target_text",
"col.metadata.resolved_text",
"col.metadata.confidence",
"col.metadata.all_k_results",
"col.metadata.all_k_resolutions")
.filter($"confidence" > 0.2).show(5)
+--------------------+--------------------+----------+--------------------+--------------------+
| target_text| resolved_text|confidence| all_k_results| all_k_resolutions|
+--------------------+--------------------+----------+--------------------+--------------------+
|hypercholesterolemia|Hypercholesterolemia| 0.2524|13644009:::267432...|Hypercholesterole...|
| CBC| Neocyte| 0.4980|259680000:::11573...|Neocyte:::Blood g...|
| CD38| Hypoviscosity| 0.2560|47872005:::370970...|Hypoviscosity:::E...|
| platelets| Increased platelets| 0.5267|6631009:::2596800...|Increased platele...|
| CD38| Hypoviscosity| 0.2560|47872005:::370970...|Hypoviscosity:::E...|
+--------------------+--------------------+----------+--------------------+--------------------+
import spark.implicits._
// Using pretrained models for SNOMED
// First the prior steps of the pipeline are defined.
// Output of types TOKEN and WORD_EMBEDDINGS are needed.
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val docAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("word_embeddings")
val icdo_ner = MedicalNerModel.pretrained("ner_bionlp", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "word_embeddings"))
.setOutputCol("icdo_ner")
val icdo_chunk = new NerConverter()
.setInputCols(Array("sentence","token","icdo_ner"))
.setOutputCol("icdo_chunk")
.setWhiteList("Cancer")
val icdo_chunk_embeddings = new ChunkEmbeddings()
.setInputCols(Array("icdo_chunk", "word_embeddings"))
.setOutputCol("icdo_chunk_embeddings")
val icdo_chunk_resolver = ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical", "en", "clinical/models")
.setInputCols(Array("token","icdo_chunk_embeddings"))
.setOutputCol("tm_icdo_code")
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val ner_chunk_tokenizer = new ChunkTokenizer()
.setInputCols("ner_chunk")
.setOutputCol("ner_token")
val ner_chunk_embeddings = new ChunkEmbeddings()
.setInputCols(Array("ner_chunk", "word_embeddings"))
.setOutputCol("ner_chunk_embeddings")
// Definition of the SNOMED Resolution
val ner_snomed_resolver = ChunkEntityResolverModel
  .pretrained("chunkresolve_snomed_findings_clinical", "en", "clinical/models")
  .setInputCols(Array("ner_token", "ner_chunk_embeddings"))
  .setOutputCol("snomed_result")
val pipelineFull = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
ner_chunk_embeddings,
ner_chunk_tokenizer,
ner_snomed_resolver,
icdo_ner,
icdo_chunk,
icdo_chunk_embeddings,
icdo_chunk_resolver
))
val pipelineModelFull = pipelineFull.fit(data)
val result = pipelineModelFull.transform(data).cache()
// Show results
//
// result.selectExpr("explode(snomed_result)")
// .selectExpr(
// "col.metadata.target_text",
// "col.metadata.resolved_text",
// "col.metadata.confidence",
// "col.metadata.all_k_results",
// "col.metadata.all_k_resolutions")
// .filter($"confidence" > 0.2).show(5)
// +--------------------+--------------------+----------+--------------------+--------------------+
// | target_text| resolved_text|confidence| all_k_results| all_k_resolutions|
// +--------------------+--------------------+----------+--------------------+--------------------+
// |hypercholesterolemia|Hypercholesterolemia| 0.2524|13644009:::267432...|Hypercholesterole...|
// | CBC| Neocyte| 0.4980|259680000:::11573...|Neocyte:::Blood g...|
// | CD38| Hypoviscosity| 0.2560|47872005:::370970...|Hypoviscosity:::E...|
// | platelets| Increased platelets| 0.5267|6631009:::2596800...|Increased platele...|
// | CD38| Hypoviscosity| 0.2560|47872005:::370970...|Hypoviscosity:::E...|
// +--------------------+--------------------+----------+--------------------+--------------------+
//
Contains all the parameters and methods to train a ChunkEntityResolverModel. It transforms a dataset with two input annotations of types TOKEN and WORD_EMBEDDINGS, coming e.g. from the ChunkTokenizer and ChunkEmbeddings annotators, and returns the normalized entity for a particular trained ontology / curated dataset (e.g. ICD-10, RxNorm, SNOMED etc.).
To use pretrained models please use ChunkEntityResolverModel and see the Models Hub for available models.
Input Annotator Types: TOKEN, WORD_EMBEDDINGS
Output Annotator Type: ENTITY
Scala API: ChunkEntityResolverApproach |
Show Example
from johnsnowlabs import nlp, medical
# Training a SNOMED model
# Define pre-processing pipeline for training data. The training data needs to consist of a column with the normalized text and a column with the labels.
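# For illustration only: a minimal, hypothetical training DataFrame with the
# expected columns ("normalized_text" and "label"). Real training data would
# come from a curated terminology (e.g. SNOMED) with one row per normalized
# term and its code as the label; the code below is taken from the resolver
# output shown earlier.
data = spark.createDataFrame(
    [("hypercholesterolemia", "13644009")],
    ["normalized_text", "label"])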
document = nlp.DocumentAssembler() \
.setInputCol("normalized_text") \
.setOutputCol("document")
chunk = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setOutputCol("chunk")
token = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel\
.pretrained("embeddings_healthcare_100d", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
chunkEmb = nlp.ChunkEmbeddings() \
.setInputCols(["chunk", "embeddings"]) \
.setOutputCol("chunk_embeddings")
snomedTrainingPipeline = nlp.Pipeline().setStages([
document,
chunk,
token,
embeddings,
chunkEmb
])
snomedTrainingModel = snomedTrainingPipeline.fit(data)
snomedData = snomedTrainingModel.transform(data).cache()
# Then the Resolver can be trained with
snomedExtractor = medical.ChunkEntityResolverApproach() \
.setInputCols(["token", "chunk_embeddings"]) \
.setOutputCol("recognized") \
.setNeighbours(1000) \
.setAlternatives(25) \
.setNormalizedCol("normalized_text") \
.setLabelCol("label") \
.setEnableWmd(True).setEnableTfidf(True).setEnableJaccard(True) \
.setEnableSorensenDice(True).setEnableJaroWinkler(True).setEnableLevenshtein(True) \
.setDistanceWeights([1, 2, 2, 1, 1, 1]) \
.setAllDistancesMetadata(True) \
.setPoolingStrategy("MAX") \
.setThreshold(1e32)
model = snomedExtractor.fit(snomedData)
import spark.implicits._
// Training a SNOMED model
// Define pre-processing pipeline for training data. The training data needs to consist of a column with the normalized text and a column with the labels.
val document = new DocumentAssembler()
.setInputCol("normalized_text")
.setOutputCol("document")
val chunk = new Doc2Chunk()
.setInputCols("document")
.setOutputCol("chunk")
val token = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_healthcare_100d", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val chunkEmb = new ChunkEmbeddings()
.setInputCols(Array("chunk", "embeddings"))
.setOutputCol("chunk_embeddings")
val snomedTrainingPipeline = new Pipeline().setStages(Array(
document,
chunk,
token,
embeddings,
chunkEmb
))
val snomedTrainingModel = snomedTrainingPipeline.fit(data)
val snomedData = snomedTrainingModel.transform(data).cache()
// Then the Resolver can be trained with
val snomedExtractor = new ChunkEntityResolverApproach()
.setInputCols(Array("token", "chunk_embeddings"))
.setOutputCol("recognized")
.setNeighbours(1000)
.setAlternatives(25)
.setNormalizedCol("normalized_text")
.setLabelCol("label")
.setEnableWmd(true).setEnableTfidf(true).setEnableJaccard(true)
.setEnableSorensenDice(true).setEnableJaroWinkler(true).setEnableLevenshtein(true)
.setDistanceWeights(Array(1, 2, 2, 1, 1, 1))
.setAllDistancesMetadata(true)
.setPoolingStrategy("MAX")
.setThreshold(1e32)
val model = snomedExtractor.fit(snomedData)
ChunkFilterer
Filters entities coming from CHUNK annotations. Filters can be set via a white list of terms or a regular expression.
White list criteria is enabled by default. To use regex, `criteria` has to be set to `regex`.
Parameters:
- `inputCols`: The name of the columns containing the input annotations. It can read either a String column or an Array.
- `outputCol`: The name of the column in Document type that is generated. We can specify only one column here.
- `criteria`: Tag representing the criteria used to filter the chunks. Possible values are: `isin` (filter by the chunk), `regex` (filter using a regex).
- `whiteList`: If defined, list of entities to process. The rest will be ignored.
- `blackList`: If defined, list of entities to ignore. The rest will be processed.
- `regex`: If defined, list of regex to process the chunks (Default: []).
- `filterEntity`: If equal to "entity", use the NER label to filter. If set to "result", use the result attribute of the annotation to filter.
- `entitiesConfidence`: Path to a CSV with pairs (entity, confidenceThreshold). Filters out the chunks with entities whose confidence is lower than the confidence threshold.
All the parameters can be set using the corresponding set method in camel case. For example, `.setInputCols()`.
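The examples below use the `isin` criteria together with a white list. As a minimal sketch (assuming the same sentence and chunk columns as in the examples, and assuming the regex parameter follows the camel-case setter convention noted above), the regex criteria could be configured as follows:
chunkerFilterRegex = medical.ChunkFilterer() \
    .setInputCols(["sentence", "chunk"]) \
    .setOutputCol("filtered") \
    .setCriteria("regex") \
    .setRegex(["gastro.*"])  # keep only chunks matching this pattern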
Input Annotator Types: DOCUMENT,CHUNK
Output Annotator Type: CHUNK
Python API: ChunkFilterer | Scala API: ChunkFilterer | Notebook: ChunkFiltererNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Filtering POS tags
# First pipeline stages to extract the POS tags are defined
docAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
posTagger = nlp.PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
chunker = nlp.Chunker() \
.setInputCols(["pos", "sentence"]) \
.setOutputCol("chunk") \
.setRegexParsers(["(<NN>)+"])
# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = medical.ChunkFilterer() \
.setInputCols(["sentence","chunk"]) \
.setOutputCol("filtered") \
.setCriteria("isin") \
.setWhiteList(["gastroenteritis"])
pipeline = nlp.Pipeline(stages=[
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter])
data = spark.createDataFrame([["Has a past history of gastroenteritis and stomach pain, however patient ..."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+---------------------------------------------------------------------------------+
|col |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []} |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []} |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []} |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []} |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []} |
+---------------------------------------------------------------------------------+
result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------------------+
|col |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []} |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
from johnsnowlabs import nlp, finance
# Filtering POS tags
# First pipeline stages to extract the POS tags are defined
docAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
posTagger = nlp.PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
chunker = nlp.Chunker() \
.setInputCols(["pos", "sentence"]) \
.setOutputCol("chunk") \
.setRegexParsers(["(<NN>)+"])
# Then the chunks can be filtered via a white list. Here only terms with "rate" remain.
chunkerFilter = finance.ChunkFilterer() \
.setInputCols(["sentence","chunk"]) \
.setOutputCol("filtered") \
.setCriteria("isin") \
.setWhiteList(["rate"])
pipeline = nlp.Pipeline(stages=[
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter])
data = spark.createDataFrame([["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+-------------------------------------------------------+
|col |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------+
|col |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
from johnsnowlabs import nlp, legal
# Filtering POS tags
# First pipeline stages to extract the POS tags are defined
docAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
posTagger = nlp.PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
chunker = nlp.Chunker() \
.setInputCols(["pos", "sentence"]) \
.setOutputCol("chunk") \
.setRegexParsers(["(<NN>)+"])
# Then the chunks can be filtered via a white list. Here only terms with "rate" remain.
chunkerFilter = legal.ChunkFilterer() \
.setInputCols(["sentence","chunk"]) \
.setOutputCol("filtered") \
.setCriteria("isin") \
.setWhiteList(["rate"])
pipeline = nlp.Pipeline(stages=[
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter])
data = spark.createDataFrame([["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+-------------------------------------------------------+
|col |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------+
|col |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
// Filtering POS tags
// First pipeline stages to extract the POS tags are defined
import spark.implicits._
val docAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained()
.setInputCols(Array("sentence","token"))
.setOutputCol("pos")
val chunker = new Chunker()
.setInputCols(Array("pos","sentence"))
.setOutputCol("chunk")
.setRegexParsers(Array("(<NN>)+"))
val chunkerFilter = new ChunkFilterer()
.setInputCols(Array("sentence","chunk"))
.setOutputCol("filtered")
.setCriteria("isin")
.setWhiteList(Array("gastroenteritis"))
val pipeline = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter))
val text ="""Has a past history of gastroenteritis and stomach pain, however patient ..."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
// result.selectExpr("explode(chunk)").show(truncate=false)
+---------------------------------------------------------------------------------+
|col |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []} |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []} |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []} |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []} |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []} |
+---------------------------------------------------------------------------------+
// result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------------------+
|col |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []} |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
import spark.implicits._
val docAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained()
.setInputCols(Array("sentence","token"))
.setOutputCol("pos")
val chunker = new Chunker()
.setInputCols(Array("pos","sentence"))
.setOutputCol("chunk")
.setRegexParsers(Array("(<NN>)+"))
val chunkerFilter = new ChunkFilterer()
.setInputCols(Array("sentence","chunk"))
.setOutputCol("filtered")
.setCriteria("isin")
.setWhiteList(Array("rate"))
val pipeline = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter))
val text ="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
// result.selectExpr("explode(chunk)").show(truncate=false)
+-------------------------------------------------------+
|col |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
// result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------+
|col |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
import spark.implicits._
val docAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val posTagger = PerceptronModel.pretrained()
.setInputCols(Array("sentence","token"))
.setOutputCol("pos")
val chunker = new Chunker()
.setInputCols(Array("pos","sentence"))
.setOutputCol("chunk")
.setRegexParsers(Array("(<NN>)+"))
val chunkerFilter = new ChunkFilterer()
.setInputCols(Array("sentence","chunk"))
.setOutputCol("filtered")
.setCriteria("isin")
.setWhiteList(Array("rate"))
val pipeline = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter))
val text ="""AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
// result.selectExpr("explode(chunk)").show(truncate=false)
+-------------------------------------------------------+
|col |
+-------------------------------------------------------+
|{chunk, 73, 77, basis, {sentence -> 0, chunk -> 0}, []}|
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
// result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------+
|col |
+-------------------------------------------------------+
|{chunk, 92, 95, rate, {sentence -> 0, chunk -> 1}, []} |
+-------------------------------------------------------+
ChunkKeyPhraseExtraction
Chunk KeyPhrase Extraction uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text. The input to the model consists of chunk annotations and sentence or document annotations. The model compares the chunks against the corresponding sentences/documents and selects the chunks which are most representative of the broader text context (i.e. the document or the sentence they belong to). The key phrase candidates (i.e. the input chunks) can be generated in various ways, e.g. by NGramGenerator, TextMatcher or NerConverter. The model operates either at sentence level (selecting the most descriptive chunks from the sentence they belong to) or at document level. In the latter case, the key phrases are selected to represent all the input document annotations.
Parameters:
- `setConcatenateSentences(value: Boolean)`: Concatenate the input sentence/document annotations before computing their embedding. Default value is `true`.
- `setDivergence(value: Float)`: Set the level of divergence of the extracted key phrases.
- `setDocumentLevelProcessing(value: Boolean)`: Extract key phrases from the whole document (`true`) or from the particular sentences which the chunks refer to (`false`). Default value is `true`.
- `setDropPunctuation(value: Boolean)`: Remove punctuation marks from input chunks.
- `setSelectMostDifferent(value: Boolean)`: Let the model return the top N key phrases which are the most different from each other.
- `setTopN(value: Int)`: Set the number of key phrases to extract.
This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at the Models Hub.
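The chunk candidates do not have to come from an NER model. As a minimal sketch (not part of the original example, reusing the document and tokens columns from the medical example below), n-gram candidates could be generated with NGramGenerator and passed to the extractor instead:
ngram_generator = nlp.NGramGenerator() \
    .setInputCols(["tokens"]) \
    .setOutputCol("ngram_chunks") \
    .setN(3) \
    .setEnableCumulative(True)
key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained() \
    .setTopN(3) \
    .setDocumentLevelProcessing(True) \
    .setInputCols(["document", "ngram_chunks"]) \
    .setOutputCol("key_phrases")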
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: CHUNK
Python API: ChunkKeyPhraseExtraction | Scala API: ChunkKeyPhraseExtraction | Notebook: ChunkKeyPhraseExtractionNotebook |
Show Example
from johnsnowlabs import nlp, medical
documenter = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentencer = nlp.SentenceDetector() \
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("tokens") \
embeddings = nlp.WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = medical.NerModel().pretrained("ner_jsl_slim", "en", "clinical/models") \
.setInputCols(["sentences", "tokens", "embeddings"]) \
.setOutputCol("ner_tags")
ner_converter = nlp.NerConverter()\
.setInputCols("sentences", "tokens", "ner_tags")\
.setOutputCol("ner_chunks")
key_phrase_extractor = medical.ChunkKeyPhraseExtraction.pretrained()\
.setTopN(1)\
.setDocumentLevelProcessing(False)\
.setDivergence(0.4)\
.setInputCols(["sentences", "ner_chunks"])\
.setOutputCol("ner_chunk_key_phrases")
pipeline = nlp.Pipeline(stages=[
documenter,
sentencer,
tokenizer,
embeddings,
ner_tagger,
ner_converter,
key_phrase_extractor])
data = spark.createDataFrame([["Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."]]).toDF("text")
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
.selectExpr("key_phrase.result",
"key_phrase.metadata.entity",
"key_phrase.metadata.DocumentSimilarity",
"key_phrase.metadata.MMRScore").show(truncate=False)
+--------+-------------------------+------------------+-----------------+
|result |entity |DocumentSimilarity|MMRScore |
+--------+-------------------------+------------------+-----------------+
|Diabetes|Disease_Syndrome_Disorder|0.66827321499841 |0.400963944931921|
+--------+-------------------------+------------------+-----------------+
from johnsnowlabs import nlp, finance
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
key_phrase_extractor = finance.ChunkKeyPhraseExtraction\
.pretrained()\
.setTopN(1)\
.setDocumentLevelProcessing(False)\
.setDivergence(0.4)\
.setInputCols(["sentence", "ner_chunk"])\
.setOutputCol("ner_chunk_key_phrases")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter,
key_phrase_extractor])
text = ["""In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)")."""]
data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
result.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
.selectExpr("key_phrase.result",
"key_phrase.metadata.entity",
"key_phrase.metadata.DocumentSimilarity",
"key_phrase.metadata.MMRScore").show(truncate=False)
+------------------------------+-------+------------------+-------------------+
|result |entity |DocumentSimilarity|MMRScore |
+------------------------------+-------+------------------+-------------------+
|Policy Compliance |PRODUCT|0.6446724461374882|0.38680348305268175|
|Spell Security Private Limited|ORG |0.6282153013401193|0.3769291957818915 |
+------------------------------+-------+------------------+-------------------+
from johnsnowlabs import nlp, legal
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias","en","legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
key_phrase_extractor = legal.ChunkKeyPhraseExtraction\
.pretrained()\
.setTopN(1)\
.setDocumentLevelProcessing(False)\
.setDivergence(0.4)\
.setInputCols(["sentence", "ner_chunk"])\
.setOutputCol("ner_chunk_key_phrases")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter,
key_phrase_extractor])
text = ["""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
"""]
data = spark.createDataFrame([text]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
result.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
.selectExpr("key_phrase.result",
"key_phrase.metadata.entity",
"key_phrase.metadata.DocumentSimilarity",
"key_phrase.metadata.MMRScore").show(truncate=False)
+--------------+------+------------------+-------------------+
|result |entity|DocumentSimilarity|MMRScore |
+--------------+------+------------------+-------------------+
|Buyer Entities|ALIAS |0.5680936022739617|0.34085617490878395|
+--------------+------+------------------+-------------------+
import spark.implicits._
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("tokens")
val embeddings = WordEmbeddingsModel
.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("document","tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel.pretrained("ner_jsl_slim","en","clinical/models")
.setInputCols(Array("sentences","tokens","embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverter()
.setInputCols("sentences","tokens","ner_tags")
.setOutputCol("ner_chunks")
val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()
.setTopN(1)
.setDocumentLevelProcessing(false)
.setDivergence(0.4)
.setInputCols(Array("sentences","ner_chunks"))
.setOutputCol("ner_chunk_key_phrases")
val pipeline = new Pipeline().setStages(Array(
documenter,
sentencer,
tokenizer,
embeddings,
ner_tagger,
ner_converter,
key_phrase_extractor))
val text ="""Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."""
val data = Seq(text).toDF("text")
val results = pipeline.fit(data).transform(data)
+--------+-------------------------+------------------+-----------------+
|result |entity |DocumentSimilarity|MMRScore |
+--------+-------------------------+------------------+-----------------+
|Diabetes|Disease_Syndrome_Disorder|0.66827321499841 |0.400963944931921|
+--------+-------------------------+------------------+-----------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols(Array("document") )
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()
.setTopN(1)
.setDocumentLevelProcessing(false)
.setDivergence(0.4)
.setInputCols(Array("sentence","ner_chunk"))
.setOutputCol("ner_chunk_key_phrases")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter,
key_phrase_extractor))
val text ="""In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)."""
val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
+------------------------------+-------+------------------+-------------------+
|result |entity |DocumentSimilarity|MMRScore |
+------------------------------+-------+------------------+-------------------+
|Policy Compliance |PRODUCT|0.6446724461374882|0.38680348305268175|
|Spell Security Private Limited|ORG |0.6282153013401193|0.3769291957818915 |
+------------------------------+-------+------------------+-------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols(Array("document") )
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = LegalNerModel.pretrained("legner_orgs_prods_alias","en","legal/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val key_phrase_extractor = ChunkKeyPhraseExtraction.pretrained()
.setTopN(1)
.setDocumentLevelProcessing(false)
.setDivergence(0.4)
.setInputCols(Array("sentence","ner_chunk"))
.setOutputCol("ner_chunk_key_phrases")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter,
key_phrase_extractor))
val text ="""This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""
val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
+--------------+------+------------------+-------------------+
|result |entity|DocumentSimilarity|MMRScore |
+--------------+------+------------------+-------------------+
|Buyer Entities|ALIAS |0.5680936022739617|0.34085617490878395|
+--------------+------+------------------+-------------------+
ChunkMapper
We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.
This is the AnnotatorModel of the ChunkMapper, which can be used to access pretrained models with the `.pretrained()` or `.load()` methods. To train a new model, check the documentation of the ChunkMapperApproach annotator.
The annotator also allows using fuzzy matching, which can take into consideration parts of the tokens that can map even when the word order is different, char ngrams that can map even when there are typos, and fuzzy distance metrics (Jaccard, Levenshtein, etc.).
Parameters:
- `setRels` (List[str]): Relations that we are going to use to map the chunk.
- `setLowerCase` (Boolean): Set if we want to map the chunks in lower case or not (Default: True).
- `setAllowMultiTokenChunk` (Boolean): Whether to skip relations with multitokens (Default: True).
- `setMultivaluesRelations` (Boolean): Whether to return all values of a relation together or separately (Default: False).
Example usage and more details can be found on Spark NLP Workshop repository accessible in GitHub, for example the notebook Healthcare Chunk Mapping.
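After transforming, the mapper output can be inspected like any other annotation column. A minimal sketch, assuming an output column named mappings as in the finance and legal examples below:
result.selectExpr("explode(mappings) AS mapping") \
    .selectExpr("mapping.result").show(truncate=False)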
Input Annotator Types: CHUNK
Output Annotator Type: LABEL_DEPENDENCY
Python API: ChunkMapperModel | Scala API: ChunkMapperModel | Notebook: ChunkMapperModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
documenter = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = nlp.WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
ner_tagger = medical.NerModel()\
.pretrained("ner_posology", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_converter = medical.NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")\
.setWhiteList(["DRUG"])
chunkToDoc = nlp.Chunk2Doc()\
.setInputCols("ner_chunks")\
.setOutputCol("ner_chunks_doc")
sbert_embedder = nlp.BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli", "en","clinical/models")\
.setInputCols(["ner_chunks_doc"])\
.setOutputCol("sbert_embeddings")\
.setCaseSensitive(False)
rxnorm_resolver = medical.SentenceEntityResolverModel\
.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\
.setInputCols(["sbert_embeddings"])\
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
resolver2chunk = medical.Resolution2Chunk()\
.setInputCols(["rxnorm_code"])\
.setOutputCol("rxnorm_chunk")
chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper", "en", "clinical/models")\
.setInputCols(["rxnorm_chunk"])\
.setOutputCol("rxnorm_drug_brandname_mapper")\
.setRels(["rxnorm_brandname"])
pipeline = nlp.Pipeline(
stages = [
documenter,
sentencer,
tokenizer,
words_embedder,
ner_tagger,
ner_converter,
chunkToDoc,
sbert_embedder,
rxnorm_resolver,
resolver2chunk,
chunkerMapper
])
data = spark.createDataFrame([["The doctor prescribed Sinequan 150 MG for depression and Zonalon 50 mg for managing skin itching"]]).toDF("text")
result= pipeline.fit(data).transform(data)
from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.ner_chunks.result,
                                     result.rxnorm_code.result,
                                     result.rxnorm_drug_brandname_mapper.result)).alias("cols"))\
    .select(F.expr("cols['0']").alias("ner_chunks"),
            F.expr("cols['1']").alias("rxnorm_code"),
            F.expr("cols['2']").alias("rxnorm_drug_brandname_mapper")).show(15, truncate=100)
+----------+-----------+----------------------------+
|ner_chunks|rxnorm_code|rxnorm_drug_brandname_mapper|
+----------+-----------+----------------------------+
| Sinequan| 224915| Sinequan (Sinequan)|
| Zonalon| 9801| Zonalon (Zonalon)|
+----------+-----------+----------------------------+
from johnsnowlabs import nlp, finance
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_ticker", "en", "finance/models")\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
CM = finance.ChunkMapperModel.pretrained('finmapper_nasdaq_ticker_stock_screener', 'en', 'finance/models')\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")
pipeline = nlp.Pipeline().setStages([
document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter,
CM])
text = ["""There are some serious purchases and sales of AMZN stock today."""]
data = spark.createDataFrame([text]).toDF("text")
result = pipeline.fit(data).transform(data)
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result|result |
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[AMZN]|[AMZN, Amazon.com Inc. Common Stock, $98.12, 2.85, 2.991%, 9.98556270184E11, United States, 1997, 85412563, Consumer Discretionary, Catalog/Specialty Distribution]|
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained('glove_100d') \
.setInputCols(['document', 'token']) \
.setOutputCol('embeddings')
ner_model = nlp.NerDLModel.pretrained("onto_100", "en") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["CARDINAL"])
CM = legal.ChunkMapperModel().pretrained("legmapper_edgar_irs", "en", "legal/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")
pipeline = nlp.Pipeline().setStages([
document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter,
CM])
text = ["""873474341 is an American multinational corporation that is engaged in the design, development, manufacturing, and worldwide marketing and sales of footwear, apparel, equipment, accessories, and services"""]
data = spark.createDataFrame([text]).toDF("text")
result= pipeline.fit(data).transform(data)
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |result |
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[873474341]|[Masterworks 096, LLC, RETAIL-RETAIL STORES, NEC [5990], 5990, 873474341, 1231, NY, DE, 225 LIBERTY STREET, NEW YORK, NY, 10281, 2035185172, , , 2022-01-10, 1894064]|
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel
.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentences","tokens"))
.setOutputCol("embeddings")
val ner_tagger = MedicalNerModel
.pretrained("ner_posology","en","clinical/models")
.setInputCols(Array("sentences","tokens","embeddings"))
.setOutputCol("ner_tags")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentences","tokens","ner_tags"))
.setOutputCol("ner_chunks")
.setWhiteList("DRUG")
val chunkToDoc = new Chunk2Doc()
.setInputCols("ner_chunks")
.setOutputCol("ner_chunks_doc")
val sbert_embedder = BertSentenceEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("ner_chunks_doc")
.setOutputCol("sbert_embeddings")
.setCaseSensitive(false)
val rxnorm_resolver = SentenceEntityResolverModel
.pretrained("sbiobertresolve_rxnorm_augmented","en","clinical/models")
.setInputCols("sbert_embeddings")
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")
val resolver2chunk = new Resolution2Chunk()
.setInputCols("rxnorm_code")
.setOutputCol("rxnorm_chunk")
val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_drug_brandname_mapper","en","clinical/models")
.setInputCols("rxnorm_chunk")
.setOutputCol("rxnorm_drug_brandname_mapper")
.setRels(Array("rxnorm_brandname"))
val pipeline = new Pipeline().setStages(Array(
documenter,
sentencer,
tokenizer,
words_embedder,
ner_tagger,
ner_converter,
chunkToDoc,
sbert_embedder,
rxnorm_resolver,
resolver2chunk,
chunkerMapper ))
val text ="""The doctor prescribed Sinequan 150 MG for depression and Zonalon 50 mg for managing skin itching"""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+----------+-----------+----------------------------+
|ner_chunks|rxnorm_code|rxnorm_drug_brandname_mapper|
+----------+-----------+----------------------------+
| Sinequan| 224915| Sinequan (Sinequan)|
| Zonalon| 9801| Zonalon (Zonalon)|
+----------+-----------+----------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("document","token"))
.setOutputCol("embeddings")
val ner_model = FinanceNerModel.pretrained("finner_ticker","en","finance/models")
.setInputCols(Array("document","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")
val CM = ChunkMapperModel.pretrained("finmapper_nasdaq_ticker_stock_screener","en","finance/models")
.setInputCols("ner_chunk")
.setOutputCol("mappings")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter,
CM) )
val text ="""There are some serious purchases and sales of AMZN stock today."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result|result |
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[AMZN]|[AMZN, Amazon.com Inc. Common Stock, $98.12, 2.85, 2.991%, 9.98556270184E11, United States, 1997, 85412563, Consumer Discretionary, Catalog/Specialty Distribution]|
+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("glove_100d")
.setInputCols(Array("document","token"))
.setOutputCol("embeddings")
val ner_model = NerDLModel.pretrained("onto_100","en")
.setInputCols(Array("document","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("CARDINAL"))
val CM = ChunkMapperModel.pretrained("legmapper_edgar_irs","en","legal/models")
.setInputCols("ner_chunk")
.setOutputCol("mappings")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
embeddings,
ner_model,
ner_converter,
CM) )
val text ="""873474341 is an American multinational corporation that is engaged in the design,development,manufacturing,and worldwide marketing and sales of footwear,apparel,equipment,accessories,and services"""
val data = Seq(text).toDF("text")
val result= pipeline.fit(data).transform(data)
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |result |
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[873474341]|[Masterworks 096, LLC, RETAIL-RETAIL STORES, NEC [5990], 5990, 873474341, 1231, NY, DE, 225 LIBERTY STREET, NEW YORK, NY, 10281, 2035185172, , , 2022-01-10, 1894064]|
+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------+
We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.
This is the AnnotatorApproach of the ChunkMapper, which can be used to train ChunkMapper models by giving a custom mapping dictionary. To use pretrained models, check the documentation of the ChunkMapperModel annotator.
The annotator also allows using fuzzy matching, which can take into consideration parts of the tokens that can map even when the word order is different, char ngrams that can map even when there are typos, and fuzzy distance metrics (Jaccard, Levenshtein, etc.).
Example usage and more details can be found on Spark NLP Workshop repository accessible in GitHub, for example the notebook Healthcare Chunk Mapping.
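Once trained and saved, the mapper can be loaded back as a ChunkMapperModel for inference. A minimal sketch, assuming a mapper saved to models/drug_mapper as in the medical example below:
chunkerMapper = medical.ChunkMapperModel.load("models/drug_mapper") \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("mappings") \
    .setRels(["action"])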
Input Annotator Types: CHUNK
Output Annotator Type: LABEL_DEPENDENCY
Python API: ChunkMapperApproach | Scala API: ChunkMapperApproach | Notebook: ChunkMapperApproachModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
# First, create a dictionary in JSON format following this schema:
import json
data_set= {
"mappings": [
{
"key": "metformin",
"relations": [
{
"key": "action",
"values" : ["hypoglycemic", "Drugs Used In Diabetes"]
},
{
"key": "treatment",
"values" : ["diabetes", "t2dm"]
}
]
}
]
}
with open('sample_drug.json', 'w', encoding='utf-8') as f:
json.dump(data_set, f, ensure_ascii=False, indent=4)
# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
#NER model to detect drug in the text
clinical_ner = medical.NerModel.pretrained("ner_posology_small","en","clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")\
.setLabelCasing("upper")
ner_converter = medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["DRUG"])
chunkerMapper = medical.ChunkMapperApproach()\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setDictionary("/content/sample_drug.json")\
.setRels(["action"]) #or treatment
pipeline = nlp.Pipeline().setStages([
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunkerMapper])
text = ["The patient was given 1 unit of metformin daily."]
test_data = spark.createDataFrame([text]).toDF("text")
model = pipeline.fit(test_data)
res= model.transform(test_data)
model.stages[-1].write().save("models/drug_mapper")
from johnsnowlabs import nlp, finance
# First, create a dictionary in JSON format following this schema:
import json
data_set= {
"mappings": [
{
"key": "Rayton Solar Inc.",
"relations": [
{
"key": "name",
"values" : ['Rayton Solar Inc.']
},
{
"key": "sic",
"values" : ['SEMICONDUCTORS & RELATED DEVICES [3674]']
}]
}]
}
with open('sample_finance.json', 'w', encoding='utf-8') as f:
json.dump(data_set, f, ensure_ascii=False, indent=4)
# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
finance_ner = finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["ORG"]) # Return only ORG entities
all_rels = ["name", "sic"]  # relation keys defined in the dictionary above
chunkerMapper = finance.ChunkMapperApproach()\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setDictionary("/content/sample_finance.json")\
.setRels(all_rels)
pipeline = nlp.Pipeline().setStages([
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
finance_ner,
ner_converter,
chunkerMapper])
text = ["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. "]
test_data = spark.createDataFrame([text]).toDF("text")
model = pipeline.fit(test_data)
res= model.transform(test_data)
model.stages[-1].write().save("models/finance_mapper")
from johnsnowlabs import nlp, legal
# First, create a dictionary in JSON format following this schema:
import json
data_set= {
"mappings": [
{
"key": "Rayton Solar Inc.",
"relations": [
{
"key": "name",
"values" : ['Rayton Solar Inc.']
},
{
"key": "sic",
"values" : ['SEMICONDUCTORS & RELATED DEVICES [3674]']
}]
}]
}
with open('sample_legal.json', 'w', encoding='utf-8') as f:
json.dump(data_set, f, ensure_ascii=False, indent=4)
# Create a pipeline
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
legal_ner = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["ORG"]) # Return only ORG entities
all_rels = ["name", "sic"]  # relation keys defined in the dictionary above
chunkerMapper = legal.ChunkMapperApproach()\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setDictionary("/content/sample_legal.json")\
.setRels(all_rels)
pipeline = nlp.Pipeline().setStages([
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
legal_ner,
ner_converter,
chunkerMapper])
text = ["AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price. "]
test_data = spark.createDataFrame([text]).toDF("text")
model = pipeline.fit(test_data)
res= model.transform(test_data)
model.stages[-1].write().save("models/legal_mapper")
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings") //NER model to detect drug in the text
val clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
.setLabelCasing("upper")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("DRUG"))
val chunkerMapper = new ChunkMapperApproach()
.setInputCols("ner_chunk")
.setOutputCol("mappings")
.setDictionary("/content/sample_drug.json")
.setRels(Array("action") ) //or treatment
val pipeline = new Pipeline()
.setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunkerMapper) )
val text = "The patient was given 1 unit of metformin daily."
val test_data = Seq(text).toDF("text")
val model = pipeline.fit(test_data)
val res = model.transform(test_data)
model.stages.last.asInstanceOf[ChunkMapperModel].write.save("models/drug_mapper")
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence","token") )
.setOutputCol("embeddings")
val finance_ner = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence","token","embeddings") )
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner") )
.setOutputCol("ner_chunk")
.setWhiteList(Array("ORG") ) // Return only ORG entities
val all_rels = Array("name", "sic") // relation keys defined in the sample dictionary
val chunkerMapper = new ChunkMapperApproach()
.setInputCols("ner_chunk")
.setOutputCol("mappings")
.setDictionary("/content/sample_finance.json")
.setRels(all_rels)
val pipeline = new Pipeline()
.setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
finance_ner,
ner_converter,
chunkerMapper) )
val text = "AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."
val test_data = Seq(text).toDF("text")
val model = pipeline.fit(test_data)
val res = model.transform(test_data)
model.stages.last.asInstanceOf[ChunkMapperModel].write.save("models/finance_mapper")
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val legal_ner = LegalNerModel.pretrained("legner_org_per_role_date","en","legal/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
.setWhiteList("ORG") // Return only ORG entities
val chunkerMapper = new ChunkMapperApproach()
.setInputCols("ner_chunk")
.setOutputCol("mappings")
.setDictionary("/content/sample_json")
.setRels(all_rels)
val pipeline = new Pipeline()
.setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
legal_ner,
ner_converter,
chunkerMapper) )
val text = "AWA Group LP intends to pay dividends on the Common Units on a quarterly basis at an annual rate of 8.00% of the Offering Price."
val test_data = Seq(text).toDF("text")
val model = pipeline.fit(test_data)
val res = model.transform(test_data)
model.stages.last.asInstanceOf[ChunkMapperModel].write.save("models/legal_mapper")
ChunkMapperFilterer
ChunkMapperFilterer is an annotator to be used after ChunkMapper that allows filtering chunks based on the results of the mapping, whether it was successful or failed.
Parameters:
ReturnCriteria (String): Has two possible values: “success” or “fail”. If “fail” (default), returns the chunks that are not in the label dependencies; if “success”, returns the labels that were successfully mapped by the ChunkMapperModel annotator.
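For instance, a minimal sketch (column names follow the example below) of the complementary setting that keeps only the chunks the ChunkMapperModel mapped successfully:
chunk_mapper_success = medical.ChunkMapperFilterer() \
    .setInputCols(["chunk", "RxNorm_Mapper"]) \
    .setOutputCol("chunks_success") \
    .setReturnCriteria("success")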
Example usage and more details can be found on Spark NLP Workshop repository accessible in GitHub, for example the notebook Healthcare Chunk Mapping.
Input Annotator Types: CHUNK, LABEL_DEPENDENCY
Output Annotator Type: CHUNK
Python API: ChunkMapperFilterer | Scala API: ChunkMapperFilterer | Notebook: ChunkMapperFiltererNotebook |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal()\
.setInputCols("sentence", "token", "ner")\
.setOutputCol("chunk")
chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
.setInputCols(["chunk"])\
.setOutputCol("RxNorm_Mapper")\
.setRels(["rxnorm_code"])
chunk_mapper_filterer = medical.ChunkMapperFilterer() \
.setInputCols(["chunk", "RxNorm_Mapper"]) \
.setOutputCol("chunks_fail") \
.setReturnCriteria("fail")
mapper_pipeline = nlp.Pipeline(
stages = [
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
chunkerMapper,
chunk_mapper_filterer
])
samples = [["The patient was given Adapin 10 MG, coumadn 5 mg"],
["The patient was given Avandia 4 mg, Tegretol, zitiga"] ]
data = spark.createDataFrame(samples).toDF("text")
result = mapper_pipeline.fit(data).transform(data)
result.selectExpr("chunk.result as chunk",
"RxNorm_Mapper.result as RxNorm_Mapper",
"chunks_fail.result as chunks_fail").show(truncate = False)
+--------------------------------+----------------------+--------------+
|chunk |RxNorm_Mapper |chunks_fail |
+--------------------------------+----------------------+--------------+
|[Adapin 10 MG, coumadn 5 mg] |[1000049, NONE] |[coumadn 5 mg]|
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga] |
+--------------------------------+----------------------+--------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols("sentence", "token")
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols("sentence", "token", "ner")
.setOutputCol("chunk")
val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")
.setInputCols("chunk")
.setOutputCol("RxNorm_Mapper")
.setRels(Array("rxnorm_code"))
val chunk_mapper_filterer = new ChunkMapperFilterer()
.setInputCols("chunk", "RxNorm_Mapper")
.setOutputCol("chunks_fail")
.setReturnCriteria("fail")
val mapper_pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
chunkerMapper,
chunk_mapper_filterer
))
val data = Seq("The patient was given Adapin 10 MG, coumadn 5 mg",
"The patient was given Avandia 4 mg, Tegretol, zitiga").toDF("text")
val result = mapper_pipeline.fit(data).transform(data)
+--------------------------------+----------------------+--------------+
|chunk |RxNorm_Mapper |chunks_fail |
+--------------------------------+----------------------+--------------+
|[Adapin 10 MG, coumadn 5 mg] |[1000049, NONE] |[coumadn 5 mg]|
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga] |
+--------------------------------+----------------------+--------------+
ChunkMerge
Merges two chunk columns coming from two annotators (NER, ContextualParser, or any other annotator producing chunks). The merge selects one chunk from one of the columns according to certain criteria, based on the chunk indices in the source document (chunks with longer lengths and the most information are kept from each source). Labels can be changed with setReplaceDictResource.
Parameters:
- inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
- outputCol: The name of the column in Document type that is generated. We can specify only one column here.
- mergeOverlapping: (Boolean) Whether to merge overlapping matched chunks. Default True.
- falsePositivesResource: Sets a file with false positive pairs.
- replaceDictResource: Sets replace dictionary pairs for NER labels.
- blackList: (String List) If defined, list of entities to ignore. The rest will be processed.
- whiteList: (String List) If defined, list of entities to accept.
- selectionStrategy: (String) Whether to select annotations sequentially based on annotation order (Sequential) or using any other available strategy; currently only Sequential and DiverseLonger are available. Default Sequential.
- orderingFeatures: (String List) The ordering features to use for overlapping entities. Possible values are ChunkBegin, ChunkLength, ChunkPrecedence, ChunkConfidence.
- defaultConfidence: (Float) The confidence score used when the ChunkConfidence ordering feature is included and a given annotation does not have a confidence score.
- chunkPrecedence: (String List) Sets the precedence order when a chunk is labeled by two models.
- chunkPrecedenceValuePrioritization: (String List) Used when the ChunkPrecedence ordering feature is set. Contains an array of comma-separated values representing the desired order of prioritization for the values in the metadata fields included in chunkPrecedence.
- resetSentenceIndices: Whether to reset sentence indices to treat the entire output as if it originates from a single document. Default: False.
All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
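As an illustration, a minimal sketch (illustrative values; setter names derived from the camel-case parameters above, column names taken from the clinical example below) of a ChunkMergeApproach configured with explicit overlap handling:
chunk_merger = medical.ChunkMergeApproach()\
    .setInputCols("clinical_ner_chunk", "posology_ner_chunk")\
    .setOutputCol("merged_ner_chunk")\
    .setMergeOverlapping(True)\
    .setSelectionStrategy("DiverseLonger")\
    .setOrderingFeatures(["ChunkLength", "ChunkConfidence"])\
    .setDefaultConfidence(0.5)\
    .setResetSentenceIndices(True)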
Input Annotator Types: CHUNK, CHUNK
Output Annotator Type: CHUNK
Python API: ChunkMergeApproach | Scala API: ChunkMergeApproach |
Show Example
from johnsnowlabs import nlp, medical
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
# 1- ner_clinical model
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("clinical_ner")
clinical_ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "clinical_ner"]) \
.setOutputCol("clinical_ner_chunk")
# 2- posology ner model
posology_ner = medical.NerModel.pretrained("ner_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("posology_ner")
posology_ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "posology_ner"]) \
.setOutputCol("posology_ner_chunk")
# 3- generate a text matcher annotator that extracts female related entities
entities = ['she', 'her', 'girl', 'woman', 'women', 'womanish', 'womanlike', 'womanly', 'madam', 'madame', 'senora', 'lady', 'miss', 'girlfriend', 'wife', 'bride', 'misses', 'mrs.', 'female']
with open ('female_entities.txt', 'w') as f:
for i in entities:
f.write(i+'\n')
# Find female entities using TextMatcher
female_entity_extractor = nlp.TextMatcher() \
.setInputCols(["sentence",'token'])\
.setOutputCol("female_entities")\
.setEntities("female_entities.txt")\
.setCaseSensitive(False)\
.setEntityValue('female_entity')
# Chunk Merge annotator is used to merge columns
chunk_merger = medical.ChunkMergeApproach()\
.setInputCols("posology_ner_chunk", 'clinical_ner_chunk', "female_entities")\
.setOutputCol('merged_ner_chunk')
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
clinical_ner_converter,
posology_ner,
posology_ner_converter,
female_entity_extractor,
chunk_merger])
sample_text = """The lady was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."""
data = spark.createDataFrame([[sample_text]]).toDF("text")
model = nlpPipeline.fit(data).transform(data)
# Show results
model.selectExpr("explode(merged_ner_chunk) as a") \
.selectExpr("a.begin","a.end","a.result as chunk","a.metadata.entity as entity") \
.show(10, False)
+-----+---+-----------------------------+-------------+
|begin|end|chunk |entity |
+-----+---+-----------------------------+-------------+
|4 |7 |lady |female_entity|
|47 |57 |amoxicillin |DRUG |
|63 |91 |a respiratory tract infection|PROBLEM |
|95 |97 |She |female_entity|
|106 |114|metformin |DRUG |
|118 |126|glipizide |TREATMENT |
|134 |146|dapagliflozin |TREATMENT |
|152 |155|T2DM |PROBLEM |
|161 |172|atorvastatin |DRUG |
|178 |188|gemfibrozil |TREATMENT |
+-----+---+-----------------------------+-------------+
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("bert_embeddings")
fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
#.setLabelCasing("upper")
ner_converter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"
ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
.setInputCols(["sentence", "token", "bert_embeddings"]) \
.setOutputCol("ner_finner")
#.setLabelCasing("upper")
ner_converter_finner = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner_finner"]) \
.setOutputCol("ner_finner_chunk") \
.setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER
chunk_merge = finance.ChunkMergeApproach()\
.setInputCols("ner_finner_chunk", "ner_chunk")\
.setOutputCol("deid_merged_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
bert_embeddings,
fin_ner,
ner_converter,
ner_finner,
ner_converter_finner,
chunk_merge])
data = spark.createDataFrame([["Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"]]).toDF("text")
# Show results
result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result,
result.deid_merged_chunk.metadata)).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+---------------------+---------+
|chunk |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON |
|founder |ROLE |
|CEO |ROLE |
|Amazon |PARTY |
+---------------------+---------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
#.setLabelCasing("upper")
ner_converter = legal.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ALIAS": "PARTY"})
ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner_signers")
#.setLabelCasing("upper")
ner_converter_signers = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner_signers"]) \
.setOutputCol("ner_signer_chunk")
chunk_merge = legal.ChunkMergeApproach()\
.setInputCols("ner_signer_chunk", "ner_chunk")\
.setOutputCol("deid_merged_chunk")
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legal_ner,
ner_converter,
ner_signers,
ner_converter_signers,
chunk_merge])
data = spark.createDataFrame([["ENTIRE AGREEMENT. This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.
2THEMART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99 Date: 6/11/99 "]]).toDF("text")
# Show results
result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result,
result.deid_merged_chunk.metadata)).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+-----------------------+--------------+
|chunk |ner_label |
+-----------------------+--------------+
|ENTIRE AGREEMENT |DOC |
|INC |PARTY |
|J. Magliarditi |SIGNING_PERSON|
|Bajaj |SIGNING_PERSON|
|Dominic J. Magliarditi |SIGNING_PERSON|
|Sanjay Bajaj |SIGNING_PERSON|
|President |SIGNING_TITLE |
|VP Business Development|SIGNING_TITLE |
+-----------------------+--------------+
import spark.implicits._
// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
// Sentence Detector annotator,processes various sentences per line
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
// Clinical word embeddings trained on PubMED dataset
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
// 1- ner_clinical model
val clinical_ner = MedicalNerModel.pretrained("ner_clinical","en","clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("clinical_ner")
val clinical_ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence","token","clinical_ner"))
.setOutputCol("clinical_ner_chunk")
// 2- posology ner model
val posology_ner = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("posology_ner")
val posology_ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence","token","posology_ner"))
.setOutputCol("posology_ner_chunk")
// 3- generate a text matcher annotator that extracts female related entities
val entities = Array("she","her","girl","woman","women","womanish","womanlike","womanly","madam","madame","senora","lady","miss","girlfriend","wife","bride","misses","mrs.","female")
// Write the female entities to a file for the TextMatcher
import java.nio.file.{Files, Paths}
Files.write(Paths.get("female_entities.txt"), entities.mkString("\n").getBytes)
// Find female entities using TextMatcher
val female_entity_extractor = new TextMatcher()
.setInputCols(Array("sentence","token"))
.setOutputCol("female_entities")
.setEntities("female_entities.txt")
.setCaseSensitive(false)
.setEntityValue("female_entity")
// Chunk Merge annotator is used to merge columns
val chunk_merger = new ChunkMergeApproach()
.setInputCols(Array("posology_ner_chunk","clinical_ner_chunk","female_entities"))
.setOutputCol("merged_ner_chunk")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
clinical_ner_converter,
posology_ner,
posology_ner_converter,
female_entity_extractor,
chunk_merger))
val text ="""The lady was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."""
val data = Seq(text).toDF("text")
val model = nlpPipeline.fit(data).transform(data)
+-----+---+-----------------------------+-------------+
|begin|end|chunk |entity |
+-----+---+-----------------------------+-------------+
|4 |7 |lady |female_entity|
|47 |57 |amoxicillin |DRUG |
|63 |91 |a respiratory tract infection|PROBLEM |
|95 |97 |She |female_entity|
|106 |114|metformin |DRUG |
|118 |126|glipizide |TREATMENT |
|134 |146|dapagliflozin |TREATMENT |
|152 |155|T2DM |PROBLEM |
|161 |172|atorvastatin |DRUG |
|178 |188|gemfibrozil |TREATMENT |
+-----+---+-----------------------------+-------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCol("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCol("sentence")
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val bert_embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("bert_embeddings")
val fin_ner = FinanceNerModel.pretrained("finner_deid", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
//.setLabelCasing("upper")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"
val ner_finner = FinanceNerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
.setInputCols(Array("sentence", "token", "bert_embeddings"))
.setOutputCol("ner_finner")
//.setLabelCasing("upper")
val ner_converter_finner = new NerConverter()
.setInputCols(Array("sentence", "token", "ner_finner"))
.setOutputCol("ner_finner_chunk")
.setWhiteList(Array("ROLE")) // Just use "ROLE" entity from this NER
val chunk_merge = new ChunkMergeApproach()
.setInputCols(Array("ner_finner_chunk", "ner_chunk"))
.setOutputCol("deid_merged_chunk")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
bert_embeddings,
fin_ner,
ner_converter,
ner_finner,
ner_converter_finner,
chunk_merge))
val data = Seq(("Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon")).toDF("text")
// Show results
val result = nlpPipeline.fit(data).transform(data)
+---------------------+---------+
|chunk |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON |
|founder |ROLE |
|CEO |ROLE |
|Amazon |PARTY |
+---------------------+---------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCol("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCol("sentence")
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val legal_ner = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
//.setLabelCasing("upper")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ALIAS": "PARTY"})
val ner_signers = LegalNerModel.pretrained("legner_signers", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_signers")
//.setLabelCasing("upper")
val ner_converter_signers = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_signers"))
.setOutputCol("ner_signer_chunk")
val chunk_merge = new ChunkMergeApproach()
.setInputCols(Array("ner_signer_chunk", "ner_chunk"))
.setOutputCol("deid_merged_chunk")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legal_ner,
ner_converter,
ner_signers,
ner_converter_signers,
chunk_merge))
val data = Seq(("ENTIRE AGREEMENT. This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.
2THEMART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99 Date: 6/11/99 ")).toDF("text")
// Show results
val result = nlpPipeline.fit(data).transform(data)
+-----------------------+--------------+
|chunk |ner_label |
+-----------------------+--------------+
|ENTIRE AGREEMENT |DOC |
|INC |PARTY |
|J. Magliarditi |SIGNING_PERSON|
|Bajaj |SIGNING_PERSON|
|Dominic J. Magliarditi |SIGNING_PERSON|
|Sanjay Bajaj |SIGNING_PERSON|
|President |SIGNING_TITLE |
|VP Business Development|SIGNING_TITLE |
+-----------------------+--------------+
ChunkSentenceSplitter
ChunkSentenceSplitter annotator can split documents into chunks according to the separators given as CHUNK columns. It is useful when you need to apply different models or analyses to different sections of your document (for example, different headers, clauses, items, etc.). The given separator chunk can be the output of, for example, RegexMatcher or NerModel.
Parameters:
- GroupBySentences: (Boolean) Whether to split the paragraphs by grouping the chunks by sentences.
- InsertChunk: (Boolean) Whether to insert the chunk in the paragraph or not.
- DefaultEntity: (str) Sets the key in the metadata dictionary that you want to filter (by default ‘entity’).
For detailed usage of this annotator, visit this notebook from our Spark NLP Workshop.
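For instance, a minimal sketch (illustrative values; column and annotator names follow the examples below) that keeps the splitter chunk itself out of the resulting paragraphs:
chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
    .setInputCols("document", "ner_chunk")\
    .setOutputCol("paragraphs")\
    .setGroupBySentences(False)\
    .setInsertChunk(False)\
    .setDefaultEntity("entity")  # metadata key used to read the chunk label (default "entity")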
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: DOCUMENT
Python API: ChunkSentenceSplitter | Scala API: ChunkSentenceSplitter | Notebook: ChunkSentenceSplitterNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Defining the pipeline
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")\
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("ner_jsl_slim", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["Header"])
#applying ChunkSentenceSplitter
chunkSentenceSplitter = medical.ChunkSentenceSplitter()\
.setInputCols("document","ner_chunk")\
.setOutputCol("paragraphs")\
.setGroupBySentences(False)
pipeline_model = nlp.Pipeline(
stages = [
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunkSentenceSplitter
])
sentences = [["""Sample Name: Mesothelioma - Pleural Biopsy
Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)
PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesothelioma.
POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.
ANESTHESIA: General double-lumen endotracheal.
DESCRIPTION OF FINDINGS: Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
SPECIMEN: Pleural biopsies for pathology and microbiology.
INDICATIONS: Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed.
Dr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case."""]]
df = spark.createDataFrame(sentences).toDF("text")
paragraphs = pipeline_model.fit(df).transform(df)
paragraphs.selectExpr("explode(paragraphs) as result")\
.selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").show(truncate=80)
+--------------------------------------------------------------------------------+------------+------------------------+
| result| entity| splitter_chunk|
+--------------------------------------------------------------------------------+------------+------------------------+
| Sample Name: Mesothelioma - Pleural Biopsy |introduction| UNK|
|Description: Right pleural effusion and suspected malignant mesothelioma. (Me...| Header| Description:|
|PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesot...| Header| PREOPERATIVE DIAGNOSIS:|
|POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothel...| Header|POSTOPERATIVE DIAGNOSIS:|
| ANESTHESIA: General double-lumen endotracheal. | Header| ANESTHESIA:|
|DESCRIPTION OF FINDINGS: Right pleural effusion, firm nodules, diffuse scatt...| Header|DESCRIPTION OF FINDINGS:|
| SPECIMEN: Pleural biopsies for pathology and microbiology. | Header| SPECIMEN:|
|INDICATIONS: Briefly, this is a 66-year-old gentleman who has been transferr...| Header| INDICATIONS:|
+--------------------------------------------------------------------------------+------------+------------------------+
from johnsnowlabs import nlp, finance, legal
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_headers", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = finance.NerConverterInternal()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
chunkSentenceSplitter = legal.ChunkSentenceSplitter()\
.setInputCols("document","ner_chunk")\
.setOutputCol("paragraphs")\
.setGroupBySentences(False)
nlp_pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter,
chunkSentenceSplitter])
text = """2. DEFINITION.
For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1 and 2 Appointment as Reseller.
2.1 Appointment.
The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6
2.2 Customer Agreements."""
sdf = spark.createDataFrame([[text]]).toDF("text")
paragraphs = nlp_pipeline.fit(sdf).transform(sdf)
paragraphs.selectExpr("explode(paragraphs) as result")\
.selectExpr("result.result","result.metadata.entity").show(truncate=50)
+--------------------------------------------------+---------+
| result| entity|
+--------------------------------------------------+---------+
| 2. | HEADER|
|DEFINITION. For purposes of this Agreement, t...|SUBHEADER|
| 2.1 Appointment. |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
| 6 2.2 Customer Agreements| HEADER|
+--------------------------------------------------+---------+
from johnsnowlabs import nlp, legal
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained("legner_headers", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
chunkSentenceSplitter = legal.ChunkSentenceSplitter()\
.setInputCols("document","ner_chunk")\
.setOutputCol("paragraphs")\
.setGroupBySentences(False)
nlp_pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter,
chunkSentenceSplitter])
text = """AGREEMENT
NOW, THEREFORE, for good and valuable consideration, and in consideration of the mutual covenants and conditions herein contained, the Parties agree as follows:
2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller.
2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6
2.2 Customer Agreements.
a) Subscriptions. Allscripts and its Affiliates may sell Subscriptions for terms no less than one year and no greater than four (4) years on a subscription basis to Persons who subsequently execute a Customer Agreement, provided that Allscripts may enter into Customer Agreements with terms longer than four (4) years with large organizations, provided that Phreesia consents in each instance in writing in advance, which consent will not be unreasonably withheld."""
sdf = spark.createDataFrame([[text]]).toDF("text")
paragraphs = nlp_pipeline.fit(sdf).transform(sdf)
paragraphs.selectExpr("explode(paragraphs) as result")\
.selectExpr("result.result","result.metadata.entity").show(truncate=50)
+--------------------------------------------------+---------+
| result| entity|
+--------------------------------------------------+---------+
|AGREEMENT NOW, THEREFORE, for good and valuabl... |SUBHEADER|
| Appointment as Reseller. |SUBHEADER|
| 2.1 Appointment. |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
| 6 2.2 Customer Agreements. | HEADER|
|a) Subscriptions. Allscripts and its Affiliates...|SUBHEADER|
+--------------------------------------------------+---------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_jsl_slim", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList("Header")
// applying ChunkSentenceSplitter
val chunkSentenceSplitter = new ChunkSentenceSplitter()
.setInputCols(Array("document","ner_chunk"))
.setOutputCol("paragraphs")
.setGroupBySentences(false)
val pipeline_model = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunkSentenceSplitter
))
val sentences = ("""Sample Name: Mesothelioma - Pleural Biopsy
Description: Right pleural effusion and suspected malignant mesothelioma. (Medical Transcription Sample Report)
PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesothelioma.
POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothelioma.
ANESTHESIA: General double-lumen endotracheal.
DESCRIPTION OF FINDINGS: Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
SPECIMEN: Pleural biopsies for pathology and microbiology.
INDICATIONS: Briefly, this is a 66-year-old gentleman who has been transferred from an outside hospital after a pleural effusion had been drained and biopsies taken from the right chest that were thought to be consistent with mesothelioma. Upon transfer, he had a right pleural effusion demonstrated on x-ray as well as some shortness of breath and dyspnea on exertion. The risks, benefits, and alternatives to right VATS pleurodesis and pleural biopsy were discussed with the patient and his family and they wished to proceed.
Dr. X was present for the entire procedure which was right VATS pleurodesis and pleural biopsies.The counts were correct x2 at the end of the case.""")
val data = Seq(sentences).toDF("text")
val paragraphs = pipeline_model.fit(data).transform(data)
+--------------------------------------------------------------------------------+------------+------------------------+
| result| entity| splitter_chunk|
+--------------------------------------------------------------------------------+------------+------------------------+
| Sample Name: Mesothelioma - Pleural Biopsy |introduction| UNK|
|Description: Right pleural effusion and suspected malignant mesothelioma. (Me...| Header| Description:|
|PREOPERATIVE DIAGNOSIS: Right pleural effusion and suspected malignant mesot...| Header| PREOPERATIVE DIAGNOSIS:|
|POSTOPERATIVE DIAGNOSIS: Right pleural effusion, suspected malignant mesothel...| Header|POSTOPERATIVE DIAGNOSIS:|
| ANESTHESIA: General double-lumen endotracheal. | Header| ANESTHESIA:|
|DESCRIPTION OF FINDINGS: Right pleural effusion, firm nodules, diffuse scatt...| Header|DESCRIPTION OF FINDINGS:|
| SPECIMEN: Pleural biopsies for pathology and microbiology. | Header| SPECIMEN:|
|INDICATIONS: Briefly, this is a 66-year-old gentleman who has been transferr...| Header| INDICATIONS:|
+--------------------------------------------------------------------------------+------------+------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = FinanceNerModel.pretrained("finner_headers", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val chunkSentenceSplitter = new ChunkSentenceSplitter()
.setInputCols(Array("document","ner_chunk"))
.setOutputCol("paragraphs")
.setGroupBySentences(false)
val nlp_pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter,
chunkSentenceSplitter))
val text = """2. DEFINITION.
For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1 and 2 Appointment as Reseller.
2.1 Appointment.
The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6
2.2 Customer Agreements."""
val data = Seq(text).toDF("text")
val paragraphs = nlp_pipeline.fit(data).transform(data)
+--------------------------------------------------+---------+
| result| entity|
+--------------------------------------------------+---------+
| 2. | HEADER|
|DEFINITION. For purposes of this Agreement, t...|SUBHEADER|
| 2.1 Appointment. |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
| 6 2.2 Customer Agreements| HEADER|
+--------------------------------------------------+---------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = LegalNerModel.pretrained("legner_headers", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val chunkSentenceSplitter = new ChunkSentenceSplitter()
.setInputCols(Array("document","ner_chunk"))
.setOutputCol("paragraphs")
.setGroupBySentences(false)
val nlp_pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter,
chunkSentenceSplitter))
val text = """AGREEMENT
NOW, THEREFORE, for good and valuable consideration, and in consideration of the mutual covenants and conditions herein contained, the Parties agree as follows:
2. Definitions. For purposes of this Agreement, the following terms have the meanings ascribed thereto in this Section 1. 2. Appointment as Reseller.
2.1 Appointment. The Company hereby [***]. Allscripts may also disclose Company's pricing information relating to its Merchant Processing Services and facilitate procurement of Merchant Processing Services on behalf of Sublicensed Customers, including, without limitation by references to such pricing information and Merchant Processing Services in Customer Agreements. 6
2.2 Customer Agreements.
a) Subscriptions. Allscripts and its Affiliates may sell Subscriptions for terms no less than one year and no greater than four (4) years on a subscription basis to Persons who subsequently execute a Customer Agreement, provided that Allscripts may enter into Customer Agreements with terms longer than four (4) years with large organizations, provided that Phreesia consents in each instance in writing in advance, which consent will not be unreasonably withheld."""
val data = Seq(text).toDF("text")
val paragraphs = nlp_pipeline.fit(data).transform(data)
+--------------------------------------------------+---------+
| result| entity|
+--------------------------------------------------+---------+
|AGREEMENT NOW, THEREFORE, for good and valuabl... |SUBHEADER|
| Appointment as Reseller. |SUBHEADER|
| 2.1 Appointment. |SUBHEADER|
|The Company hereby [***]. Allscripts may also d...|SUBHEADER|
| 6 2.2 Customer Agreements. | HEADER|
|a) Subscriptions. Allscripts and its Affiliates...|SUBHEADER|
+--------------------------------------------------+---------+
ContextSplitAssembler
Converts and assembles VECTOR_SIMILARITY_RANKINGS type annotations into DOCUMENT type.
The input annotations are expected to be of type VECTOR_SIMILARITY_RANKINGS and the output annotation type is DOCUMENT.
It concatenates the results of the input annotations into a single result, separated by a join string.
When explodeSplits is set to True, the splits are exploded into separate annotations.
The joinString parameter is used to add the delimiter between the results of annotations when combining them into a single result.
Parameters:
- joinString (str): Specifies the string that will be inserted between results of annotations when combining them into a single result. It acts as a delimiter, ensuring that the elements are properly separated and organized in the final result of annotation. Default: " ".
- explodeSplits (Bool): Whether to explode the splits into separate annotations or not. Default: False.
Input Annotator Types: VECTOR_SIMILARITY_RANKINGS
Output Annotator Type: DOCUMENT
Python API: ContextSplitAssembler | Scala API: ContextSplitAssembler |
Show Example
from johnsnowlabs import nlp, medical
context_split_assembler = ( medical.ContextSplitAssembler()
.setInputCols("vector_db")
.setOutputCol("document")
.setJoinString("\n")
.setExplodeSplits(False))
ContextualAssertion
An annotator model for contextual assertion analysis. This model identifies contextual cues within text data, such as negation, uncertainty, etc., and is used for clinical assertion detection. It annotates text chunks with assertions based on configurable rules, prefix and suffix patterns, and exception patterns.
Parameters:
- inputCols: Input annotations.
- caseSensitive: Whether to match values case-sensitively, by default False.
- prefixAndSuffixMatch: Whether to match both prefix and suffix to annotate the hit, by default False.
- prefixKeywords: Prefix keywords to match.
- suffixKeywords: Suffix keywords to match.
- exceptionKeywords: Exception keywords not to match.
- prefixRegexPatterns: Prefix regex patterns to match.
- suffixRegexPatterns: Suffix regex patterns to match.
- exceptionRegexPatterns: Exception regex patterns not to match.
- scopeWindow: The scope window of the assertion expression.
- assertion: The assertion label to assign on a match.
- scopeWindowDelimiter: Delimiters used to limit the scope window.
- includeChunkToScope: Whether to include the chunk in the scope when matching values.
- confidenceCalculationDirection: Indicates the direction for calculating assertion confidence (left, right, or both; default is left).
See Spark NLP Workshop for more examples of usage.
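Since the assertion parameter sets a single label per instance, multiple cue sets can be handled by stacking several ContextualAssertion instances and merging their outputs. A minimal sketch (column names are assumed, and the AssertionMerger usage below assumes its standard setInputCols/setOutputCol API):
contextual_absent = medical.ContextualAssertion() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("assertion_absent") \
    .setPrefixKeywords(["no", "denies"]) \
    .setAssertion("absent")

contextual_possible = medical.ContextualAssertion() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("assertion_possible") \
    .setPrefixKeywords(["possible", "suspected"]) \
    .setAssertion("possible")

# AssertionMerger (see the annotator list) can then combine the two assertion columns
assertion_merger = medical.AssertionMerger() \
    .setInputCols(["assertion_absent", "assertion_possible"]) \
    .setOutputCol("assertion")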
Input Annotator Types: DOCUMENT, TOKEN, CHUNK
Output Annotator Type: ASSERTION
Python API: ContextualAssertion | Scala API: ContextualAssertion |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
clinical_ner = medical.NerModel \
.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
contextual_assertion = medical.ContextualAssertion() \
.setInputCols(["sentence", "token", "ner_chunk"]) \
.setOutputCol("assertion") \
.setPrefixKeywords(["no", "not"]) \
.setSuffixKeywords(["unlikely","negative"]) \
.setPrefixRegexPatterns(["\\b(no|without|denies|never|none|free of|not include)\\b"]) \
.setSuffixRegexPatterns(["\\b(free of|negative for|absence of|not|rule out)\\b"]) \
.setExceptionKeywords(["without"]) \
.setExceptionRegexPatterns(["\\b(not clearly)\\b"]) \
.addPrefixKeywords(["negative for","negative"]) \
.addSuffixKeywords(["absent","neither"]) \
.setCaseSensitive(False) \
.setPrefixAndSuffixMatch(False) \
.setAssertion("absent") \
.setScopeWindow([2, 2])\
.setIncludeChunkToScope(True)\
.setScopeWindowDelimiters([","])
flattener = medical.Flattener() \
.setInputCols("assertion") \
.setExplodeSelectedFields({"assertion":["metadata.ner_chunk as ner_chunk",
"begin as begin",
"end as end",
"metadata.ner_label as ner_label",
"result as result"]})
pipeline = nlp.Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
contextual_assertion,
flattener])
text = """Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, states chest tightness.
No evidence of hypertension. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating
associated with pneumonia. Patient refused pain but tylenol still given. Neither substance abuse nor alcohol use however cocaine
once used in the last year. Alcoholism unlikely. Patient has headache and fever. Patient is not diabetic. Not clearly of diarrhea.
Lab reports confirm lymphocytopenia. Cardaic rhythm is Sinus bradycardia. Patient also has a history of cardiac injury.
No kidney injury reported. No abnormal rashes or ulcers. Patient might not have liver disease. Confirmed absence of hemoptysis.
Although patient has severe pneumonia and fever, test reports are negative for COVID-19 infection. COVID-19 viral infection absent.
"""
data = spark.createDataFrame([[text]]).toDF("text")
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)
result = model.transform(data)
result.show(truncate=False)
# result
+------------------+-----+---+---------+------+
|ner_chunk |begin|end|ner_label|result|
+------------------+-----+---+---------+------+
|nausea |173 |178|PROBLEM |absent|
|Alcoholism |413 |422|PROBLEM |absent|
|diabetic |481 |488|PROBLEM |absent|
|kidney injury |639 |651|PROBLEM |absent|
|abnormal rashes |666 |680|PROBLEM |absent|
|liver disease |716 |728|PROBLEM |absent|
|COVID-19 infection|843 |860|PROBLEM |absent|
|viral infection |872 |886|PROBLEM |absent|
+------------------+-----+---+---------+------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentences"))
.setOutputCol("tokens")
val embedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val nerTagger = MedicalNerModel .pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("nerTags")
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "nerTags"))
.setOutputCol("nerChunks")
val contextualAssertion = new ContextualAssertion()
.setInputCols(Array("sentences", "tokens","nerChunks"))
.setOutputCol("assertion")
.setScopeWindow(2,2)
.setPrefixRegexPatterns(Array("\\b(no|without|denies|never|none|free of|negative for|not include)\\b"))
.setSuffixRegexPatterns(Array("\\b(free of|negative for|absence of|absence|not|neither|rule out)\\b"))
.setPrefixKeywords(Array("not","never"))
.setSuffixKeywords(Array("no","never"))
.setCaseSensitive(false)
.setIncludeChunkToScope(true)
.addPrefixKeywords(Array("negative for","no evidence of"))
.addSuffixKeywords(Array("declined"))
.setAssertion("absent")
.setScopeWindowDelimiter(Array(","))
val flattener = new Flattener()
.setInputCols("assertion")
.setExplodeSelectedFields(Map("assertion" -> Array("metadata.ner_chunk as ner_chunk", "begin as begin", "end as end", "metadata.ner_label as ner_label", "result as result")))
val pipeline = new Pipeline()
.setStages(Array(documentAssembler,
sentenceDetector,
tokenizer,
embedder,
nerTagger,
nerConverter,
contextualAssertion,
flattener
))
val text = "Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, states chest tightness." +
" No evidence of hypertension. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating " +
"associated with pneumonia. Patient refused pain but tylenol still given. Neither substance abuse nor alcohol use however cocaine " +
"once used in the last year. Alcoholism unlikely. Patient has headache and fever. Patient is not diabetic. Not clearly of diarrhea. " +
"Lab reports confirm lymphocytopenia. Cardaic rhythm is Sinus bradycardia. Patient also has a history of cardiac injury." +
" No kidney injury reported. No abnormal rashes or ulcers. Patient might not have liver disease. Confirmed absence of hemoptysis." +
" Although patient has severe pneumonia and fever, test reports are negative for COVID-19 infection. COVID-19 viral infection absent."
val dataSet = Seq(text).toDS.toDF("text")
val result = pipeline.fit(dataSet).transform(dataSet)
// result
+------------------+-----+---+---------+------+
|ner_chunk |begin|end|ner_label|result|
+------------------+-----+---+---------+------+
|nausea |173 |178|PROBLEM |absent|
|Alcoholism |413 |422|PROBLEM |absent|
|diabetic |481 |488|PROBLEM |absent|
|kidney injury |639 |651|PROBLEM |absent|
|abnormal rashes |666 |680|PROBLEM |absent|
|liver disease |716 |728|PROBLEM |absent|
|COVID-19 infection|843 |860|PROBLEM |absent|
|viral infection |872 |886|PROBLEM |absent|
+------------------+-----+---+---------+------+
ContextualEntityFilterer
ContextualEntityFilterer can filter chunks coming from CHUNK annotations based on entity (identifier, field) information in the metadata and on contextual cues. Filtering can be done via white lists and black lists of entities and of words. The filter can be applied at the sentence or the document scope.
Parameters:
- ruleScope: The rule scope to apply the filter. Options: sentence, document.
- caseSensitive: Whether to match words case-sensitively. Default is False.
- rules: The filtering rules. Each rule is a dictionary with the following keys:
  - entity: The target entity field for filtering.
  - scopeWindow: A list of two integers [before, after], specifying how many tokens/chunks before and after the target to consider.
  - whiteListEntities: The white list of entities. If one of the entities from this list appears within the scope window, the chunk will be kept. A single match is enough to keep the chunk.
  - blackListEntities: The black list of entities. If an entity from this list appears within the scope window, the chunk will be filtered out. All elements must be absent to keep the chunk.
  - scopeWindowLevel: Determines whether the scopeWindow is applied at the token or chunk level. Options: token, chunk.
  - blackListWords: The black list of words. If a word from this list appears within the scope window, the chunk will be filtered out.
  - whiteListWords: The white list of words. If a word from this list appears within the scope window, the chunk will be kept.
  - confidenceThreshold: The confidence threshold to filter the chunks. Filtering is only applied if the confidence of the chunk is below the threshold.
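For example, a minimal sketch (illustrative values; column names follow the example below) of a rule that also uses the word lists and the confidence threshold described above:
rules = [{
    "entity": "AGE",
    "scopeWindow": [3, 3],
    "scopeWindowLevel": "token",
    "whiteListWords": ["year", "years"],   # keep AGE chunks with these words nearby
    "blackListWords": ["ago"],             # drop AGE chunks with this word nearby
    "confidenceThreshold": 0.5             # only chunks below this confidence are filtered
}]

contextual_entity_filterer = medical.ContextualEntityFilterer() \
    .setInputCols("document", "token", "ner_chunk_subentity_docwise") \
    .setOutputCol("filtered_ner_chunks") \
    .setRules(rules) \
    .setRuleScope("document") \
    .setCaseSensitive(False)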
Input Annotator Types: DOCUMENT, TOKEN, CHUNK
Output Annotator Type: CHUNK
Python API: ContextualEntityFilterer | Scala API: ContextualEntityFilterer | Notebook: ContextualEntityFilterer |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
ner_deid = medical.NerModel.pretrained("ner_deid_subentity_docwise", "en", "clinical/models") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner_deid_subentity_docwise")
ner_deid_converter = medical.NerConverterInternal()\
.setInputCols(["document", "token", "ner_deid_subentity_docwise"])\
.setOutputCol("ner_chunk_subentity_docwise")
rules =[{ "entity": "STATE",
"scopeWindow": [2, 2],
"whiteListEntities": ["CITY"],
"blackListEntities": ["NAME"],
"scopeWindowLevel": "token"
}]
contextual_entity_filterer = medical.ContextualEntityFilterer() \
.setInputCols("document", "token", "ner_chunk_subentity_docwise") \
.setOutputCol("filtered_ner_chunks") \
.setRules(rules)\
.setRuleScope("sentence")
nlpPipeline = nlp.Pipeline(
stages=[
documentAssembler,
tokenizer,
word_embeddings,
ner_deid,
ner_deid_converter,
contextual_entity_filterer
])
text = "NY, a 34-year-old woman, Dr. Michael Johnson cares wit her, at CarePlus Clinic, located at 456 Elm Street, NewYork, NY has recommended starting insulin therapy."
df = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(df).transform(df)
# result
+---------------+-----+---+---------+----------+
|chunk |begin|end|ner_label|confidence|
+---------------+-----+---+---------+----------+
|NY |0 |1 |STATE |0.9299 |
|34-year-old |6 |16 |AGE |0.7687 |
|Michael Johnson|29 |43 |DOCTOR |0.89965 |
|CarePlus Clinic|63 |77 |HOSPITAL |0.9661 |
|456 Elm Street |91 |104|STREET |0.7733667 |
|NewYork |107 |113|CITY |0.9302 |
|NY |116 |117|STATE |0.9991 |
+---------------+-----+---+---------+----------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols("document", "token")
.setOutputCol("embeddings")
val ner_deid = MedicalNerModel.pretrained("ner_deid_subentity_docwise", "en", "clinical/models")
.setInputCols("document", "token", "embeddings")
.setOutputCol("ner_deid_subentity_docwise")
val ner_deid_converter = new NerConverterInternal()
.setInputCols("document", "token", "ner_deid_subentity_docwise")
.setOutputCol("ner_chunk_subentity_docwise")
val rules =
"""
|[{
|"entity": "STATE",
| "scopeWindow": [2, 2],
| "whiteListEntities": ["CITY"],
| "blackListEntities": ["NAME"],
| "scopeWindowLevel": "token"
|
| }
| ]
|
|""".stripMargin
val contextual_entity_filterer = new ContextualEntityFilterer()
.setInputCols("document", "token", "ner_chunk_subentity_docwise")
.setOutputCol("filtered_ner_chunks")
.setRulesAsStr(rules)
.setRuleScope("sentence")
val nlpPipeline = new Pipeline().setStages(
Array(
documentAssembler,
tokenizer,
word_embeddings,
ner_deid,
ner_deid_converter,
contextual_entity_filterer
))
val text = "NY, a 34-year-old woman, Dr. Michael Johnson cares wit her, at CarePlus Clinic, located at 456 Elm Street, NewYork, NY has recommended starting insulin therapy."
val df = Seq(text).toDF("text")
val result = nlpPipeline.fit(df).transform(df)
// result
+---------------+-----+---+---------+----------+
|chunk |begin|end|ner_label|confidence|
+---------------+-----+---+---------+----------+
|NY |0 |1 |STATE |0.9299 |
|34-year-old |6 |16 |AGE |0.7687 |
|Michael Johnson|29 |43 |DOCTOR |0.89965 |
|CarePlus Clinic|63 |77 |HOSPITAL |0.9661 |
|456 Elm Street |91 |104|STREET |0.7733667 |
|NewYork |107 |113|CITY |0.9302 |
|NY |116 |117|STATE |0.9991 |
+---------------+-----+---+---------+----------+
ContextualEntityRuler
ContextualEntityRuler is an annotator that updates chunks based on contextual rules. These rules are defined as dictionaries and can include prefixes, suffixes, and the context within a specified scope window around the chunks. The annotator modifies detected chunks by replacing their entity labels or content when the defined patterns and rules match. It is particularly useful for refining entity recognition results according to specific needs.
Parameters:
- `setCaseSensitive`: Whether to perform case-sensitive matching while checking the context. Default is `False`.
- `setAllowPunctuationInBetween`: Whether to allow punctuation between prefix/suffix patterns and the entity. Default is `True`.
- `setDropEmptyChunks`: If `True`, removes chunks with empty content after applying the rules. Default is `False`.
- `setMergeOverlapping`: If `False`, returns both the modified entities and the original entities at the same time. Default is `True`.
- `rules`: The updating rules. Each rule is a dictionary with the following keys:
  - `entity`: The target entity label to modify. Example: `"AGE"`.
  - `prefixPatterns`: Array of patterns (words/phrases) to match before the entity. Example: `["years", "old"]` matches entities preceded by "years" or "old".
  - `suffixPatterns`: Array of patterns (words/phrases) to match after the entity. Example: `["years", "old"]` matches entities followed by "years" or "old".
  - `scopeWindowLevel`: Specifies the level of the scope window to consider. Valid values: `"token"` or `"char"`. Default: `"token"`.
  - `scopeWindow`: A tuple defining the range of tokens or characters (based on `scopeWindowLevel`) to include in the scope. Default for "token" level: `(2, 2)`; default for "char" level: `(10, 10)`. Example: `(2, 3)` means 2 tokens/characters before and 3 after the entity are considered.
  - `prefixRegexes`: Array of regular expressions to match before the entity. Example: `["\\b(years|months)\\b"]` matches words like "years" or "months" as prefixes.
  - `suffixRegexes`: Array of regular expressions to match after the entity. Example: `["\\b(old|young)\\b"]` matches words like "old" or "young" as suffixes.
  - `replaceEntity`: Optional string specifying the new entity label that replaces the target entity label. Example: `"MODIFIED_AGE"` replaces `"AGE"` with `"MODIFIED_AGE"` in matching cases.
  - `mode`: Specifies the operational mode for the rules. Possible values depend on the use case (e.g., `"include"`, `"exclude"`). Default: `"include"`.
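The example below relies on suffix patterns and suffix regexes; a prefix-driven rule follows the same shape (a minimal sketch; the label, patterns, and replacement name are illustrative):

```python
# Hypothetical rule: relabel AGE chunks preceded by "aged" or "age of" within two tokens.
prefix_rule = [{
    "entity": "AGE",
    "scopeWindowLevel": "token",
    "scopeWindow": [2, 2],
    "prefixPatterns": ["aged", "age of"],
    "replaceEntity": "Modified_Age",
    "mode": "include"
}]

contextual_entity_ruler = medical.ContextualEntityRuler() \
    .setInputCols("sentence", "token", "ner_chunks") \
    .setOutputCol("ruled_ner_chunks") \
    .setRules(prefix_rule) \
    .setCaseSensitive(False)
```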
Input Annotator Types: DOCUMENT, TOKEN, CHUNK
Output Annotator Type: CHUNK
Python API: ContextualEntityRuler | Scala API: ContextualEntityRuler | Notebook: ContextualEntityRuler |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("jsl_ner")
jsl_ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "jsl_ner"]) \
.setOutputCol("ner_chunks")
rules = [ {
"entity" : "Age",
"scopeWindow" : [15,15],
"scopeWindowLevel" : "char",
"suffixPatterns" : ["years old", "year old", "months",],
"replaceEntity" : "Modified_Age",
"mode" : "exclude"
},
{
"entity" : "Diabetes",
"scopeWindow" : [3,3],
"scopeWindowLevel" : "token",
"suffixPatterns" : ["with complications"],
"replaceEntity" : "Modified_Diabetes",
"mode" : "include"
},
{
"entity" : "Date",
"suffixRegexes" : ["\\d{4}"],
"replaceEntity" : "Modified_Date",
"mode" : "include"
}
]
contextual_entity_ruler = medical.ContextualEntityRuler() \
.setInputCols("sentence", "token", "ner_chunks") \
.setOutputCol("ruled_ner_chunks") \
.setRules(rules) \
.setCaseSensitive(False)\
.setDropEmptyChunks(True)\
.setAllowPunctuationInBetween(True)
ruler_pipeline = nlp.Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        jsl_ner,
        jsl_ner_converter,
        contextual_entity_ruler
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
ruler_model = ruler_pipeline.fit(empty_data)
text = """The Doctor assessed the 36 years old who has a history of the diabetes mellitus with complications in May, 2006"""
data = spark.createDataFrame([[text]]).toDF("text")
ruler_result = ruler_model.transform(data)
ruler_result.show(truncate=False)
# result
+-----------------+-----+---+------------------------------------+
|entity |begin|end|ruled_ner_chunks_result |
+-----------------+-----+---+------------------------------------+
|Modified_Age |28 |29 |36 |
|Modified_Diabetes|66 |101|diabetes mellitus with complications|
|Modified_Date |106 |114|May, 2006 |
+-----------------+-----+---+------------------------------------+
import spark.implicits._
val rules =
"""
|[
| {
| "entity": "Age",
| "scopeWindow": [15, 15],
| "scopeWindowLevel": "char",
| "suffixPatterns": ["years old", "year old", "months"],
| "replaceEntity": "Modified_Age",
| "mode": "exclude"
| },
| {
| "entity": "Diabetes",
| "scopeWindow": [3, 3],
| "scopeWindowLevel": "token",
| "suffixPatterns": ["with complications"],
| "replaceEntity": "Modified_Diabetes",
| "mode": "include"
| },
| {
| "entity": "Date",
| "suffixRegexes": ["\\d{4}"],
| "replaceEntity": "Modified_Date",
| "mode": "include"
| }
|]
""".stripMargin
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
val jslNer = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("jsl_ner")
val jslNerConverter = new NerConverterInternal()
.setInputCols("sentence", "token", "jsl_ner")
.setOutputCol("ner_chunks")
val contextualEntityRuler = new ContextualEntityRuler()
.setInputCols("sentence", "token", "ner_chunks")
.setOutputCol("ruled_ner_chunks")
.setRulesAsStr(rules)
.setCaseSensitive(false)
.setDropEmptyChunks(true)
.setAllowPunctuationInBetween(true)
val ruler_pipeline = new Pipeline().setStages(
Array(
documentAssembler,
sentenceDetector,
tokenizer,
wordEmbeddings,
jslNer,
jslNerConverter,
contextualEntityRuler
))
val text = "The Doctor assessed the 36 years old who has a history of the diabetes mellitus with complications in May, 2006"
val df = Seq(text).toDF("text")
val result = ruler_pipeline.fit(df).transform(df)
// result
+-----------------+-----+---+------------------------------------+
|entity |begin|end|ruled_ner_chunks_result |
+-----------------+-----+---+------------------------------------+
|Modified_Age |28 |30 |36 |
|Modified_Diabetes|66 |101|diabetes mellitus with complications|
|Modified_Date |106 |114|May, 2006 |
+-----------------+-----+---+------------------------------------+
ContextualParser
Creates a model that extracts entities from a document based on user-defined rules.
Rule matching is based on a RegexMatcher defined in a JSON file, which is set through the parameter setJsonPath(). In this JSON file you define the regex you want to match, along with the information that will be output in the metadata field. Additionally, a dictionary can be provided with setDictionary to map extracted entities to a unified representation. The first column of the dictionary file should be the normalized representation, with the following columns containing the possible matches.
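As an illustration, a hypothetical comma-separated dictionary file could look like the snippet below, with the normalized representation in the first column and the possible matches in the remaining columns; it would then be supplied via setDictionary (the exact read options, such as the delimiter, depend on the library version in use):

```
Diabetes,diabetes mellitus,DM,diabetic
Hypertension,HTN,high blood pressure
```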
Parameters:
- `inputCols`: The names of the columns containing the input annotations. It can read either a String column or an Array.
- `outputCol`: The name of the generated output column of Document type. Only one column can be specified.
- `jsonPath`: Path to the JSON file containing the regex patterns and rules to match the entities.
- `dictionary`: Path to a dictionary file in TSV or CSV format.
- `caseSensitive`: Whether to use case-sensitive matching of values.
- `prefixAndSuffixMatch`: Whether to match both prefix and suffix to annotate the match.
- `optionalContextRules`: When set to true, the regex match is output regardless of context matches.
- `shortestContextMatch`: When set to true, the search for matches stops as soon as prefix/suffix data is found in the text.
- `completeContextMatch`: Whether to do an exact match of prefix and suffix.

All the parameters can be set using the corresponding set method in camel case. For example, `.setInputCols()`.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: ContextualParserApproach | Scala API: ContextualParserApproach | Notebook: ContextualParserApproachNotebook |
Show Example
from johnsnowlabs import nlp, medical
# An example JSON file `regex_token.json` can look like this:
#
# {
# "entity": "Stage",
# "ruleScope": "sentence",
# "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
# "matchScope": "token"
# }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this
# Pipeline could then be defined like this
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
contextualParser = medical.ContextualParserApproach() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("entity") \
.setJsonPath("/path/to/regex_token.json") \
.setCaseSensitive(True) \
.setContextMatch(False)
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
])
# Define the parser (json file needs to be provided)
data = spark.createDataFrame([["A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... "]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)
+-------------------------------------------------------------------------------------------------------------------------+
|col |
+-------------------------------------------------------------------------------------------------------------------------+
|{chunk, 32, 39, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []} |
|{chunk, 49, 50, T5, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []} |
|{chunk, 148, 156, cT4bcN2M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 1}, []}|
|{chunk, 189, 194, T?N3M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 2}, []} |
|{chunk, 316, 323, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 3}, []} |
+-------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance
# An example JSON file `regex_token.json` can look like this:
#
# {
# "entity": "Stage",
# "ruleScope": "sentence",
# "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
# "matchScope": "token"
# }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this
# Pipeline could then be defined like this
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
# Define the parser (json file needs to be provided)
contextualParser = finance.ContextualParserApproach() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("entity") \
.setJsonPath("/path/to/regex_token.json") \
.setCaseSensitive(True) \
.setContextMatch(False)
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
])
# Define the parser (json file needs to be provided)
data = spark.createDataFrame([["Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City ."]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)
+---------------------------------------------------------------+
|result |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+
from johnsnowlabs import nlp, legal
# An example JSON file `regex_token.json` can look like this:
#
# {
# "entity": "Stage",
# "ruleScope": "sentence",
# "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
# "matchScope": "token"
# }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this
# Pipeline could then be defined like this
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
contextualParser = legal.ContextualParserApproach() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("entity") \
.setJsonPath("/path/to/regex_token.json") \
.setCaseSensitive(True) \
.setContextMatch(False)
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
])
# Define the parser (json file needs to be provided)
data = spark.createDataFrame([["Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City ."]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)
+---------------------------------------------------------------+
|result |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+
import spark.implicits._
// An example JSON file `regex_token.json` can look like this:
//
// {
// "entity": "Stage",
// "ruleScope": "sentence",
// "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
// "matchScope": "token"
// }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val contextualParser = new ContextualParserApproach()
.setInputCols(Array("sentence", "token"))
.setOutputCol("entity")
.setJsonPath("/path/to/regex_token.json")
.setCaseSensitive(true)
.setContextMatch(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
))
// Define the parser (json file needs to be provided)
val data = Seq("A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... ").toDF("text")
val result = pipeline.fit(data).transform(data)
// Show Results
//
// result.selectExpr("explode(entity)").show(5, truncate=false)
// +-------------------------------------------------------------------------------------------------------------------------+
// |col |
// +-------------------------------------------------------------------------------------------------------------------------+
// |{chunk, 32, 39, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []} |
// |{chunk, 49, 50, T5, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []} |
// |{chunk, 148, 156, cT4bcN2M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 1}, []}|
// |{chunk, 189, 194, T?N3M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 2}, []} |
// |{chunk, 316, 323, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 3}, []} |
// +-------------------------------------------------------------------------------------------------------------------------+
//
import spark.implicits._
// An example JSON file `regex_token.json` can look like this:
//
// {
// "entity": "Stage",
// "ruleScope": "sentence",
// "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
// "matchScope": "token"
// }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val contextualParser = new ContextualParserApproach()
.setInputCols(Array("sentence", "token"))
.setOutputCol("entity")
.setJsonPath("/path/to/regex_token.json")
.setCaseSensitive(true)
.setContextMatch(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
))
// Define the parser (json file needs to be provided)
val data = Seq("Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City .").toDF("text")
val result = pipeline.fit(data).transform(data)
// Show Results
+---------------------------------------------------------------+
|result |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+
import spark.implicits._
// An example JSON file `regex_token.json` can look like this:
//
// {
// "entity": "Stage",
// "ruleScope": "sentence",
// "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
// "matchScope": "token"
// }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val contextualParser = new ContextualParserApproach()
.setInputCols(Array("sentence", "token"))
.setOutputCol("entity")
.setJsonPath("/path/to/regex_token.json")
.setCaseSensitive(true)
.setContextMatch(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
))
// Define the parser (json file needs to be provided)
val data = Seq("Peter Parker is a nice guy and lives in New York . Bruce Wayne is also a nice guy and lives in San Antonio and Gotham City .").toDF("text")
val result = pipeline.fit(data).transform(data)
// Show Results
+---------------------------------------------------------------+
|result |
+---------------------------------------------------------------+
|[Peter Parker, New York, Bruce Wayne, San Antonio, Gotham City]|
+---------------------------------------------------------------+
DateNormalizer
This annotator transforms date mentions to a common standard format: YYYY/MM/DD. It is useful when using data from different sources, sometimes from different countries, that use different formats to represent dates.
For relative dates (next year, past month, etc.), you can define an anchor date used to create the normalized date by setting the parameters `anchorDateYear`, `anchorDateMonth`, and `anchorDateDay`.
The resulting chunk date will contain metadata indicating whether the normalization was successful or not (True / False).
Parameters:
- `anchorDateYear`: (Int) Sets an anchor year for relative dates such as "a day after tomorrow". If not set, the current year is used.
- `anchorDateMonth`: (Int) Sets an anchor month for relative dates such as "a day after tomorrow". If not set, the current month is used.
- `anchorDateDay`: (Int) Sets an anchor day of the month for relative dates such as "a day after tomorrow". If not set, the current day is used.
- `outputDateformat`: (String) Selects the output format to use. If not set, the dates will be formatted as `YYYY/MM/DD`. Options are:
  - `eu`: format the dates as `DD/MM/YYYY`
  - `us`: format the dates as `MM/DD/YYYY`
- `defaultReplacementDay`: (Int) Defines which value to use for the day when the original date entity has no day information. Defaults to 15.
- `defaultReplacementMonth`: (Int) Defines which value to use for the month when the original date entity has no month information. Defaults to 06.
- `defaultReplacementYear`: (Int) Defines which value to use for the year when the original date entity has no year information. Defaults to 2020.
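For relative dates, all three anchor parameters can be pinned so that the output does not depend on the day the pipeline runs (a minimal sketch; the anchor values are arbitrary and the column names mirror the example below):

```python
# Anchor relative expressions such as "next monday" or "today" to 2000-12-06.
date_normalizer = medical.DateNormalizer() \
    .setInputCols("date_chunk") \
    .setOutputCol("date") \
    .setAnchorDateYear(2000) \
    .setAnchorDateMonth(12) \
    .setAnchorDateDay(6)
```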
Input Annotator Types: CHUNK
Output Annotator Type: CHUNK
Python API: DateNormalizer | Scala API: DateNormalizer | Notebook: DateNormalizerNotebook |
Show Example
from johnsnowlabs import nlp, medical
from pyspark.sql.types import StringType
document_assembler = nlp.DocumentAssembler()\
.setInputCol("original_date")\
.setOutputCol("document")
doc2chunk = nlp.Doc2Chunk()\
.setInputCols("document")\
.setOutputCol("date_chunk")
date_normalizer = medical.DateNormalizer()\
.setInputCols("date_chunk")\
.setOutputCol("date")\
.setAnchorDateYear(2000)
pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])
dates = [
"08/02/2018",
"11/2018",
"11/01/2018",
"12Mar2021",
"Jan 30, 2018",
"13.04.1999",
"3April 2020",
"next monday",
"today",
"next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")
result = pipeline.fit(df).transform(df)
result.selectExpr(
"date.result as normalized_date",
"original_date",
"date.metadata[0].normalized as metadata",
).show()
+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
| [2018/08/02]| 08/02/2018| true|
| [2018/11/15]| 11/2018| true|
| [2018/11/01]| 11/01/2018| true|
| [2021/03/12]| 12Mar2021| true|
| [2018/01/30]| Jan 30, 2018| true|
| [1999/04/13]| 13.04.1999| true|
| [2020/04/03]| 3April 2020| true|
| [2000/12/11]| next monday| true|
| [2000/12/06]| today| true|
| [2000/12/13]| next week| true|
+---------------+-------------+--------+
from johnsnowlabs import nlp, finance
from pyspark.sql.types import StringType
document_assembler = nlp.DocumentAssembler()\
.setInputCol("original_date")\
.setOutputCol("document")
doc2chunk = nlp.Doc2Chunk()\
.setInputCols("document")\
.setOutputCol("date_chunk")
date_normalizer = finance.DateNormalizer()\
.setInputCols("date_chunk")\
.setOutputCol("date")\
.setAnchorDateYear(2000)
pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])
dates = [
"08/02/2018",
"11/2018",
"11/01/2018",
"12Mar2021",
"Jan 30, 2018",
"13.04.1999",
"3April 2020",
"next monday",
"today",
"next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")
result = pipeline.fit(df).transform(df)
result.selectExpr(
"date.result as normalized_date",
"original_date",
"date.metadata[0].normalized as metadata",
).show()
+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
| [2018/08/02]| 08/02/2018| true|
| [2018/11/15]| 11/2018| true|
| [2018/11/01]| 11/01/2018| true|
| [2021/03/12]| 12Mar2021| true|
| [2018/01/30]| Jan 30, 2018| true|
| [1999/04/13]| 13.04.1999| true|
| [2020/04/03]| 3April 2020| true|
| [2000/12/11]| next monday| true|
| [2000/12/06]| today| true|
| [2000/12/13]| next week| true|
+---------------+-------------+--------+
from johnsnowlabs import nlp, legal
from pyspark.sql.types import StringType
document_assembler = nlp.DocumentAssembler()\
.setInputCol("original_date")\
.setOutputCol("document")
doc2chunk = nlp.Doc2Chunk()\
.setInputCols("document")\
.setOutputCol("date_chunk")
date_normalizer = legal.DateNormalizer()\
.setInputCols("date_chunk")\
.setOutputCol("date")\
.setAnchorDateYear(2000)
pipeline = nlp.Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])
dates = [
"08/02/2018",
"11/2018",
"11/01/2018",
"12Mar2021",
"Jan 30, 2018",
"13.04.1999",
"3April 2020",
"next monday",
"today",
"next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")
result = pipeline.fit(df).transform(df)
+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
| [2018/08/02]| 08/02/2018| true|
| [2018/11/15]| 11/2018| true|
| [2018/11/01]| 11/01/2018| true|
| [2021/03/12]| 12Mar2021| true|
| [2018/01/30]| Jan 30, 2018| true|
| [1999/04/13]| 13.04.1999| true|
| [2020/04/03]| 3April 2020| true|
| [2000/12/11]| next monday| true|
| [2000/12/06]| today| true|
| [2000/12/13]| next week| true|
+---------------+-------------+--------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("original_date")
.setOutputCol("document")
val doc2chunk = new Doc2Chunk()
.setInputCols("document")
.setOutputCol("date_chunk")
val date_normalizer = new DateNormalizer()
.setInputCols("date_chunk")
.setOutputCol("date")
.setAnchorDateYear(2000)
val pipeline = new Pipeline().setStages(Array(
document_assembler,
doc2chunk,
date_normalizer
))
val df = Seq("08/02/2018", "11/2018", "11/01/2018", "12Mar2021", "Jan 30, 2018", "13.04.1999", "3April 2020", "next monday", "today", "next week").toDF("original_date")
val result = pipeline.fit(df).transform(df)
+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
| [2018/08/02]| 08/02/2018| true|
| [2018/11/15]| 11/2018| true|
| [2018/11/01]| 11/01/2018| true|
| [2021/03/12]| 12Mar2021| true|
| [2018/01/30]| Jan 30, 2018| true|
| [1999/04/13]| 13.04.1999| true|
| [2020/04/03]| 3April 2020| true|
| [2000/12/11]| next monday| true|
| [2000/12/06]| today| true|
| [2000/12/13]| next week| true|
+---------------+-------------+--------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("original_date")
.setOutputCol("document")
val doc2chunk = new Doc2Chunk()
.setInputCols("document")
.setOutputCol("date_chunk")
val date_normalizer = new DateNormalizer()
.setInputCols("date_chunk")
.setOutputCol("date")
.setAnchorDateYear(2000)
val pipeline = new Pipeline().setStages(Array(
document_assembler,
doc2chunk,
date_normalizer
))
val df = Seq("08/02/2018", "11/2018", "11/01/2018", "12Mar2021", "Jan 30, 2018", "13.04.1999", "3April 2020", "next monday", "today", "next week").toDF("original_date")
val result = pipeline.fit(df).transform(df)
+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
| [2018/08/02]| 08/02/2018| true|
| [2018/11/15]| 11/2018| true|
| [2018/11/01]| 11/01/2018| true|
| [2021/03/12]| 12Mar2021| true|
| [2018/01/30]| Jan 30, 2018| true|
| [1999/04/13]| 13.04.1999| true|
| [2020/04/03]| 3April 2020| true|
| [2000/12/11]| next monday| true|
| [2000/12/06]| today| true|
| [2000/12/13]| next week| true|
+---------------+-------------+--------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("original_date")
.setOutputCol("document")
val doc2chunk = new Doc2Chunk()
.setInputCols("document")
.setOutputCol("date_chunk")
val date_normalizer = new DateNormalizer()
.setInputCols("date_chunk")
.setOutputCol("date")
.setAnchorDateYear(2000)
val pipeline = new Pipeline().setStages(Array(
document_assembler,
doc2chunk,
date_normalizer
))
val df = Seq("08/02/2018", "11/2018", "11/01/2018", "12Mar2021", "Jan 30, 2018", "13.04.1999", "3April 2020", "next monday", "today", "next week").toDF("original_date")
val result = pipeline.fit(df).transform(df)
+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
| [2018/08/02]| 08/02/2018| true|
| [2018/11/15]| 11/2018| true|
| [2018/11/01]| 11/01/2018| true|
| [2021/03/12]| 12Mar2021| true|
| [2018/01/30]| Jan 30, 2018| true|
| [1999/04/13]| 13.04.1999| true|
| [2020/04/03]| 3April 2020| true|
| [2000/12/11]| next monday| true|
| [2000/12/06]| today| true|
| [2000/12/13]| next week| true|
+---------------+-------------+--------+
DeIdentification
Deidentification is a critical technology for facilitating the use of structured or unstructured clinical text while protecting patient privacy and confidentiality. The John Snow Labs team has invested great effort in developing methods and corpora for the deidentification of clinical text, PDFs, images, and DICOM files containing Protected Health Information (PHI):
- an individual’s past, present, or future physical or mental health or condition;
- the provision of health care to the individual;
- past, present, or future payment for the health care.
Protected health information includes many common identifiers (e.g., name, address, birth date, Social Security Number) when they can be associated with the health information.
Spark NLP for Healthcare offers several techniques and strategies for deidentification; the principal ones are:
Mask:
- entity_labels: Mask with the entity type of the chunk (default).
- same_length_chars: Mask the deidentified entities with asterisks (*) of the same length as the original, with brackets ([ , ]) on both ends.
- fixed_length_chars: Mask the deidentified entities with a fixed number of asterisks (*). The length is set using the setFixedMaskLength() method.
Obfuscation: replace sensitive entities with random values of the same type.
Faker: allows the user to use a set of fake entities held in the memory of spark-nlp-internal.
There is also an advanced option that allows deidentifying with multiple modes at the same time (Multi-Mode Deidentification). This annotator deidentifies input annotations of types DOCUMENT, TOKEN, and CHUNK by either masking or obfuscating the given CHUNKS.
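For instance, the fixed-length masking policy could be configured roughly as follows (a minimal sketch; the column names are assumed to match a standard deidentification pipeline such as the ones in the examples below):

```python
# Hypothetical configuration: mask every detected PHI chunk with exactly four asterisks.
deid_fixed_mask = medical.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deid_fixed") \
    .setMode("mask") \
    .setMaskingPolicy("fixed_length_chars") \
    .setFixedMaskLength(4)
```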
Parameters:
- `ageRanges`: (IntArrayParam) List of integers specifying the limits of the age groups to preserve during obfuscation.
- `blackList`: (StringArrayParam) List of entities that will be ignored in the regex file.
- `consistentObfuscation`: (BooleanParam) Whether to replace very similar entities in a document with the same randomized term (default: true). The similarity is based on the Levenshtein distance between the words.
- `dateFormats`: (StringArrayParam) Formats of the dates to displace.
- `dateTag`: (Param[String]) Tag representing dates in the NER entities (default: DATE).
- `dateToYear`: (BooleanParam) True if dates must be converted to years, false otherwise.
- `days`: (IntParam) Number of days by which to displace dates during obfuscation.
- `fixedMaskLength`: (IntParam) The length of the masking sequence used when the 'fixed_length_chars' masking policy is selected.
- `ignoreRegex`: (BooleanParam) Select whether to use the regex file loaded in the model.
- `isRandomDateDisplacement`: (BooleanParam) Whether to use a random number of displacement days for date entities; the random number is based on DeIdentificationParams.seed. If false, the value of DeIdentificationParams.days is used. The default value is false.
- `language`: (Param[String]) The language used to select the regex file and some faker entities: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French) or 'ro' (Romanian).
- `mappingsColumn`: (Param[String]) The mapping column that will return the annotation chunks together with the fake entities.
- `maskingPolicy`: (Param[String]) Select the masking policy. same_length_chars: replace the entity with a masking sequence of asterisks surrounded by square brackets, with the total length of the masking sequence equal to the length of the original; for example, Smith -> [***]. If the entity is shorter than 3 characters (like Jo, or 5), asterisks without brackets are returned. entity_labels: replace the values with the corresponding entity labels. fixed_length_chars: replace the entity with a masking sequence composed of a fixed number of asterisks.
- `minYear`: (IntParam) Minimum year to use when converting a date to a year.
- `mode`: (Param[String]) Mode for the anonymizer: 'mask' or 'obfuscate'.
- `obfuscateDate`: (BooleanParam) When mode == "obfuscate", whether to obfuscate dates or not.
- `obfuscateRefFile`: (Param[String]) File with the terms to be used for obfuscation.
- `obfuscateRefSource`: (Param[String]) The source used to obfuscate the entities. This does not apply to date entities.
- `outputAsDocument`: (BooleanParam) Whether to return all sentences joined into a single document.
- `refFileFormat`: (Param[String]) Format of the reference file for obfuscation; the default value is "csv".
- `refSep`: (Param[String]) Separator character for the CSV reference file for obfuscation; the default value is "#".
- `regexOverride`: (BooleanParam) If true, the regex entities take priority; if false, the NER entities take priority.
- `regexPatternsDictionary`: (ExternalResourceParam) Dictionary with regular expression patterns that match protected entities. If the dictionary is not set, the default regex file is used.
- `region`: (Param[String]) "usa" or "eu".
- `returnEntityMappings`: (BooleanParam) Whether to return the mapping column.
- `sameEntityThreshold`: (DoubleParam) Similarity threshold [0.0-1.0] for considering two appearances of an entity as the same (default: 0.9). This does not apply to date entities.
- `sameLengthFormattedEntities`: (StringArrayParam) List of formatted entities for which outputs of the same length as the originals are generated during obfuscation.
- `seed`: (IntParam) The seed used to select entities in obfuscate mode. With the seed you can replay an execution several times with the same output.
- `selectiveObfuscationModesPath`: (Param[String]) Path to the JSON dictionary that contains the selective obfuscation modes.
- `unnormalizedDateMode`: (Param[String]) The mode to use if the date is not formatted.
- `zipCodeTag`: (Param[String]) Tag representing zip codes in the obfuscation reference file (default: ZIP).
- `metadataMaskingPolicy`: (Param[String]) Options: 'entity_labels', 'same_length_chars', 'fixed_length_chars'. If set, the metadata includes the masked form of the document.
- `obfuscateByAgeGroups`: (BooleanParam) Whether to obfuscate ages based on age groups. When True, the age groups specified in the `ageGroups` parameter are used to obfuscate ages. When False, the age ranges specified in the `ageRanges` parameter are used. Default: False.
- `ageGroups`: A dictionary of age groups used to obfuscate ages. For this parameter to be active, the `obfuscateByAgeGroups` parameter must be true. If the given `ageGroups` do not fully cover the ages, the ages continue to be obfuscated according to the `ageRanges` parameter. The dictionary should contain the age group name as the key and a list of two integers as the value; the first integer is the lower bound of the age group and the second is the upper bound.
- `keepYear`: (BooleanParam) Whether to keep the year intact when obfuscating date entities. If True, the year remains unchanged during obfuscation. If False, the year is modified along with the month and day. Default: False.
- `keepMonth`: (BooleanParam) Whether to keep the month intact when obfuscating date entities. If True, the month remains unchanged during obfuscation. If False, the month is modified along with the year and day. Default: False.
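As an illustration of the age-group parameters above, assuming the usual camel-case setters for `obfuscateByAgeGroups` and `ageGroups` (the group names and bounds are arbitrary):

```python
# Hypothetical age groups: an age falling inside a group is obfuscated within that group's bounds.
deid_age_groups = medical.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate") \
    .setObfuscateByAgeGroups(True) \
    .setAgeGroups({"child": [0, 12], "teen": [13, 19], "adult": [20, 64], "senior": [65, 120]})
```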
To create a configured DeIdentificationModel, please see the example of DeIdentification.
Input Annotator Types: DOCUMENT, TOKEN, CHUNK
Output Annotator Type: DOCUMENT
Python API: DeIdentificationModel | Scala API: DeIdentificationModel | Notebook: DeIdentificationModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(True)
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_sensitive_entities = medical.NerModel \
.pretrained("ner_deid_enriched", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
nerConverter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
deIdentification = medical.DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models") \
.setInputCols(["ner_chunk", "token", "sentence"]) \
.setOutputCol("dei") \
.setMode("obfuscate") \
.setDateFormats(["MM/dd/yy","yyyy-MM-dd"]) \
.setObfuscateDate(True) \
.setDateTag("DATE") \
.setDays(5) \
.setObfuscateRefSource("both")
data = spark.createDataFrame([
["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
]).toDF("text")
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_sensitive_entities,
nerConverter,
deIdentification
])
result = pipeline.fit(data).transform(data)
result.select(F.expr("sentence.result as Input") ,F.expr("dei.result as deidentified")).show(truncate=100)
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
| Input| deidentified|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|[# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09.]|[# 1610960 Date : 01/18/93 PCP : Vida Rigger , 27 years-old , Record date : 2079-11-14.]|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, medical, finance, legal
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = legal.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("bert_embeddings")
fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
#.setLabelCasing("upper")
ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ORG": "COMPANY"}) # Replace "ORG" entity as "COMPANY"
ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
.setInputCols(["sentence", "token", "bert_embeddings"]) \
.setOutputCol("ner_finner")
#.setLabelCasing("upper")
ner_converter_finner = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner_finner"]) \
.setOutputCol("ner_finner_chunk")
# .setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER
chunk_merge = medical.ChunkMergeApproach()\
.setInputCols("ner_finner_chunk", "ner_chunk")\
.setOutputCol("deid_merged_chunk")
deidentification = finance.DeIdentification() \
.setInputCols(["sentence", "token", "deid_merged_chunk"]) \
.setOutputCol("deidentified") \
.setMode("mask")\
.setIgnoreRegex(True)
# Pipeline
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
bert_embeddings,
fin_ner,
ner_converter,
ner_finner,
ner_converter_finner,
chunk_merge,
deidentification])
data = spark.createDataFrame([
["Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon"]
]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
result.select("sentence.result", "deidentified.result").show(truncate = False)
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|result |result |
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|[Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon]|[<PERSON>, <DATE>, is an American entrepreneur, <ROLE> and <ROLE> of <ORG>]|
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = legal.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
#.setLabelCasing("upper")
ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ALIAS": "PARTY"})
ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner_signers")
#.setLabelCasing("upper")
ner_converter_signers = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner_signers"]) \
.setOutputCol("ner_signer_chunk")
chunk_merge = medical.ChunkMergeApproach()\
.setInputCols("ner_signer_chunk", "ner_chunk")\
.setOutputCol("deid_merged_chunk")
deidentification = legal.DeIdentification() \
.setInputCols(["sentence", "token", "deid_merged_chunk"]) \
.setOutputCol("deidentified") \
.setMode("mask")\
.setIgnoreRegex(True)
# Pipeline
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legal_ner,
ner_converter,
ner_signers,
ner_converter_signers,
chunk_merge,
deidentification])
data = spark.createDataFrame([["ENTIRE AGREEMENT. This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby,\
supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter. THE MART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi \
By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/2023 "]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
result.select("sentence.result", "deidentified.result").toPandas()
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence |deidentified |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ENTIRE AGREEMENT. |<DOC>. |
|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|
|THE MART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/2023 |<PARTY>.: <PARTY>.: By:Dominic <SIGNING_PERSON> By:Sanjay <SIGNING_PERSON> Name: <SIGNING_PERSON> Name: <SIGNING_PERSON> Title: <SIGNING_TITLE> Title: <SIGNING_TITLE> Date: 6/21/2023 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setUseAbbreviations(true)
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinicalSensitiveEntities = MedicalNerModel.pretrained("ner_deid_enriched", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val deIdentification = DeIdentificationModel.pretrained("deidentify_large", "en", "clinical/models")
.setInputCols(Array("ner_chunk", "token", "sentence"))
.setOutputCol("dei")
.setMode("obfuscate")
.setDateFormats(Array("MM/dd/yy", "yyyy-MM-dd"))
.setObfuscateDate(true)
.setDateTag("DATE")
.setDays(5)
.setObfuscateRefSource("both")
val data = Seq(
"# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."
).toDF("text")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinicalSensitiveEntities,
nerConverter,
deIdentification
))
val result = pipeline.fit(data).transform(data)
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
| Input| deidentified|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
|[# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09.]|[# 1610960 Date : 01/18/93 PCP : Vida Rigger , 27 years-old , Record date : 2079-11-14.]|
+-------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val bertEmbeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("bert_embeddings")
val finNer = FinanceNerModel.pretrained("finner_deid", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setReplaceLabels(Map("ORG" -> "COMPANY"))
val nerFinner = FinanceNerModel.pretrained("finner_org_per_role_date", "en", "finance/models")
.setInputCols(Array("sentence", "token", "bert_embeddings"))
.setOutputCol("ner_finner")
val nerConverterFinner = new NerConverter()
.setInputCols(Array("sentence", "token", "ner_finner"))
.setOutputCol("ner_finner_chunk")
val chunkMerge = new ChunkMergeApproach()
.setInputCols(Array("ner_finner_chunk", "ner_chunk"))
.setOutputCol("deid_merged_chunk")
val deidentification = new DeIdentification()
.setInputCols(Array("sentence", "token", "deid_merged_chunk"))
.setOutputCol("deidentified")
.setMode("mask")
.setIgnoreRegex(true)
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
bertEmbeddings,
finNer,
nerConverter,
nerFinner,
nerConverterFinner,
chunkMerge,
deidentification
))
val data = Seq(
"Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon"
).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|result |result |
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
|[Jeffrey Preston Bezos, DoB 12/01/1964, is an American entrepreneur, founder and CEO of Amazon]|[<PERSON>, <DATE>, is an American entrepreneur, <ROLE> and <ROLE> of <ORG>]|
+-----------------------------------------------------------------------------------------------+---------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val legalNer = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
.setLabelCasing("upper")
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setReplaceLabels(Map("ALIAS" -> "PARTY"))
val nerSigners = LegalNerModel.pretrained("legner_signers", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_signers")
.setLabelCasing("upper")
val nerConverterSigners = new NerConverter()
.setInputCols(Array("sentence", "token", "ner_signers"))
.setOutputCol("ner_signer_chunk")
val chunkMerge = new ChunkMergeApproach()
.setInputCols(Array("ner_signer_chunk", "ner_chunk"))
.setOutputCol("deid_merged_chunk")
val deidentification = new DeIdentification()
.setInputCols(Array("sentence", "token", "deid_merged_chunk"))
.setOutputCol("deidentified")
.setMode("mask")
.setIgnoreRegex(true)
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legalNer,
nerConverter,
nerSigners,
nerConverterSigners,
chunkMerge,
deidentification
))
val data = Seq(
"ENTIRE AGREEMENT. This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter. THE MART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/2023"
).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentence |deidentified |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ENTIRE AGREEMENT. |<DOC>. |
|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.|
|THE MART.COM, INC.: I-ESCROW, INC.: By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/2023 |<PARTY>.: <PARTY>.: By:Dominic <SIGNING_PERSON> By:Sanjay <SIGNING_PERSON> Name: <SIGNING_PERSON> Name: <SIGNING_PERSON> Title: <SIGNING_TITLE> Title: <SIGNING_TITLE> Date: 6/21/2023 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Contains all the methods for training a DeIdentificationModel. This module can obfuscate or mask the entities that contain personal information. These can be set with a file of regex patterns with setRegexPatternsDictionary, where each line is a mapping of entity to regex.
DATE \d{4}
AID \d{6,7}
Additionally, obfuscation strings can be defined with setObfuscateRefFile, where each line is a mapping of string to entity. The format and separator can be specified with setRefFileFormat and setRefSep.
Dr. Gregory House#DOCTOR
01010101#MEDICALRECORD
Ideally this annotator works in conjunction with demographic Named Entity Recognizers that can be trained using TextMatchers, RegexMatchers, DateMatchers, NerCRFs or NerDLs.
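For illustration, here is a minimal Python sketch of wiring both files into the annotator. The file names and the upstream columns ("sentence", "token", "ner_chunk") are assumptions for this sketch, not part of the full example below.
```python
from johnsnowlabs import medical

# Hypothetical file names, written in the formats described above.
with open("custom_regex_patterns.txt", "w") as f:
    f.write("DATE \\d{4}\nAID \\d{6,7}\n")

with open("custom_obfuscation.txt", "w") as f:
    f.write("Dr. Gregory House#DOCTOR\n01010101#MEDICALRECORD\n")

# Assumes an upstream pipeline that already produces "sentence", "token" and "ner_chunk".
deid = medical.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("obfuscate") \
    .setRegexPatternsDictionary("custom_regex_patterns.txt") \
    .setObfuscateRefFile("custom_obfuscation.txt") \
    .setRefFileFormat("csv") \
    .setRefSep("#") \
    .setObfuscateRefSource("file")
```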
Input Annotator Types: DOCUMENT, TOKEN, CHUNK
Output Annotator Type: DOCUMENT
Python API: DeIdentification | Scala API: DeIdentification |
Show Example
from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
#deid model with "entity_labels"
deid_entity_labels= medical.DeIdentification()\
.setInputCols(["sentence", "token", "ner_chunk"])\
.setOutputCol("deid_entity_label")\
.setMode("mask")\
.setReturnEntityMappings(True)\
.setMaskingPolicy("entity_labels")
obs_lines = """Marvin MARSHALL#PATIENT
Hubert GROGAN#PATIENT
ALTHEA COLBURN#PATIENT
Kalil AMIN#PATIENT
Inci FOUNTAIN#PATIENT
Ekaterina Rosa#DOCTOR
Rudiger Chao#DOCTOR
COLLETTE KOHLER#NAME
Mufi HIGGS#NAME"""
with open ('obfuscation.txt', 'w') as f:
f.write(obs_lines)
obfuscation = medical.DeIdentification()\
.setInputCols(["sentence", "token", "ner_chunk"]) \
.setOutputCol("deidentified") \
.setMode("obfuscate")\
.setObfuscateDate(True)\
.setObfuscateRefFile('obfuscation.txt')\
.setObfuscateRefSource("both")\
.setGenderAwareness(True)\
.setLanguage("en")\
.setUnnormalizedDateMode("obfuscate")  # obfuscateRefSource: file, faker or both; unnormalizedDateMode: obfuscate, mask or skip
deidPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
deid_entity_labels,
obfuscation
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = deidPipeline.fit(empty_data)
#sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
'''
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result,
result.deid_entity_label.result,
result.deidentified.result,
)).alias("cols")) \
.select(F.expr("cols['0']").alias("sentence"),
F.expr("cols['1']").alias("deid_entity_label"),
F.expr("cols['2']").alias("deidentified"),
).toPandas()
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
| sentence| deid_entity_label| deidentified|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
| Record date : 2093-01-13 , David Hale , M.D .| Record date : <DATE> , <NAME> , M.D .| Record date : 2093-01-25 , Daryl Dieter , M.D .|
| , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .| , Name : <NAME> , MR # <ID> Date : <DATE> .| , Name : Langston Papas , MR # 4784828 Date : 01/25/93 .|
| PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .|PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|PCP : Roseann Lederer , 23 years-old , Record date : 2079-11-21 .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .| <LOCATION> , <LOCATION> , Phone <CONTACT> .| 31 North St Joseph Ave , 400 Tickle St , Phone (59) 106-048 .|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
from johnsnowlabs import nlp, finance
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(True)
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
# Ner entities
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
nerConverter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_con")
# Deidentification
# setRegexPatternsDictionary: file with custom regex patterns for custom entities
# setObfuscateRefFile: file with custom obfuscator names for the entities
deIdentification = finance.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
    .setRefFileFormat("csv") \
    .setRefSep("#") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("file")
# Pipeline
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
nerConverter,
deIdentification
])
from johnsnowlabs import nlp, legal
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(True)
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
# Ner entities
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
nerConverter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_con")
# Deidentification
# setRegexPatternsDictionary: file with custom regex patterns for custom entities
# setObfuscateRefFile: file with custom obfuscator names for the entities
deIdentification = legal.DeIdentification() \
    .setInputCols(["ner_chunk", "token", "sentence"]) \
    .setOutputCol("dei") \
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt") \
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt") \
    .setRefFileFormat("csv") \
    .setRefSep("#") \
    .setMode("obfuscate") \
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"]) \
    .setObfuscateDate(True) \
    .setDateTag("DATE") \
    .setDays(5) \
    .setObfuscateRefSource("file")
# Pipeline
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
nerConverter,
deIdentification
])
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
// Sentence Detector annotator, processes various sentences per line
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
// Clinical word embeddings trained on PubMED dataset
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
// NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
//deid model with "entity_labels"
val deid_entity_labels= new DeIdentification()
.setInputCols(Array("ner_chunk", "token", "sentence"))
.setOutputCol("deid_entity_label")
.setMode("mask")
.setReturnEntityMappings(true)
.setMaskingPolicy("entity_labels")
//
val obs_lines = """Marvin MARSHALL#PATIENT
Hubert GROGAN#PATIENT
ALTHEA COLBURN#PATIENT
Kalil AMIN#PATIENT
Inci FOUNTAIN#PATIENT
Ekaterina Rosa#DOCTOR
Rudiger Chao#DOCTOR
COLLETTE KOHLER#NAME
Mufi HIGGS#NAME"""
//
val obfuscation = new DeIdentification()
.setInputCols(Array("ner_chunk", "token", "sentence"))
.setOutputCol("deidentified")
.setMode("obfuscate")
.setObfuscateDate(true)
.setObfuscateRefFile("obfuscation.txt")
.setObfuscateRefSource("both") //file or faker
.setGenderAwareness(true)
.setLanguage("en")
.setUnnormalizedDateMode("obfuscate") //mask or skip
val deidPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
deid_entity_labels,
obfuscation
))
//sample data
val text =
  """
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old ,
Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
"""
val data = Seq(text).toDF("text")
val result = deidPipeline.fit(data).transform(data)
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
| sentence| deid_entity_label| deidentified|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
| Record date : 2093-01-13 , David Hale , M.D .| Record date : <DATE> , <NAME> , M.D .| Record date : 2093-01-25 , Daryl Dieter , M.D .|
| , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .| , Name : <NAME> , MR # <ID> Date : <DATE> .| , Name : Langston Papas , MR # 4784828 Date : 01/25/93 .|
| PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .|PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|PCP : Roseann Lederer , 23 years-old , Record date : 2079-11-21 .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .| <LOCATION> , <LOCATION> , Phone <CONTACT> .| 31 North St Joseph Ave , 400 Tickle St , Phone (59) 106-048 .|
+-----------------------------------------------------------------------+-------------------------------------------------------+-----------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setUseAbbreviations(true)
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
// Ner entities
val ner_model = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_con")
// Deidentification
val deIdentification = new DeIdentification()
.setInputCols(Array("ner_chunk", "token", "sentence"))
.setOutputCol("dei")
// file with custom regex patterns for custom entities
.setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
// file with custom obfuscator names for the entities
.setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
.setRefFileFormat("csv")
.setRefSep("#")
.setMode("obfuscate")
.setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
.setObfuscateDate(true)
.setDateTag("DATE")
.setDays(5)
.setObfuscateRefSource("file")
// Pipeline
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
nerConverter,
deIdentification
))
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setUseAbbreviations(true)
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
// Ner entities
val ner_model = LegalNerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_con")
// Deidentification
val deIdentification = new DeIdentification()
.setInputCols(Array("ner_chunk", "token", "sentence"))
.setOutputCol("dei")
// file with custom regex patterns for custom entities
.setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
// file with custom obfuscator names for the entities
.setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
.setRefFileFormat("csv")
.setRefSep("#")
.setMode("obfuscate")
.setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
.setObfuscateDate(true)
.setDateTag("DATE")
.setDays(5)
.setObfuscateRefSource("file")
// Pipeline
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
nerConverter,
deIdentification
))
DistilBertForSequenceClassification
DistilBertForSequenceClassification can load DistilBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks.
Parameters:
- batchSize: Size of every batch. Default: 8
- coalesceSentences: Instead of one class per sentence (when inputCols is "sentence"), output one class per document by averaging the probabilities of all sentences. Default: False
- maxSentenceLength: Max sentence length to process. Default: 128
- caseSensitive: Whether to ignore case in tokens for embeddings matching. Default: True
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CATEGORY
Python API: DistilBertForSequenceClassification | Scala API: DistilBertForSequenceClassification |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
sequenceClassifier = medical.DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("classes")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
sequenceClassifier
])
data = spark.createDataFrame([["I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."],
["Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("text", "classes.result").show(truncate=100)
| text | result |
|------------------------------------------------------------------------------------------------|-------|
| I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numb... | [True] |
| Religare Capital Ranbaxy has been accepting approval for Diovan since 2012 | [False] |
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val sequenceClassifier = MedicalDistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_ade", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("classes")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
sequenceClassifier))
val data = Seq(
  "I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication.",
  "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"
).toDF("text")
val result = pipeline.fit(data).transform(data)
| text | result |
|------------------------------------------------------------------------------------------------|-------|
| I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numb... | [True] |
| Religare Capital Ranbaxy has been accepting approval for Diovan since 2012 | [False] |
Doc2ChunkInternal
Converts DOCUMENT, TOKEN typed annotations into CHUNK type with the contents of a chunkCol. Chunk text must be contained within the input DOCUMENT. The chunk column may be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.
Parameters:
- inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
- outputCol: The name of the column in Document type that is generated. We can specify only one column here.
All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
For more extended examples on document pre-processing see the Spark NLP Workshop.
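Since the examples below all use an array-typed target column, here is a minimal sketch of the plain-string case, with setIsArray left at False; the column names and sample data are illustrative only.
```python
from johnsnowlabs import nlp, medical

documentAssembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

# "target" holds a single string per row, so setIsArray stays False.
chunkAssembler = medical.Doc2ChunkInternal() \
    .setInputCols("document", "token") \
    .setChunkCol("target") \
    .setOutputCol("chunk") \
    .setIsArray(False)

pipeline = nlp.Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler])

data = spark.createDataFrame(
    [["Spark NLP is an open-source text processing library.", "Spark NLP"]]
).toDF("text", "target")

result = pipeline.fit(data).transform(data)
```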
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: Doc2ChunkInternal | Scala API: Doc2ChunkInternal | Notebook: Doc2ChunkInternalNotebook |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
chunkAssembler = medical.Doc2ChunkInternal()\
.setInputCols("document", "token")\
.setChunkCol("target")\
.setOutputCol("chunk")\
.setIsArray(True)
pipeline = nlp.Pipeline().setStages([
documentAssembler,
tokenizer,
chunkAssembler])
data = spark.createDataFrame(
[
[
"Spark NLP is an open-source text processing library for advanced natural language processing.",
["Spark NLP", "text processing library", "natural language processing"],
]
]
).toDF("text", "target")
result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
from johnsnowlabs import nlp, finance
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
chunkAssembler = finance.Doc2ChunkInternal()\
.setInputCols("document", "token")\
.setChunkCol("target")\
.setOutputCol("chunk")\
.setIsArray(True)
pipeline = nlp.Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler])
data = spark.createDataFrame(
[
[
"Spark NLP is an open-source text processing library for advanced natural language processing.",
["Spark NLP", "text processing library", "natural language processing"],
]
]
).toDF("text", "target")
result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
from johnsnowlabs import nlp, legal
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
chunkAssembler = legal.Doc2ChunkInternal()\
.setInputCols("document", "token")\
.setChunkCol("target")\
.setOutputCol("chunk")\
.setIsArray(True)
pipeline = nlp.Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler])
data = spark.createDataFrame(
[
[
"Spark NLP is an open-source text processing library for advanced natural language processing.",
["Spark NLP", "text processing library", "natural language processing"],
]
]
).toDF("text", "target")
result = pipeline.fit(data).transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val chunkAssembler = new Doc2ChunkInternal()
.setInputCols(Array("document", "token"))
.setChunkCol("target")
.setOutputCol("chunk")
.setIsArray(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
chunkAssembler
))
val data = Seq(("Spark NLP is an open-source text processing library for advanced natural language processing.",
"Spark NLP", "text processing library", "natural language processing")).toDF("text", "target")
val result = pipeline.fit(data).transform(data)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val chunkAssembler = new Doc2ChunkInternal()
.setInputCols(Array("document", "token"))
.setChunkCol("target")
.setOutputCol("chunk")
.setIsArray(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
chunkAssembler
))
val data = Seq(("Spark NLP is an open-source text processing library for advanced natural language processing.",
"Spark NLP", "text processing library", "natural language processing")).toDF("text", "target")
val result = pipeline.fit(data).transform(data)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val chunkAssembler = new Doc2ChunkInternal()
.setInputCols(Array("document", "token"))
.setChunkCol("target")
.setOutputCol("chunk")
.setIsArray(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
chunkAssembler
))
val data = Seq(("Spark NLP is an open-source text processing library for advanced natural language processing.",
"Spark NLP", "text processing library", "natural language processing")).toDF("text", "target")
val result = pipeline.fit(data).transform(data)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
DocMapper
DocMapper uses the text representation of document annotations to map clinical codes to other codes or relevant information.
Parameters:
- setRels (List[str]): Relations to use to map the document.
- setLowerCase (Boolean): Whether to map the documents in lower case (Default: True).
- setAllowMultiTokenChunk (Boolean): Whether to skip relations with multitokens (Default: True).
- setMultivaluesRelations (Boolean): Whether to return all values in a relation together or separately (Default: False).
- setDoExceptionHandling (Boolean): If set to True, the annotator tries to process as usual, and if exception-causing data (e.g. a corrupted record/document) is passed, an exception warning containing the exception message is emitted.
Input Annotator Types: DOCUMENT
Output Annotator Type: LABEL_DEPENDENCY
Python API: DocMapperModel | Scala API: DocMapperModel | Notebook: DocMapperModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F
#ChunkMapper Pipeline
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
#drug_action_treatment_mapper
docMapper= medical.DocMapperModel().pretrained("drug_action_treatment_mapper", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("mappings")\
.setRels(["action", "treatment"])
mapperPipeline = nlp.Pipeline().setStages([
document_assembler,
docMapper])
test_data = spark.createDataFrame([["Dermovate"], ["Aspagin"]]).toDF("text")
res = mapperPipeline.fit(test_data).transform(test_data)
# Show results
res.select(F.explode(F.arrays_zip(res.mappings.result,
res.mappings.metadata)).alias("col"))\
.select(F.expr("col['1']['entity']").alias("ner_chunk"),
F.expr("col['0']").alias("mapping_result"),
F.expr("col['1']['relation']").alias("relation"),
F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk|mapping_result |relation |all_mappings |
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Dermovate|anti-inflammatory |action |corticosteroids::: dermatological preparations:::very strong |
|Dermovate|lupus |treatment|discoid lupus erythematosus:::empeines:::psoriasis:::eczema |
|Aspagin |analgesic |action |anti-inflammatory:::antipyretic |
|Aspagin |ankylosing spondylitis|treatment|arthralgia:::pain:::bursitis:::headache:::migraine:::myositis:::neuralgia:::osteoarthritis:::gout:::rheumatoid arthritis:::spondylitis:::spondyloarthritis:::tendinitis:::tenosynovitis:::crush injury:::golfer's elbow|
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
// ChunkMapper Pipeline
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
// drug_action_treatment_mapper
val docMapper = DocMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("mappings")
.setRels(Array("action", "treatment"))
val mapperPipeline = new Pipeline().setStages(Array(
document_assembler,
docMapper))
val test_data = Seq(("Dermovate", "Aspagin")).toDF("text")
val res = mapperPipeline.fit(test_data).transform(test_data)
// Show results
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk|mapping_result |relation |all_mappings |
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Dermovate|anti-inflammatory |action |corticosteroids::: dermatological preparations:::very strong |
|Dermovate|lupus |treatment|discoid lupus erythematosus:::empeines:::psoriasis:::eczema |
|Aspagin |analgesic |action |anti-inflammatory:::antipyretic |
|Aspagin |ankylosing spondylitis|treatment|arthralgia:::pain:::bursitis:::headache:::migraine:::myositis:::neuralgia:::osteoarthritis:::gout:::rheumatoid arthritis:::spondylitis:::spondyloarthritis:::tendinitis:::tenosynovitis:::crush injury:::golfer's elbow|
+---------+----------------------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
DocMapper
DocMapper can be used to map short strings via DocumentAssembler, without needing any other annotator in between to convert the strings to the CHUNK type that ChunkMapperModel expects.
Parameters:
- setDictionary (Str): Path to the JSON dictionary that contains the mapping columns.
- setRels (List[str]): Relations to use to map the document.
- setLowerCase (Boolean): Whether to map the documents in lower case (Default: True).
- setAllowMultiTokenChunk (Boolean): Whether to skip relations with multitokens (Default: True).
- setMultivaluesRelations (Boolean): Whether to return all values in a relation together or separately (Default: False).
- setDoExceptionHandling (Boolean): If set to True, the annotator tries to process as usual, and if exception-causing data (e.g. a corrupted record/document) is passed, an exception warning containing the exception message is emitted.
Input Annotator Types: DOCUMENT
Output Annotator Type: LABEL_DEPENDENCY
Python API: DocMapperApproach | Scala API: DocMapperApproach | Notebook: DocMapperApproachNotebook |
Show Example
from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F
data_set= {
"mappings": [
{
"key": "metformin",
"relations": [
{
"key": "action",
"values" : ["hypoglycemic", "Drugs Used In Diabetes"]
},
{
"key": "treatment",
"values" : ["diabetes", "t2dm"]
}
]
}
]
}
import json
with open('sample_drug.json', 'w', encoding='utf-8') as f:
json.dump(data_set, f, ensure_ascii=False, indent=4)
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
chunkerMapper = medical.DocMapperApproach()\
.setInputCols(["document"])\
.setOutputCol("mappings")\
.setDictionary("./sample_drug.json")\
.setRels(["action"])
pipeline = nlp.Pipeline().setStages([document_assembler,
chunkerMapper])
test_data = spark.createDataFrame([["metformin"]]).toDF("text")
res = pipeline.fit(test_data).transform(test_data)
## Results
res.select(F.explode(F.arrays_zip(res.mappings.result,
res.mappings.metadata)).alias("col"))\
.select(F.expr("col['1']['entity']").alias("document"),
F.expr("col['0']").alias("mapping_result"),
F.expr("col['1']['relation']").alias("relation"),
F.expr("col['1']['all_relations']").alias("all_mappings")).show(truncate=False)
+---------+--------------+--------+----------------------+
|document |mapping_result|relation|all_mappings |
+---------+--------------+--------+----------------------+
|metformin|hypoglycemic |action |Drugs Used In Diabetes|
+---------+--------------+--------+----------------------+
import spark.implicits._
/* sample_drug.json file
{
"mappings": [
{
"key": "metformin",
"relations": [
{
"key": "action",
"values" : ["hypoglycemic", "Drugs Used In Diabetes"]
},
{
"key": "treatment",
"values" : ["diabetes", "t2dm"]
}
]
}
]
}
*/
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val chunkerMapper = new DocMapperApproach()
.setInputCols("document")
.setOutputCol("mappings")
.setDictionary("./sample_drug.json")
.setRels("action")
val pipeline = new Pipeline().setStages(Array(document_assembler, chunkerMapper))
val test_data = Seq("metformin").toDF("text")
val res = pipeline.fit(test_data).transform(test_data)
// Results
+---------+--------------+--------+----------------------+
|document |mapping_result|relation|all_mappings |
+---------+--------------+--------+----------------------+
|metformin|hypoglycemic |action |Drugs Used In Diabetes|
+---------+--------------+--------+----------------------+
DocumentFiltererByClassifier
The DocumentFiltererByClassifier annotator filters documents based on the outcomes generated by classifier annotators. It operates using a white list and a black list: the white list comprises classifier results that are allowed to pass through the filter, while the black list includes results that are blocked. The filter is case sensitive by default; setting caseSensitive to False makes it case-insensitive, allowing a broader range of matches. This annotator is an effective tool for systematically sorting and managing documents based on specific classifier outcomes, facilitating streamlined document handling and organization.
Parameters:
- whiteList (list): If defined, list of results to process. The rest will be ignored.
- caseSensitive (bool): Determines whether the definitions of the white listed results are case sensitive.
Input Annotator Types: DOCUMENT, CATEGORY
Output Annotator Type: DOCUMENT
Notebook: DocumentFiltererByClassifierNotebook |
Show Example
example = """Medical Specialty:
Cardiovascular / Pulmonary
Sample Name: Aortic Valve Replacement
Description: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
(Medical Transcription Sample Report)
DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with congestive heart failure. The patient has diabetes and is morbidly obese.
PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
ANESTHESIA: General endotracheal
INCISION: Median sternotomy
INDICATIONS: The patient presented with severe congestive heart failure associated with the patient's severe diabetes. The patient was found to have moderately stenotic aortic valve. In addition, The patient had significant coronary artery disease consisting of a chronically occluded right coronary artery but a very important large obtuse marginal artery coming off as the main circumflex system. The patient also has a left anterior descending artery which has moderate disease and this supplies quite a bit of collateral to the patient's right system. It was decided to perform a valve replacement as well as coronary artery bypass grafting procedure.
FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve leaflet is calcified and a severe restrictive leaflet motion. It is a tricuspid type of valve. The coronary artery consists of a large left anterior descending artery which is associated with 60% stenosis but a large obtuse marginal artery which has a tight proximal stenosis.
The radial artery was used for the left anterior descending artery. Flow was excellent. Looking at the targets in the posterior descending artery territory, there did not appear to be any large branches. On the angiogram these vessels appeared to be quite small. Because this is a chronically occluded vessel and the patient has limited conduit due to the patient's massive obesity, attempt to bypass to this area was not undertaken. The patient was brought to the operating room
PROCEDURE: The patient was brought to the operating room and placed in supine position. A median sternotomy incision was carried out and conduits were taken from the left arm as well as the right thigh. The patient weighs nearly three hundred pounds. There was concern as to taking down the left internal mammary artery. Because the radial artery appeared to be a good conduit The patient would have arterial graft to the left anterior descending artery territory. The patient was cannulated after the aorta and atrium were exposed and full heparinization.
The patient went on cardiopulmonary bypass and the aortic cross-clamp was applied Cardioplegia was delivered through the coronary sinuses in a retrograde manner. The patient was cooled to 32 degrees. Iced slush was applied to the heart. The aortic valve was then exposed through the aortic root by transverse incision. The valve leaflets were removed and the #23 St. Jude mechanical valve was secured into position by circumferential pledgeted sutures. At this point, aortotomy was closed.
The first obtuse marginal artery was a very large target and the vein graft to this target indeed produced an excellent amount of flow. Proximal anastomosis was then carried out to the foot of the aorta. The left anterior descending artery does not have severe disease but is also a very good target and the radial artery was anastomosed to this target in an end-to-side manner. The two proximal anastomoses were then carried out to the root of the aorta.
The patient came off cardiopulmonary bypass after aortic cross-clamp was released. The patient was adequately warmed. Protamine was given without adverse effect. Sternal closure was then done using wires. The subcutaneous layers were closed using Vicryl suture. The skin was approximated using staples.
"""
df = spark.createDataFrame([[example]]).toDF("text")
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")\
document_splitter = medical.InternalDocumentSplitter() \
.setInputCols("document")\
.setOutputCol("splits")\
.setSplitMode("recursive")\
.setChunkSize(100)\
.setChunkOverlap(3)\
.setExplodeSplits(True)\
.setPatternsAreRegex(False)\
.setSplitPatterns(["\n\n", "\n"])\
.setKeepSeparators(False)\
.setTrimWhitespace(True)
#.setEnableSentenceIncrement(False)
sequenceClassifier = medical.BertForSequenceClassification\
.pretrained('bert_sequence_classifier_clinical_sections', 'en', 'clinical/models')\
.setInputCols(["splits", "token"])\
.setOutputCol("prediction")\
.setCaseSensitive(False)
document_filterer = medical.DocumentFiltererByClassifier()\
.setInputCols(["splits", "prediction"])\
.setOutputCol("filteredDocuments")\
.setWhiteList(["Diagnostic and Laboratory Data"])\
.setCaseSensitive(False)
pipeline = nlp.Pipeline().setStages([
document_assembler,
tokenizer,
document_splitter,
sequenceClassifier,
#document_filterer
])
result = pipeline.fit(df).transform(df)
# before filterer result
result.selectExpr("splits.result[0] as splits",
"prediction.result[0] as classes"
).show(truncate=80)
+--------------------------------------------------------------------------------+------------------------------+
| splits| classes|
+--------------------------------------------------------------------------------+------------------------------+
|Medical Specialty:\nCardiovascular / Pulmonary\n\nSample Name: Aortic Valve R...| History|
|Description: Aortic valve replacement using a mechanical valve and two-vessel...|Complications and Risk Factors|
| (Medical Transcription Sample Report)|Complications and Risk Factors|
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel ...| Procedures|
| ANESTHESIA: General endotracheal\n\nINCISION: Median sternotomy| Procedures|
|INDICATIONS: The patient presented with severe congestive heart failure assoc...| Consultation and Referral|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|PROCEDURE: The patient was brought to the operating room and placed in supine...| Procedures|
|The patient went on cardiopulmonary bypass and the aortic cross-clamp was app...| Procedures|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
|The patient came off cardiopulmonary bypass after aortic cross-clamp was rele...| Procedures|
+--------------------------------------------------------------------------------+------------------------------+
# after filterer result
pipeline = nlp.Pipeline().setStages([
document_assembler,
tokenizer,
document_splitter,
sequenceClassifier,
document_filterer
])
result = pipeline.fit(df).transform(df)
from pyspark.sql.functions import col
result.selectExpr("filteredDocuments.result[0] as splits",
"filteredDocuments.metadata[0].class_label as classes")\
.filter(col("classes").isNotNull()).show(truncate=80)
+--------------------------------------------------------------------------------+------------------------------+
| splits| classes|
+--------------------------------------------------------------------------------+------------------------------+
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
+--------------------------------------------------------------------------------+------------------------------+
import spark.implicits._
val example = "Medical Specialty:
Cardiovascular / Pulmonary
Sample Name: Aortic Valve Replacement
Description: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
(Medical Transcription Sample Report)
DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with congestive heart failure. The patient has diabetes and is morbidly obese.
PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel coronary artery bypass grafting procedure using saphenous vein graft to the first obtuse marginal artery and left radial artery graft to the left anterior descending artery.
ANESTHESIA: General endotracheal
INCISION: Median sternotomy
INDICATIONS: The patient presented with severe congestive heart failure associated with the patient's severe diabetes. The patient was found to have moderately stenotic aortic valve. In addition, The patient had significant coronary artery disease consisting of a chronically occluded right coronary artery but a very important large obtuse marginal artery coming off as the main circumflex system. The patient also has a left anterior descending artery which has moderate disease and this supplies quite a bit of collateral to the patient's right system. It was decided to perform a valve replacement as well as coronary artery bypass grafting procedure.
FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve leaflet is calcified and a severe restrictive leaflet motion. It is a tricuspid type of valve. The coronary artery consists of a large left anterior descending artery which is associated with 60% stenosis but a large obtuse marginal artery which has a tight proximal stenosis.
The radial artery was used for the left anterior descending artery. Flow was excellent. Looking at the targets in the posterior descending artery territory, there did not appear to be any large branches. On the angiogram these vessels appeared to be quite small. Because this is a chronically occluded vessel and the patient has limited conduit due to the patient's massive obesity, attempt to bypass to this area was not undertaken. The patient was brought to the operating room
PROCEDURE: The patient was brought to the operating room and placed in supine position. A median sternotomy incision was carried out and conduits were taken from the left arm as well as the right thigh. The patient weighs nearly three hundred pounds. There was concern as to taking down the left internal mammary artery. Because the radial artery appeared to be a good conduit The patient would have arterial graft to the left anterior descending artery territory. The patient was cannulated after the aorta and atrium were exposed and full heparinization.
The patient went on cardiopulmonary bypass and the aortic cross-clamp was applied Cardioplegia was delivered through the coronary sinuses in a retrograde manner. The patient was cooled to 32 degrees. Iced slush was applied to the heart. The aortic valve was then exposed through the aortic root by transverse incision. The valve leaflets were removed and the #23 St. Jude mechanical valve was secured into position by circumferential pledgeted sutures. At this point, aortotomy was closed.
The first obtuse marginal artery was a very large target and the vein graft to this target indeed produced an excellent amount of flow. Proximal anastomosis was then carried out to the foot of the aorta. The left anterior descending artery does not have severe disease but is also a very good target and the radial artery was anastomosed to this target in an end-to-side manner. The two proximal anastomoses were then carried out to the root of the aorta.
The patient came off cardiopulmonary bypass after aortic cross-clamp was released. The patient was adequately warmed. Protamine was given without adverse effect. Sternal closure was then done using wires. The subcutaneous layers were closed using Vicryl suture. The skin was approximated using staples.
"
val df = Seq(example).toDF("text")
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val document_splitter = new InternalDocumentSplitter()
.setInputCols("document")
.setOutputCol("splits")
.setSplitMode("recursive")
.setChunkSize(100)
.setChunkOverlap(3)
.setExplodeSplits(true)
.setPatternsAreRegex(false)
.setSplitPatterns(Array("\n\n", "\n"))
.setKeepSeparators(false)
.setTrimWhitespace(true)
//.setEnableSentenceIncrement(false)
val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_clinical_sections","en","clinical/models")
.setInputCols(Array("splits","token"))
.setOutputCol("prediction")
.setCaseSensitive(false)
val document_filterer = new DocumentFiltererByClassifier()
.setInputCols(Array("splits","prediction"))
.setOutputCol("filteredDocuments")
.setWhiteList(Array("Diagnostic and Laboratory Data"))
.setCaseSensitive(false)
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
document_splitter,
sequenceClassifier,
  //document_filterer
))
val result = pipeline.fit(df).transform(df)
// before filterer result
+--------------------------------------------------------------------------------+------------------------------+
| splits| classes|
+--------------------------------------------------------------------------------+------------------------------+
|Medical Specialty:\nCardiovascular / Pulmonary\n\nSample Name: Aortic Valve R...| History|
|Description: Aortic valve replacement using a mechanical valve and two-vessel...|Complications and Risk Factors|
| (Medical Transcription Sample Report)|Complications and Risk Factors|
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|PROCEDURES: Aortic valve replacement using a mechanical valve and two-vessel ...| Procedures|
| ANESTHESIA: General endotracheal\n\nINCISION: Median sternotomy| Procedures|
|INDICATIONS: The patient presented with severe congestive heart failure assoc...| Consultation and Referral|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|PROCEDURE: The patient was brought to the operating room and placed in supine...| Procedures|
|The patient went on cardiopulmonary bypass and the aortic cross-clamp was app...| Procedures|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
|The patient came off cardiopulmonary bypass after aortic cross-clamp was rele...| Procedures|
+--------------------------------------------------------------------------------+------------------------------+
// after filterer result
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
document_splitter,
sequenceClassifier,
document_filterer ))
val result = pipeline.fit(df).transform(df)
+--------------------------------------------------------------------------------+------------------------------+
| splits| classes|
+--------------------------------------------------------------------------------+------------------------------+
|DIAGNOSIS: Aortic valve stenosis with coronary artery disease associated with...|Diagnostic and Laboratory Data|
|FINDINGS: The left ventricle is certainly hypertrophied· The aortic valve lea...|Diagnostic and Laboratory Data|
|The radial artery was used for the left anterior descending artery. Flow was ...|Diagnostic and Laboratory Data|
|The first obtuse marginal artery was a very large target and the vein graft t...|Diagnostic and Laboratory Data|
+--------------------------------------------------------------------------------+------------------------------+
DocumentFiltererByNER
The DocumentFiltererByNER annotator returns sentences containing the entity chunks you have filtered, allowing you to see only the sentences with the entities you want. It is particularly useful for extracting and organizing the results obtained from Spark NLP pipelines.
Parameters:
- blackList: If defined, list of entities to ignore. The rest will be processed.
- whiteList: If defined, list of entities to process. The rest will be ignored.
- caseSensitive: Determines whether the definitions of the white listed and black listed entities are case sensitive or not.
- outputAsDocument: Whether to return all sentences joined into a single document (default: False).
- joinString: The string inserted between results when combining them into a single document, if outputAsDocument is set to True (default: " ").
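For reference, a minimal sketch of the joined-output mode follows. It assumes the usual camel-case setters for the parameters above and an upstream pipeline that already produces "sentence" and "ner_chunk" columns, as in the example below; the output column name is illustrative.
```python
from johnsnowlabs import medical

# Return all matching sentences joined into a single document, separated by a space.
filterer = medical.DocumentFiltererByNER() \
    .setInputCols(["sentence", "ner_chunk"]) \
    .setOutputCol("filtered") \
    .setWhiteList(["Disease_Syndrome_Disorder"]) \
    .setCaseSensitive(False) \
    .setOutputAsDocument(True) \
    .setJoinString(" ")
```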
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: DOCUMENT
Python API: DocumentFiltererByNER | Scala API: DocumentFiltererByNER | Notebook: DocumentFiltererByNER |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols("document")\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_jsl = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
filterer = medical.DocumentFiltererByNER() \
.setInputCols(["sentence", "ner_chunk"]) \
.setOutputCol("filterer") \
.setWhiteList(["Disease_Syndrome_Disorder"])
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner_jsl,
ner_converter,
filterer])
df = spark.createDataFrame([
["Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus."],
["Most people infected with the virus will experience mild to moderate respiratory illness and recover without requiring special treatment."],
["However, some will become seriously ill and require medical attention. "],
["Older people and those with underlying medical conditions like cardiovascular disease, diabetes, chronic respiratory disease, or cancer are more likely to develop serious illness."],
["Anyone can get sick with COVID-19 and become seriously ill or die at any age."],
["The best way to prevent and slow down transmission is to be well informed about the disease and how the virus spreads."],
["Protect yourself and others from infection by staying at least 1 metre apart from others, wearing a properly fitted mask, and washing your hands or using an alcohol-based rub frequently."],
["Get vaccinated when it’s your turn and follow local guidance."],
["Stay home if you feel unwell."],
["If you have a fever, cough and difficulty breathing, seek medical attention."],
["The virus can spread from an infected person’s mouth or nose in small liquid particles when they cough, sneeze, speak, sing or breathe. "],
["These particles range from larger respiratory droplets to smaller aerosols. It is important to practice respiratory etiquette, for example by coughing into a flexed elbow, and to stay home and self-isolate until you recover if you feel unwell."]
]).toDF("text")
from pyspark.sql.window import Window as W
from pyspark.sql import functions as F
spark_df = df.coalesce(1).withColumn("idx", F.monotonically_increasing_id())
res = pipeline.fit(spark_df).transform(spark_df)
# Result
res.selectExpr("idx as doc_id","explode(filterer) as filter").show(truncate=80)
+------+--------------------------------------------------------------------------------+
|doc_id| filter|
+------+--------------------------------------------------------------------------------+
| 0|{document, 0, 86, Coronavirus disease (COVID-19) is an infectious DISAESE cau...|
| 1|{document, 0, 136, Most people infected with the virus will experience mild t...|
| 3|{document, 0, 178, Older people and those with underlying medical conditions ...|
| 6|{document, 0, 185, Protect yourself and others from infection by staying at l...|
| 10|{document, 0, 134, The virus can spread from an infected person’s mouth or no...|
+------+--------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel
.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val wordEmbeddings = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_jsl = NerModel
.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val filterer = new DocumentFiltererByNER()
.setInputCols(Array("sentence", "ner_chunk"))
.setOutputCol("filterer")
.setWhiteList(Array("Disease_Syndrome_Disorder"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
wordEmbeddings,
ner_jsl,
nerConverter,
filterer
))
val data = Seq(
"Coronavirus disease (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus.",
"Most people infected with the virus will experience mild to moderate respiratory illness and recover without requiring special treatment.",
"However, some will become seriously ill and require medical attention.",
"Older people and those with underlying medical conditions like cardiovascular disease, diabetes, chronic respiratory disease, or cancer are more likely to develop serious illness.",
"Anyone can get sick with COVID-19 and become seriously ill or die at any age.",
"The best way to prevent and slow down transmission is to be well informed about the disease and how the virus spreads.",
"Protect yourself and others from infection by staying at least 1 metre apart from others, wearing a properly fitted mask, and washing your hands or using an alcohol-based rub frequently.",
"Get vaccinated when it’s your turn and follow local guidance.",
"Stay home if you feel unwell.",
"If you have a fever, cough and difficulty breathing, seek medical attention.",
"The virus can spread from an infected person’s mouth or nose in small liquid particles when they cough, sneeze, speak, sing or breathe.",
"These particles range from larger respiratory droplets to smaller aerosols. It is important to practice respiratory etiquette, for example by coughing into a flexed elbow, and to stay home and self-isolate until you recover if you feel unwell."
).toDF("text")
val dfWithIdx = data.coalesce(1).withColumn("idx", monotonically_increasing_id())
val model = pipeline.fit(dfWithIdx)
val result = model.transform(dfWithIdx)
result.selectExpr("idx as doc_id", "explode(filterer) as filter").show(false)
// Result
+------+--------------------------------------------------------------------------------+
|doc_id| filter|
+------+--------------------------------------------------------------------------------+
|     0|{document, 0, 86, Coronavirus disease (COVID-19) is an infectious disease cau...|
| 1|{document, 0, 136, Most people infected with the virus will experience mild t...|
| 3|{document, 0, 178, Older people and those with underlying medical conditions ...|
| 6|{document, 0, 185, Protect yourself and others from infection by staying at l...|
| 10|{document, 0, 134, The virus can spread from an infected person’s mouth or no...|
+------+--------------------------------------------------------------------------------+
DocumentHashCoder
This annotator can replace dates in a column of DOCUMENT type according to the hash code of any other column. It uses the hash of the specified column to create a new document column containing the day-shift information. The DeIdentification annotator can then deidentify the document with the shifted date information.
If the specified column contains strings that can be parsed to integers, those numbers are used to make the shift in the data accordingly.
Parameters:
- setPatientIdColumn (String): Name of the column containing the patient ID.
- setDateShiftColumn (String): Sets the column to be used for the hash or a predefined shift (see the sketch below).
- setNewDateShift (String): Sets the column that has a reference of where the chunk begins.
- setRangeDays (int): Sets the range of dates to be sampled from.
- setSeed (int): Sets the seed for the random number generator.
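For instance, a minimal sketch of how a predefined shift column, or alternatively a random shift range, could be configured; it assumes the dateshift column present in the example data further below, and the range and seed values are illustrative only:

from johnsnowlabs import nlp, medical

# Hedged sketch: use the predefined "dateshift" column (present in the example data below)
# as the day shift instead of deriving it from the patient-ID hash alone.
documentHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setDateShiftColumn("dateshift")

# Alternative (illustrative values): sample a random shift, seeded, per patient ID hash.
randomHasher = medical.DocumentHashCoder()\
    .setInputCols("document")\
    .setOutputCol("document2")\
    .setPatientIdColumn("patientID")\
    .setNewDateShift("shift_days")\
    .setRangeDays(60)\
    .setSeed(100)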
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: DocumentHashCoder | Scala API: DocumentHashCoder | Notebook: DocumentHashCoderNotebook |
Show Example
from johnsnowlabs import nlp, medical
import pandas as pd
data = pd.DataFrame(
{'patientID' : ['A001', 'A001',
'A003', 'A003'],
'text' : ['Chris Brown was discharged on 10/02/2022',
'Mark White was discharged on 10/04/2022',
'John was discharged on 15/03/2022',
'John Moore was discharged on 15/12/2022'
],
'dateshift' : ['10', '10',
'30', '30']
}
)
my_input_df = spark.createDataFrame(data)
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
documentHasher = medical.DocumentHashCoder()\
.setInputCols("document")\
.setOutputCol("document2")\
.setPatientIdColumn("patientID")\
.setNewDateShift("shift_days")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document2"])\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["document2", "token"])\
.setOutputCol("word_embeddings")
clinical_ner = medical.NerModel\
.pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
.setInputCols(["document2","token", "word_embeddings"])\
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal()\
.setInputCols(["document2", "token", "ner"])\
.setOutputCol("ner_chunk")
de_identification = medical.DeIdentification() \
.setInputCols(["ner_chunk", "token", "document2"]) \
.setOutputCol("deid_text") \
.setMode("obfuscate") \
.setObfuscateDate(True) \
.setDateTag("DATE") \
.setLanguage("en") \
.setObfuscateRefSource('faker') \
.setUseShifDays(True)\
.setRegion('us')
pipeline = nlp.Pipeline().setStages([
documentAssembler,
documentHasher,
tokenizer,
embeddings,
clinical_ner,
ner_converter,
de_identification
])
empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")
pipeline_model = pipeline.fit(empty_data)
output = pipeline_model.transform(my_input_df)
output.select('patientID','text', 'deid_text.result').show(truncate = False)
+---------+----------------------------------------+---------------------------------------------+
|patientID|text |result |
+---------+----------------------------------------+---------------------------------------------+
|A001 |Chris Brown was discharged on 10/02/2022|[Aldona Bar was discharged on 05/18/2022] |
|A001 |Mark White was discharged on 02/28/2020 |[Leta Speller was discharged on 10/14/2019] |
|A002 |John was discharged on 03/15/2022 |[Lonia Blood was discharged on 01/19/2022] |
|A002 |John Moore was discharged on 12/31/2022 |[Murriel Hopper was discharged on 11/06/2022]|
+---------+----------------------------------------+---------------------------------------------+
from johnsnowlabs import nlp, finance
import pandas as pd
data = pd.DataFrame(
{'patientID' : ['A001', 'A001',
'A003', 'A003'],
'text' : ['Chris Brown was discharged on 10/02/2022',
'Mark White was discharged on 10/04/2022',
'John was discharged on 15/03/2022',
'John Moore was discharged on 15/12/2022'
],
'dateshift' : ['10', '10',
'30', '30']
}
)
my_input_df = spark.createDataFrame(data)
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
documentHasher = finance.DocumentHashCoder()\
.setInputCols("document")\
.setOutputCol("document2")\
.setPatientIdColumn("patientID")\
.setNewDateShift("shift_days")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document2"])\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["document2", "token"])\
.setOutputCol("word_embeddings")
clinical_ner = finance.NerModel\
.pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
.setInputCols(["document2","token", "word_embeddings"])\
.setOutputCol("ner")
ner_converter = finance.NerConverterInternal()\
.setInputCols(["document2", "token", "ner"])\
.setOutputCol("ner_chunk")
de_identification = finance.DeIdentification() \
.setInputCols(["ner_chunk", "token", "document2"]) \
.setOutputCol("deid_text") \
.setMode("obfuscate") \
.setObfuscateDate(True) \
.setDateTag("DATE") \
.setLanguage("en") \
.setObfuscateRefSource('faker') \
.setUseShifDays(True)\
.setRegion('us')
pipeline = nlp.Pipeline().setStages([
documentAssembler,
documentHasher,
tokenizer,
embeddings,
clinical_ner,
ner_converter,
de_identification
])
empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")
pipeline_model = pipeline.fit(empty_data)
output = pipeline_model.transform(my_input_df)
output.select('patientID','text', 'deid_text.result').show(truncate = False)
+---------+----------------------------------------+----------------------------------------------+
|patientID|text |result |
+---------+----------------------------------------+----------------------------------------------+
|A001 |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001 |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019] |
|A002 |John was discharged on 03/15/2022 |[Lane Hacker was discharged on 02/17/2022] |
|A002 |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
from johnsnowlabs import nlp, legal
import pandas as pd
data = pd.DataFrame(
{'patientID' : ['A001', 'A001',
'A003', 'A003'],
'text' : ['Chris Brown was discharged on 10/02/2022',
'Mark White was discharged on 10/04/2022',
'John was discharged on 15/03/2022',
'John Moore was discharged on 15/12/2022'
],
'dateshift' : ['10', '10',
'30', '30']
}
)
my_input_df = spark.createDataFrame(data)
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
documentHasher = legal.DocumentHashCoder()\
.setInputCols("document")\
.setOutputCol("document2")\
.setPatientIdColumn("patientID")\
.setNewDateShift("shift_days")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document2"])\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["document2", "token"])\
.setOutputCol("word_embeddings")
clinical_ner = legal.NerModel\
.pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
.setInputCols(["document2","token", "word_embeddings"])\
.setOutputCol("ner")
ner_converter = legal.NerConverterInternal()\
.setInputCols(["document2", "token", "ner"])\
.setOutputCol("ner_chunk")
de_identification = legal.DeIdentification() \
.setInputCols(["ner_chunk", "token", "document2"]) \
.setOutputCol("deid_text") \
.setMode("obfuscate") \
.setObfuscateDate(True) \
.setDateTag("DATE") \
.setLanguage("en") \
.setObfuscateRefSource('faker') \
.setUseShifDays(True)\
.setRegion('us')
pipeline = nlp.Pipeline().setStages([
documentAssembler,
documentHasher,
tokenizer,
embeddings,
clinical_ner,
ner_converter,
de_identification
])
empty_data = spark.createDataFrame([["", ""]]).toDF("text", "patientID")
pipeline_model = pipeline.fit(empty_data)
output = pipeline_model.transform(my_input_df)
output.select('patientID','text', 'deid_text.result').show(truncate = False)
+---------+----------------------------------------+----------------------------------------------+
|patientID|text |result |
+---------+----------------------------------------+----------------------------------------------+
|A001 |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001 |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019] |
|A002 |John was discharged on 03/15/2022 |[Lane Hacker was discharged on 02/17/2022] |
|A002 |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
import spark.implicits._
val data = Seq(
("A001", "Chris Brown was discharged on 10/02/2022"),
("A001", "Mark White was discharged on 02/28/2020"),
("A002", "John was discharged on 03/15/2022"),
("A002", "John Moore was discharged on 12/31/2022")
)
val columns = Seq("patientID", "text")
val myInputDF = spark.createDataFrame(data).toDF(columns: _*)
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val documentHasher = new DocumentHashCoder()
.setInputCols("document")
.setOutputCol("document2")
.setPatientIdColumn("patientID")
.setNewDateShift("shift_days")
val tokenizer = new Tokenizer()
.setInputCols("document2")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("document2","token"))
.setOutputCol("word_embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented","en","clinical/models")
.setInputCols(Array("document2","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("document2","token","ner"))
.setOutputCol("ner_chunk")
val de_identification = new DeIdentification()
.setInputCols(Array("ner_chunk","token","document2"))
.setOutputCol("deid_text")
.setMode("obfuscate")
.setObfuscateDate(true)
.setDateTag("DATE")
.setLanguage("en")
.setObfuscateRefSource("faker")
.setUseShifDays(true)
.setRegion("us")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
documentHasher,
tokenizer,
embeddings,
clinical_ner,
ner_converter,
de_identification
))
val emptyData = Seq(("", "")).toDF("text", "patientID")
val pipelineModel = pipeline.fit(emptyData)
val result = pipelineModel.transform(myInputDF)
result.select("patientID", "text", "deid_text.result").show(false)
+---------+----------------------------------------+----------------------------------------------+
|patientID|text |result |
+---------+----------------------------------------+----------------------------------------------+
|A001 |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001 |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019] |
|A002 |John was discharged on 03/15/2022 |[Lane Hacker was discharged on 02/17/2022] |
|A002 |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
import spark.implicits._
val data = Seq(
("A001", "Chris Brown was discharged on 10/02/2022"),
("A001", "Mark White was discharged on 02/28/2020"),
("A002", "John was discharged on 03/15/2022"),
("A002", "John Moore was discharged on 12/31/2022")
)
val columns = Seq("patientID", "text")
val myInputDF = spark.createDataFrame(data).toDF(columns: _*)
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val documentHasher = new DocumentHashCoder()
.setInputCols("document")
.setOutputCol("document2")
.setPatientIdColumn("patientID")
.setNewDateShift("shift_days")
val tokenizer = new Tokenizer()
.setInputCols("document2")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("document2","token"))
.setOutputCol("word_embeddings")
val clinical_ner = FinanceNerModel.pretrained("ner_deid_subentity_augmented","en","clinical/models")
.setInputCols(Array("document2","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("document2","token","ner"))
.setOutputCol("ner_chunk")
val de_identification = new DeIdentification()
.setInputCols(Array("ner_chunk","token","document2"))
.setOutputCol("deid_text")
.setMode("obfuscate")
.setObfuscateDate(true)
.setDateTag("DATE")
.setLanguage("en")
.setObfuscateRefSource("faker")
.setUseShifDays(true)
.setRegion("us")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
documentHasher,
tokenizer,
embeddings,
clinical_ner,
ner_converter,
de_identification
))
val emptyData = Seq(("", "")).toDF("text", "patientID")
val pipelineModel = pipeline.fit(emptyData)
val result = pipelineModel.transform(myInputDF)
result.select("patientID", "text", "deid_text.result").show(false)
+---------+----------------------------------------+----------------------------------------------+
|patientID|text |result |
+---------+----------------------------------------+----------------------------------------------+
|A001 |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001 |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019] |
|A002 |John was discharged on 03/15/2022 |[Lane Hacker was discharged on 02/17/2022] |
|A002 |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
import spark.implicits._
val data = Seq(
("A001", "Chris Brown was discharged on 10/02/2022"),
("A001", "Mark White was discharged on 02/28/2020"),
("A002", "John was discharged on 03/15/2022"),
("A002", "John Moore was discharged on 12/31/2022")
)
val columns = Seq("patientID", "text")
val myInputDF = spark.createDataFrame(data).toDF(columns: _*)
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val documentHasher = new DocumentHashCoder()
.setInputCols("document")
.setOutputCol("document2")
.setPatientIdColumn("patientID")
.setNewDateShift("shift_days")
val tokenizer = new Tokenizer()
.setInputCols("document2")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("document2","token"))
.setOutputCol("word_embeddings")
val clinical_ner = LegalNerModel.pretrained("ner_deid_subentity_augmented","en","clinical/models")
.setInputCols(Array("document2","token","word_embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("document2","token","ner"))
.setOutputCol("ner_chunk")
val de_identification = new DeIdentification()
.setInputCols(Array("ner_chunk","token","document2"))
.setOutputCol("deid_text")
.setMode("obfuscate")
.setObfuscateDate(true)
.setDateTag("DATE")
.setLanguage("en")
.setObfuscateRefSource("faker")
.setUseShifDays(true)
.setRegion("us")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
documentHasher,
tokenizer,
embeddings,
clinical_ner,
ner_converter,
de_identification
))
val emptyData = Seq(("", "")).toDF("text", "patientID")
val pipelineModel = pipeline.fit(emptyData)
val result = pipelineModel.transform(myInputDF)
result.select("patientID", "text", "deid_text.result").show(false)
+---------+----------------------------------------+----------------------------------------------+
|patientID|text |result |
+---------+----------------------------------------+----------------------------------------------+
|A001 |Chris Brown was discharged on 10/02/2022|[Andreas Newport was discharged on 04/09/2022]|
|A001 |Mark White was discharged on 02/28/2020 |[Kara Dies was discharged on 09/05/2019] |
|A002 |John was discharged on 03/15/2022 |[Lane Hacker was discharged on 02/17/2022] |
|A002 |John Moore was discharged on 12/31/2022 |[Orlena Sheldon was discharged on 12/05/2022] |
+---------+----------------------------------------+----------------------------------------------+
DocumentLogRegClassifier
Classifies documents with a Logistic Regression algorithm. Currently there are no pretrained models available. Please see DocumentLogRegClassifierApproach to train your own model.
Parameters:
- setMergeChunks: Sets whether to merge all chunks in a document or not (Default: false).
- setLabels: Sets the array to output the label in its original form.
- setVectorizationModel: Sets a path to the vectorization model if it has already been trained.
- setClassificationModel: Sets a path to the classification model if it has already been trained.
Please check out the Models Hub for available models in the future.
Input Annotator Types: TOKEN
Output Annotator Type: CATEGORY
Python API: DocumentLogRegClassifierModel | Scala API: DocumentLogRegClassifierModel |
Trains a model to classify documents with a Logistic Regression algorithm. Training data requires columns for text and their label. The result is a trained DocumentLogRegClassifierModel.
Parameters:
- maxIter: Maximum number of iterations (see the sketch after this list).
- tol: Convergence tolerance after each iteration.
- setLabels: Sets the array to output the label in its original form.
- setVectorizationModel: Sets a path to the vectorization model if it has already been trained.
- setClassificationModel: Sets a path to the classification model if it has already been trained.
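A minimal sketch of tuning the optimization parameters on the approach; the setMaxIter and setTol setter names are assumed here from the maxIter and tol parameters above, and the values are illustrative only:

from johnsnowlabs import nlp, medical

# Hedged sketch: tuning the underlying logistic regression.
# setMaxIter / setTol are assumed from the maxIter / tol parameters listed above.
logreg = medical.DocumentLogRegClassifierApproach() \
    .setInputCols(["stem"]) \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setMaxIter(10) \
    .setTol(1e-6)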
Input Annotator Types: TOKEN
Output Annotator Type: CATEGORY
Python API: DocumentLogRegClassifierApproach | Scala API: DocumentLogRegClassifierApproach |
Show Example
from johnsnowlabs import nlp, medical
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = nlp.Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized")
stopwords_cleaner = nlp.StopWordsCleaner() \
.setInputCols(["normalized"]) \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
stemmer = nlp.Stemmer() \
.setInputCols(["cleanTokens"]) \
.setOutputCol("stem")
# Define the document classifier and fit training data to it
logreg = medical.DocumentLogRegClassifierApproach() \
.setInputCols(["stem"]) \
.setLabelCol("category") \
.setOutputCol("prediction")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
])
model = pipeline.fit(trainingData)
from johnsnowlabs import nlp, finance
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = nlp.Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized")
stopwords_cleaner = nlp.StopWordsCleaner() \
.setInputCols(["normalized"]) \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
stemmer = nlp.Stemmer() \
.setInputCols(["cleanTokens"]) \
.setOutputCol("stem")
# Define the document classifier and fit training data to it
logreg = finance.DocumentLogRegClassifierApproach() \
.setInputCols(["stem"]) \
.setLabelCol("category") \
.setOutputCol("prediction")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
])
model = pipeline.fit(trainingData)
from johnsnowlabs import nlp, legal
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = nlp.Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized")
stopwords_cleaner = nlp.StopWordsCleaner() \
.setInputCols(["normalized"]) \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
stemmer = nlp.Stemmer() \
.setInputCols(["cleanTokens"]) \
.setOutputCol("stem")
# Define the document classifier and fit training data to it
logreg = legal.DocumentLogRegClassifierApproach() \
.setInputCols(["stem"]) \
.setLabelCol("category") \
.setOutputCol("prediction")
pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
])
model = pipeline.fit(trainingData)
import spark.implicits._
// Define pipeline stages to prepare the data
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
val stopwords_cleaner = new StopWordsCleaner()
.setInputCols("normalized")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val stemmer = new Stemmer()
.setInputCols("cleanTokens")
.setOutputCol("stem")
// Define the document classifier and fit training data to it
val logreg = new DocumentLogRegClassifierApproach()
.setInputCols("stem")
.setLabelCol("category")
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
))
val model = pipeline.fit(trainingData)
import spark.implicits._
// Define pipeline stages to prepare the data
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
val stopwords_cleaner = new StopWordsCleaner()
.setInputCols("normalized")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val stemmer = new Stemmer()
.setInputCols("cleanTokens")
.setOutputCol("stem")
// Define the document classifier and fit training data to it
val logreg = new DocumentLogRegClassifierApproach()
.setInputCols("stem")
.setLabelCol("category")
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
))
val model = pipeline.fit(trainingData)
import spark.implicits._
// Define pipeline stages to prepare the data
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
val stopwords_cleaner = new StopWordsCleaner()
.setInputCols("normalized")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val stemmer = new Stemmer()
.setInputCols("cleanTokens")
.setOutputCol("stem")
// Define the document classifier and fit training data to it
val logreg = new DocumentLogRegClassifierApproach()
.setInputCols("stem")
.setLabelCol("category")
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
))
val model = pipeline.fit(trainingData)
DocumentMLClassifier
DocumentMLClassifier classifies documents with a Logistic Regression algorithm.
Input Annotator Types: TOKEN
Output Annotator Type: CATEGORY
Python API: DocumentMLClassifierModel | Scala API: DocumentMLClassifierModel | Notebook: DocumentMLClassifierModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
classifier_ml = medical.DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")\
.setInputCols("token")\
.setOutputCol("prediction")
clf_Pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
classifier_ml])
data = spark.createDataFrame([["""I feel great after taking tylenol."""], ["""Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient."""]]).toDF("text")
result = clf_Pipeline.fit(data).transform(data)
# Show results
result.select('text','prediction.result').show(truncate=False)
+----------------------------------------------------------------------------------------+-------+
|text |result |
+----------------------------------------------------------------------------------------+-------+
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[False]|
|I feel great after taking tylenol. |[False]|
+----------------------------------------------------------------------------------------+-------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models")
.setInputCols("token")
.setOutputCol("prediction")
val clf_Pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
classifier_ml))
val data = Seq(
"I feel great after taking tylenol.",
"Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.").toDF("text")
val result = clf_Pipeline.fit(data).transform(data)
// Show results
+----------------------------------------------------------------------------------------+-------+
|text |result |
+----------------------------------------------------------------------------------------+-------+
|Detection of activated eosinophils in nasal polyps of an aspirin-induced asthma patient.|[False]|
|I feel great after taking tylenol. |[False]|
+----------------------------------------------------------------------------------------+-------+
Trains a model to classify documents with a Logistic Regression algorithm. Training data requires columns for text and their label. The result is a trained DocumentMLClassifierModel.
Parameters:
- labelCol: (str) Sets the column with the value result we are trying to predict.
- maxIter: (Int) Sets the maximum number of iterations.
- tol: (float) Sets the convergence tolerance after each iteration.
- fitIntercept: (str) Sets whether to fit an intercept term (default is true).
- vectorizationModelPath: (str) Sets a path to the vectorization model if it has already been trained.
- classificationModelPath: (str) Sets a path to the classification model if it has already been trained.
- classificationModelClass: (str) Sets the classification model class from SparkML to use; possible values are logreg and svm (see the sketch after this list).
- minTokenNgram: (int) Sets the minimum number of tokens for n-grams.
- maxTokenNgram: (int) Sets the maximum number of tokens for n-grams.
- mergeChunks: (boolean) Whether to merge all chunks in a document or not (Default: false).
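As a complement to the logreg example further below, a minimal sketch of an SVM-backed classifier with token n-gram features; the setMinTokenNgram and setMaxTokenNgram setter names are assumed here from the parameter names above:

from johnsnowlabs import nlp, medical

# Hedged sketch: SVM backend with uni- and bi-gram token features.
# setMinTokenNgram / setMaxTokenNgram are assumed from the parameter names above.
classifier_svm = medical.DocumentMLClassifierApproach() \
    .setInputCols("token") \
    .setLabelCol("category") \
    .setOutputCol("prediction") \
    .setClassificationModelClass("svm") \
    .setMinTokenNgram(1) \
    .setMaxTokenNgram(2) \
    .setFitIntercept(True)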
Input Annotator Types: TOKEN
Output Annotator Type: CATEGORY
Python API: DocumentMLClassifierApproach | Scala API: DocumentMLClassifierApproach | Notebook: DocumentMLClassifierApproachNotebook |
Show Example
from johnsnowlabs import nlp, medical
document = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
token = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
classifier_logreg = medical.DocumentMLClassifierApproach() \
.setInputCols("token") \
.setLabelCol("category") \
.setOutputCol("prediction") \
.setClassificationModelClass("logreg")\
.setFitIntercept(True)
pipeline = nlp.Pipeline(stages=[
document,
token,
classifier_logreg])
result_logreg = pipeline.fit(train_data).transform(test_data).cache()
import spark.implicits._
val document = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val token = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val classifier_logreg = new DocumentMLClassifierApproach()
.setInputCols("token")
.setLabelCol("category")
.setOutputCol("prediction")
.setClassificationModelClass("logreg")
.setFitIntercept(true)
val pipeline = new Pipeline().setStages(Array(
document,
token,
classifier_logreg))
val result_logreg = pipeline.fit(train_data).transform(test_data).cache()
DrugNormalizer
Annotator which normalizes raw text from clinical documents, e.g. scraped web pages or XML documents, from document-type columns into Sentence. It removes all dirty characters from the text following one or more input regex patterns, can apply unwanted-character removal with a specific policy, and can apply lowercase normalization.
Parameters:
- lowercase: (boolean) Whether to convert strings to lowercase. Default is False.
- policy: (str) Rule to remove patterns from text. Valid policy values are “all”, “abbreviations”, and “dosages” (see the sketch below).
See Spark NLP Workshop for more examples of usage.
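For instance, a minimal sketch of running each policy into its own output column so the rule sets can be compared on the same document; the setLowercase setter is assumed here from the lowercase parameter above:

from johnsnowlabs import nlp, medical

# Hedged sketch: one DrugNormalizer per policy, writing to separate columns.
# setLowercase is assumed from the lowercase parameter listed above.
abbr_normalizer = medical.DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_abbr") \
    .setPolicy("abbreviations")

dosage_normalizer = medical.DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_dosage") \
    .setPolicy("dosages") \
    .setLowercase(True)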
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: DrugNormalizer | Scala API: DrugNormalizer | Notebook: DrugNormalizerNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Sample data
data_to_normalize = spark.createDataFrame([
("A", "Sodium Chloride/Potassium Chloride 13bag", "Sodium Chloride / Potassium Chloride 13 bag"),
("B", "interferon alfa-2b 10 million unit ( 1 ml ) injec", "interferon alfa - 2b 10000000 unt ( 1 ml ) injection"),
("C", "aspirin 10 meq/ 5 ml oral sol", "aspirin 2 meq/ml oral solution")
]).toDF("cuid", "text", "target_normalized_text")
# Annotator that transforms a text column from dataframe into normalized text (with all policy)
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
drug_normalizer = medical.DrugNormalizer() \
.setInputCols("document") \
.setOutputCol("document_normalized") \
.setPolicy("all")
drug_normalizer_pipeline = nlp.Pipeline(stages=[
document_assembler,
drug_normalizer
])
ds = drug_normalizer_pipeline.fit(data_to_normalize).transform(data_to_normalize)
ds = ds.selectExpr("document", "target_normalized_text", "explode(document_normalized.result) as all_normalized_text")
ds.show(truncate = False)
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|document |target_normalized_text |all_normalized_text |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|[{document, 0, 39, Sodium Chloride/Potassium Chloride 13bag, {sentence -> 0}, []}] |Sodium Chloride / Potassium Chloride 13 bag |Sodium Chloride / Potassium Chloride 13 bag |
|[{document, 0, 48, interferon alfa-2b 10 million unit ( 1 ml ) injec, {sentence -> 0}, []}]|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|[{document, 0, 28, aspirin 10 meq/ 5 ml oral sol, {sentence -> 0}, []}] |aspirin 2 meq/ml oral solution |aspirin 2 meq/ml oral solution |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
import spark.implicits._
// Sample data
val data_to_normalize = Seq(
  ("A", "Sodium Chloride/Potassium Chloride 13bag", "Sodium Chloride / Potassium Chloride 13 bag"),
  ("B", "interferon alfa-2b 10 million unit ( 1 ml ) injec", "interferon alfa - 2b 10000000 unt ( 1 ml ) injection"),
  ("C", "aspirin 10 meq/ 5 ml oral sol", "aspirin 2 meq/ml oral solution")
).toDF("cuid", "text", "target_normalized_text")
// Annotator that transforms a text column from dataframe into normalized text (with all policy)
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val drug_normalizer = new DrugNormalizer()
.setInputCols("document")
.setOutputCol("document_normalized")
.setPolicy("all")
val drug_normalizer_pipeline = new Pipeline().setStages(Array(
document_assembler,
drug_normalizer))
val ds = drug_normalizer_pipeline.fit(data_to_normalize).transform(data_to_normalize)
ds.selectExpr("document", "target_normalized_text", "explode(document_normalized.result) as all_normalized_text").show(false)
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|document |target_normalized_text |all_normalized_text |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
|[{document, 0, 39, Sodium Chloride/Potassium Chloride 13bag, {sentence -> 0}, []}] |Sodium Chloride / Potassium Chloride 13 bag |Sodium Chloride / Potassium Chloride 13 bag |
|[{document, 0, 48, interferon alfa-2b 10 million unit ( 1 ml ) injec, {sentence -> 0}, []}]|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|[{document, 0, 28, aspirin 10 meq/ 5 ml oral sol, {sentence -> 0}, []}] |aspirin 2 meq/ml oral solution |aspirin 2 meq/ml oral solution |
+-------------------------------------------------------------------------------------------+----------------------------------------------------+----------------------------------------------------+
EntityChunkEmbeddings
Weighted average embeddings of multiple named entities chunk annotations.
Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector representation of related entity chunks. The input to the model consists of chunks of recognized named entities. One or more entities are selected as target entities, and for each of them a list of related entities is specified (if empty, all other entities are assumed to be related).
The model looks for chunks of the target entities and then tries to pair each target entity (e.g. DRUG) with other related entities (e.g. DOSAGE, STRENGTH, FORM, etc). The criterion for pairing a target entity with another related entity is that they appear in the same sentence and the maximal syntactic distance is below a predefined threshold.
The relationship between target and related entities is one-to-many, meaning that if there are multiple instances of the same target entity within a sentence, the model will map a related entity (e.g. DOSAGE) to at most one of the instances of the target entity. For example, in the sentence “The patient was given 125 mg of paracetamol and metformin”, the model will pair “125 mg” to “paracetamol”, but not to “metformin”.
The output of the model is an average embedding of the chunks of each of the target entities and their related entities. It is possible to specify a particular weight for each entity type.
An entity can be defined both as a target entity and as a related entity of some other target entity. For example, we may want to compute the embeddings of SYMPTOMs and their related entities, as well as the embeddings of DRUGs and their related entities, one of which is also SYMPTOM. In such cases, it is possible to use the TARGET_ENTITY:RELATED_ENTITY notation to specify the weight of a related entity (e.g. “DRUG:SYMPTOM” to set the weight of SYMPTOM when it appears as a related entity of the target entity DRUG). The relative weights of entities for particular entity chunk embeddings are available in the annotations metadata.
This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model.
Parameters:
- targetEntities: (dict) The target entities mapped to lists of their related entities. A target entity with an empty list of related entities means all other entities are assumed to be related to it. Entity names are case insensitive. It is mandatory to set at least one entity.
- entityWeights: (dict) The relative weights of drug-related entities. If not set, all entities have equal weights. If the list is non-empty and some entity is not in it, then its weight is set to 0. The notation TARGET_ENTITY:RELATED_ENTITY can be used to specify the weight of an entity which is related to a specific target entity (e.g. “DRUG:SYMPTOM” -> 0.3f). Entity names are case insensitive (see the sketch after this list).
- maxSyntacticDistance: (Int) Maximal syntactic distance between the drug entity and the other drug-related entities. Default value is 2.
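A minimal sketch of combining target entities with relative weights, including the TARGET_ENTITY:RELATED_ENTITY notation; the setEntityWeights and setMaxSyntacticDistance setter names are assumed here from the parameter names above, and the weight values are illustrative only:

from johnsnowlabs import nlp, medical

# Hedged sketch: weighted entity chunk embeddings. Entities not listed in the
# weights dict would get weight 0, per the description above. Weight values are
# illustrative; setEntityWeights / setMaxSyntacticDistance are assumed from the
# parameter names listed above.
entity_chunk_embeddings = medical.EntityChunkEmbeddings().pretrained("sbiobert_base_cased_mli", "en", "clinical/models") \
    .setInputCols(["ner_chunk", "dependencies"]) \
    .setOutputCol("drug_chunk_embeddings") \
    .setTargetEntities({"DRUG": []}) \
    .setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "DRUG:SYMPTOM": 0.3}) \
    .setMaxSyntacticDistance(3)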
The default model is "sbiobert_base_cased_mli" from "clinical/models".
Other available models can be found at Models Hub.
Input Annotator Types: DEPENDENCY, CHUNK
Output Annotator Type: SENTENCE_EMBEDDINGS
Python API: EntityChunkEmbeddingsModel | Scala API: EntityChunkEmbeddingsModel | Notebook: EntityChunkEmbeddingsModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
documenter = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols("document")\
    .setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
posology_ner_model = medical.NerModel().pretrained("ner_posology_large", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal()\
.setInputCols("sentence", "token", "ner")\
.setOutputCol("ner_chunk")
pos_tager = nlp.PerceptronModel().pretrained("pos_clinical", "en", "clinical/models")\
.setInputCols("sentence", "token")\
.setOutputCol("pos_tag")
dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
.setInputCols(["sentence", "pos_tag", "token"])\
.setOutputCol("dependencies")
entity_chunk_embeddings = medical.EntityChunkEmbeddings().pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("drug_chunk_embeddings")
entity_chunk_embeddings.setTargetEntities({"DRUG": ["STRENGTH", "ROUTE", "FORM"]})
rxnorm_re = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_re", "en", "clinical/models")\
.setInputCols(["drug_chunk_embeddings"])\
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
rxnorm_pipeline_re = nlp.Pipeline(
stages=[
documenter,
sentence_detector,
tokenizer,
embeddings,
posology_ner_model,
ner_converter,
pos_tager,
dependency_parser,
entity_chunk_embeddings,
rxnorm_re,
]
)
rxnorm_model = rxnorm_pipeline_re.fit(spark.createDataFrame([[""]]).toDF("text"))
data_df = spark.createDataFrame(
[
[
"The patient was given metformin 500 mg tablet, 2.5 mg of coumadin and then ibuprofen."
],
[
"The patient was given metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet"
],
]
).toDF("text")
results = rxnorm_model.transform(data_df)
results.select("drug_chunk_embeddings.result", "drug_chunk_embeddings.embeddings").show(truncate=200)
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result| embeddings|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [metformin 500 mg tablet, 2.5 mg coumadin, ibuprofen]|[[0.13060866, 0.26946265, -0.50702775, 0.7724293, 0.7356907, 0.0962475, -0.5546377, 0.0534295, -0.55345106, 0.48484787, -0.35735086, 0.49109104, 0.84404886, 0.30384326, -0.9923568, -0.24454081, 0.3...|
|[metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet]|[[-0.177948, 0.25489503, -0.5724586, 0.8031439, 0.9211674, 0.3558219, -0.37258363, -0.194855, -0.7407244, 0.48175216, 0.040639203, 0.6822441, 0.5768623, -0.19830275, -1.1513872, -0.32279214, 0.6181...|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val posology_ner_model = MedicalNerModel.pretrained("ner_posology_large","en","clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val pos_tager = PerceptronModel.pretrained("pos_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("pos_tag")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu","en")
.setInputCols(Array("sentence","pos_tag","token"))
.setOutputCol("dependencies")
val entity_chunk_embeddings = EntityChunkEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk","dependencies"))
.setOutputCol("drug_chunk_embeddings")
entity_chunk_embeddings.setTargetEntities(Map("DRUG" -> Array("STRENGTH", "ROUTE", "FORM")))
val rxnorm_re = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented_re","en","clinical/models")
.setInputCols("drug_chunk_embeddings")
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")
val rxnorm_pipeline_re = new Pipeline().setStages(Array(
documenter,
sentence_detector,
tokenizer,
embeddings,
posology_ner_model,
ner_converter,
pos_tager,
dependency_parser,
entity_chunk_embeddings,
rxnorm_re))
val data_df = Seq(( "The patient was given metformin 500 mg tablet,2.5 mg of coumadin and then ibuprofen." ), ( "The patient was given metformin 400 mg,coumadin 5 mg,coumadin,amlodipine 10 MG tablet" )).toDF("text")
val results = rxnorm_pipeline_re.fit(data_df).transform(data_df)
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result| embeddings|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [metformin 500 mg tablet, 2.5 mg coumadin, ibuprofen]|[[0.13060866, 0.26946265, -0.50702775, 0.7724293, 0.7356907, 0.0962475, -0.5546377, 0.0534295, -0.55345106, 0.48484787, -0.35735086, 0.49109104, 0.84404886, 0.30384326, -0.9923568, -0.24454081, 0.3...|
|[metformin 400 mg, coumadin 5 mg, coumadin, amlodipine 10 MG tablet]|[[-0.177948, 0.25489503, -0.5724586, 0.8031439, 0.9211674, 0.3558219, -0.37258363, -0.194855, -0.7407244, 0.48175216, 0.040639203, 0.6822441, 0.5768623, -0.19830275, -1.1513872, -0.32279214, 0.6181...|
+--------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
EntityRulerInternal
This annotator matches exact strings or regex patterns provided in a file against a Document and assigns them a named entity. The definitions can contain any number of named entities.
Parameters:
- setPatternsResource (str): Sets the resource in JSON or CSV format to map entities to patterns (see the sketch after this list).
  - path (str): Path to the resource.
  - read_as (str, optional): How to interpret the resource, by default ReadAs.TEXT.
  - options (dict, optional): Options for parsing the resource, by default {“format”: “JSON”}.
- setSentenceMatch (Boolean): Whether to find matches at sentence level. True: sentence level. False: token level.
- setAlphabetResource (str): Alphabet resource (a simple plain-text file with all language characters).
- setUseStorage (Boolean): Sets whether to use RocksDB storage to serialize patterns.
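The example below reads its patterns from an entities.json file whose contents are not shown. The following is a plausible sketch of such a file, written from Python; the label/patterns/id field names mirror the open-source EntityRuler JSON format and are an assumption here, while the labels match the example output below:

import json

# Hedged sketch of a possible "entities.json" patterns file for the example below.
# The "label" / "patterns" / "id" field names follow the open-source EntityRuler
# JSON format and are an assumption here.
patterns = [
    {"label": "Drug", "id": "drug-entities",
     "patterns": ["aspirin", "paracetamol", "amoxicillin", "ibuprofen", "lansoprazole"]},
    {"label": "Disease", "id": "disease-entities",
     "patterns": ["heart condition", "tonsilitis", "GORD"]},
    {"label": "Symptom", "id": "symptom-entities",
     "patterns": ["fever", "headache", "inflammation"]},
]

with open("entities.json", "w") as f:
    json.dump(patterns, f)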
See Spark NLP Workshop for more examples of usage.
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: CHUNK
Python API: EntityRulerInternal | Scala API: EntityRulerInternal |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
entityRuler = medical.EntityRulerInternalApproach()\
.setInputCols(["document", "token"])\
.setOutputCol("entities")\
.setPatternsResource("entities.json")\
    .setCaseSensitive(False)
pipeline = nlp.Pipeline().setStages([
documentAssembler,
tokenizer,
entityRuler
])
data = spark.createDataFrame([['''John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.''']]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
# Result
+---------------+-----+---+-------+
| chunk|begin|end| label|
+---------------+-----+---+-------+
| aspirin| 25| 31| Drug|
|heart condition| 41| 55|Disease|
| paracetamol| 69| 79| Drug|
| fever| 89| 93|Symptom|
| headache| 99|106|Symptom|
| tonsilitis| 129|138|Disease|
| ibuprofen| 141|149| Drug|
| lansoprazol| 177|187| Drug|
| GORD| 198|201|Disease|
+---------------+-----+---+-------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val entityRuler = new EntityRulerInternalApproach()
.setInputCols(Array("document", "token"))
.setOutputCol("entities")
.setPatternsResource("entities.json")
.setCaseSensitive(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
entityRuler
))
val data = Seq(
("""John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsillitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01.""")
).toDF("text")
val model = pipeline.fit(data)
// Result
+---------------+-----+---+-------+
| chunk|begin|end| label|
+---------------+-----+---+-------+
| aspirin| 25| 31| Drug|
|heart condition| 41| 55|Disease|
| paracetamol| 69| 79| Drug|
| fever| 89| 93|Symptom|
| headache| 99|106|Symptom|
| tonsilitis| 129|138|Disease|
| ibuprofen| 141|149| Drug|
| lansoprazol| 177|187| Drug|
| GORD| 198|201|Disease|
+---------------+-----+---+-------+
EntityRulerInternal will handle the chunks output based on the patterns defined, as shown in the example below. We can define an id field to identify entities.
Parameters:
- setPatternsResource (str): Sets the resource in JSON or CSV format to map entities to patterns.
  - path (str): Path to the resource.
  - read_as (str, optional): How to interpret the resource, by default ReadAs.TEXT.
  - options (dict, optional): Options for parsing the resource, by default {“format”: “JSON”}.
- setSentenceMatch (Boolean): Whether to find matches at sentence level. True: sentence level. False: token level.
- setAlphabetResource (str): Alphabet resource (a simple plain-text file with all language characters).
- setUseStorage (Boolean): Sets whether to use RocksDB storage to serialize patterns.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: EntityRulerInternalApproach | Scala API: EntityRulerInternal |
FeaturesAssembler
The FeaturesAssembler is used to collect features from different columns. It can collect features from single-value columns (anything which can be cast to a float; if the cast fails, the value is set to 0), array columns, or Spark NLP annotations (if the annotation is an embedding, it takes the embedding, otherwise it tries to cast the result field). The output of the transformer is a FEATURE_VECTOR annotation (the numeric vector is in the embeddings field).
Parameters:
- inputCols: The name of the columns containing the input annotations. It can read either a String column name or an Array of strings (column names).
- outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case, for example .setInputCols(). A sketch with plain numeric input columns follows.
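Since the input columns do not have to be annotations, here is a minimal sketch of assembling plain numeric DataFrame values into a feature vector; the age and bmi columns are hypothetical:

from johnsnowlabs import nlp, medical

# Hedged sketch: assembling raw numeric columns (hypothetical "age" and "bmi")
# into a FEATURE_VECTOR, with no upstream annotators required.
features_asm = medical.FeaturesAssembler() \
    .setInputCols(["age", "bmi"]) \
    .setOutputCol("features")

data_df = spark.createDataFrame([[54.0, 27.3], [61.0, 31.1]]).toDF("age", "bmi")
result = nlp.Pipeline(stages=[features_asm]).fit(data_df).transform(data_df)
result.select("features").show(truncate=False)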
Input Annotator Types: NONE
Output Annotator Type: FEATURE_VECTOR
Python API: FeaturesAssembler | Scala API: FeaturesAssembler | Notebook: FeatureAssemblerNotebook |
Show Example
from johnsnowlabs import medical, nlp
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
.setInputCols(["document", "word_embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
features_asm = medical.FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
embeddings_pipeline = nlp.Pipeline(
stages = [
document_assembler,
tokenizer,
word_embeddings,
sentence_embeddings,
features_asm
])
data_df = spark.createDataFrame(
[
[
"PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active flare of Inflammatory Bowel Disease, not responsive to conventional therapy including sulfasalazine, cortisone, local therapy. PROCEDURE: Informed consent was obtained prior to the procedure with special attention to benefits, risks, alternatives. Risks explained as bleeding, infection, bowel perforation, aspiration pneumonia, or reaction to the medications. Vital signs were monitored by blood pressure, heart rate, and oxygen saturation. Supplemental O2 given. Specifics discussed. Preprocedure physical exam performed. Stable vital signs. Lungs clear. Cardiac exam showed regular rhythm. Abdomen soft. Her past history, her past workup, her past visitation with me for Inflammatory Bowel Disease, well responsive to sulfasalazine reviewed. She currently has a flare and is not responding, therefore, likely may require steroid taper. At the same token, her symptoms are mild. She has rectal bleeding, essentially only some rusty stools. There is not significant diarrhea, just some lower stools. No significant pain. Therefore, it is possible that we are just dealing with a hemorrhoidal bleed, therefore, colonoscopy now needed. Past history reviewed. Specifics of workup, need for followup, and similar discussed. All questions answered. A normal digital rectal examination was performed. The PCF-160 AL was inserted into the anus and advanced to the cecum without difficulty, as identified by the ileocecal valve, cecal stump, and appendical orifice. All mucosal aspects thoroughly inspected, including a retroflexed examination. Withdrawal time was greater than six minutes. Unfortunately, the terminal ileum could not be intubated despite multiple attempts. Findings were those of a normal cecum, right colon, transverse colon, descending colon. A small cecal polyp was noted, this was biopsy-removed, placed in bottle #1. Random biopsies from the cecum obtained, bottle #2; random biopsies from the transverse colon obtained, as well as descending colon obtained, bottle #3. There was an area of inflammation in the proximal sigmoid colon, which was biopsied, placed in bottle #4. There was an area of relative sparing, with normal sigmoid lining, placed in bottle #5, randomly biopsied, and then inflammation again in the distal sigmoid colon and rectum biopsied, bottle #6, suggesting that we may be dealing with Crohn disease, given the relative sparing of the sigmoid colon and junk lesion. Retroflexed showed hemorrhoidal disease. Scope was then withdrawn, patient left in good condition. IMPRESSION: Active flare of Inflammatory Bowel Disease, question of Crohn disease. PLAN: I will have the patient follow up with me, will follow up on histology, follow up on the polyps. She will be put on a steroid taper and make an appointment and hopefully steroids alone will do the job. If not, she may be started on immune suppressive medication, such as azathioprine, or similar. All of this has been reviewed with the patient. All questions answered."
],
]
).toDF("text")
result = embeddings_pipeline.fit(data_df).transform(data_df)
result.select("features").show(truncate=False)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.00896873, 0.011731416, 0.12154201, 0.1149235, -0.14689414, 0.0103584975, 0.053073216, -0.056412186, -0.05143186, -0.0118978135, -0.12175384, -0.035894137, 0.11812756, 0.094671555, 0.15838866, 0.15260744, -0.004094441, -0.13675772, -0.07472433, -0.035856977, -0.026730005, -0.21840473, 0.029632289, -0.011515695, -0.20407394, -0.07848257, 0.040990185, 0.23028605, 0.077140555, 0.066990435, 0.015219222, -0.10295644, 0.038072545, 0.10786369, 0.121525764, -0.09569349, -0.06309264, 0.2778952, 0.06462455, -0.10851931, -0.14370486, -0.1466352, 0.08354363, -0.078758985, -0.08377953, 0.12384644, -0.23281692, -0.25607574, 0.16399069, -0.07780675, -0.18302177, -0.18325584, -0.12128636, -0.0010129504, 0.0070792097, 0.20506753, 0.034964647, 0.058425985, 0.19572404, -0.103953235, -0.20159312, -0.099047214, -0.07337802, -0.03713124, -0.055443633, 0.11107734, 0.048563413, -0.038048305, -0.020617828, 0.17082842, 0.069010496, 0.08457101, -0.038229663, 0.073144384, -0.092326105, -0.10054428, -4.3286112E-4, -0.046703782, -0.080231875, 0.02524295, 0.01368699, -0.19783853, -0.03501917, 0.13324805, 0.09053264, -0.0958231, -0.0032442473, 0.19218525, -0.027179888, 0.030672349, 0.12848215, -0.014700146, -0.089054875, 0.13839856, -0.15778734, 0.07103226, -0.060303356, 0.20854644, -0.008389737, -0.1473986]}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained()\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
.setInputCols(["document", "word_embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
features_asm = finance.FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
embeddings_pipeline = nlp.Pipeline(
stages = [
document_assembler,
tokenizer,
word_embeddings,
sentence_embeddings,
features_asm
])
data_df = spark.createDataFrame(
[
[
"Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."
],
]
).toDF("text")
result = embeddings_pipeline.fit(data_df).transform(data_df)
result.select("features").show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.05989722, 0.10907035, 0.25595385, -0.21656203, 0.20777024, -0.17276664, -0.045803867, -0.14506632, -0.16928527, -0.10008922, 0.18800992, -0.36529806, 0.22592439, -0.118487455, 0.006129823, -0.2674002, 0.37149927, 0.12375746, -0.30488327, 0.2507765, -0.060471725, -0.22705032, 0.39436466, 0.40368417, 0.15569581, 0.083455965, 0.11193783, -0.2783573, -0.23566169, -0.12444999, 0.22503565, 0.43343276, -0.3165808, -0.057086047, 0.050554093, 0.3512633, 0.17572127, 0.19258633, -0.09170296, -0.25344467, 0.018219033, -0.117947415, -0.03234701, -0.1549039, -0.0147800855, 0.076972865, 0.08612865, -0.14120182, -0.18348631, -0.4500436, 0.038739346, 0.12991442, -0.032128494, 0.7483725, -0.09843177, -1.6700389, 0.0060545397, -0.1044135, 1.2469376, 0.32064447, -0.17263599, 0.31999183, 0.0077194544, 0.15370668, 0.59472036, -0.16953614, 0.3042488, 0.25355336, 0.60402286, 0.07441569, -0.12468894, 0.03140718, -0.2630037, -0.37703836, 0.034783553, -0.058904923, 0.022686867, 0.07962498, -0.7945683, -0.21051218, 0.6615892, -0.18747853, -0.25412843, 0.26003888, -1.0803214, -0.026889319, -0.11805089, -0.14200646, -0.019682527, -0.2372327, 0.0090960255, -0.071929, -0.115089305, 0.21781716, -0.3569975, 0.07799677, -0.096894525, -0.34368798, 0.66465, 0.14913023]}]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained()\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
.setInputCols(["document", "word_embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
features_asm = legal.FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
embeddings_pipeline = nlp.Pipeline(
stages = [
document_assembler,
tokenizer,
word_embeddings,
sentence_embeddings,
features_asm
])
data_df = spark.createDataFrame(
[
[
"This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."
],
]
).toDF("text")
result = embeddings_pipeline.fit(data_df).transform(data_df)
result.select("features").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [0.02474357, -0.08310143, 0.4801927, -0.070223466, 0.33147717, -0.18737249, -0.048361354, -0.052325998, 0.053252153, -0.0067390013, 0.2836935, -0.25569317, 0.3415577, -0.19251995, 0.051623292, -0.25131556, 0.3472208, -0.036604006, -0.35653928, 0.13225944, 0.18795085, -0.09561886, 0.4695179, 0.22093144, 0.32058474, 0.057281215, 0.082858086, -0.3714214, -0.19219379, -0.26751986, -0.148075, 0.6410107, -0.07821157, -0.06398429, 6.32831E-5, 0.21222909, 0.33145514, 0.2575328, 0.009346781, -0.21482512, -0.22197871, -0.14005142, 0.04592571, -0.2919176, 0.011854073, -0.14047821, 0.22201888, -0.13500921, -0.101019345, -0.31175214, -0.0031539474, 0.07841865, 0.23760447, 0.8622971, -0.21095662, -1.9944092, -0.090888076, -0.45743433, 1.5815442, 0.4848822, -0.12528154, 0.33802572, -0.16203907, -0.09874586, 0.63106954, -0.21860953, 0.39005432, 0.25023165, 0.66769457, -0.13867687, 0.02832079, -0.17432508, -0.05764636, -0.44529453, 0.032839067, -0.2266792, -0.002856281, 0.007823931, -1.0165309, 0.08553613, 0.38090998, 0.011592574, -0.18031952, 0.37968582, -0.77948713, -0.068393, -0.029594865, -0.2165647, 0.1665183, -0.23963346, -0.017649503, -0.24768801, -0.2725593, 0.14533372, -0.36786577, 0.23388086, -0.20129707, -0.33582142, 0.5970527, 0.12596472]}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("word_embeddings")
val sentence_embeddings = new SentenceEmbeddings()
.setInputCols(Array("document", "word_embeddings"))
.setOutputCol("sentence_embeddings")
.setPoolingStrategy("AVERAGE")
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("features")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
word_embeddings,
sentence_embeddings,
features_asm
))
val data = Seq(
"PROCEDURES PERFORMED: Colonoscopy. INDICATIONS: Renewed symptoms likely consistent with active flare of Inflammatory Bowel Disease, not responsive to conventional therapy including sulfasalazine, cortisone, local therapy. PROCEDURE: Informed consent was obtained prior to the procedure with special attention to benefits, risks, alternatives. Risks explained as bleeding, infection, bowel perforation, aspiration pneumonia, or reaction to the medications. Vital signs were monitored by blood pressure, heart rate, and oxygen saturation. Supplemental O2 given. Specifics discussed. Preprocedure physical exam performed. Stable vital signs. Lungs clear. Cardiac exam showed regular rhythm. Abdomen soft. Her past history, her past workup, her past visitation with me for Inflammatory Bowel Disease, well responsive to sulfasalazine reviewed. She currently has a flare and is not responding, therefore, likely may require steroid taper. At the same token, her symptoms are mild. She has rectal bleeding, essentially only some rusty stools. There is not significant diarrhea, just some lower stools. No significant pain. Therefore, it is possible that we are just dealing with a hemorrhoidal bleed, therefore, colonoscopy now needed. Past history reviewed. Specifics of workup, need for followup, and similar discussed. All questions answered. A normal digital rectal examination was performed. The PCF-160 AL was inserted into the anus and advanced to the cecum without difficulty, as identified by the ileocecal valve, cecal stump, and appendical orifice. All mucosal aspects thoroughly inspected, including a retroflexed examination. Withdrawal time was greater than six minutes. Unfortunately, the terminal ileum could not be intubated despite multiple attempts. Findings were those of a normal cecum, right colon, transverse colon, descending colon. A small cecal polyp was noted, this was biopsy-removed, placed in bottle #1. Random biopsies from the cecum obtained, bottle #2; random biopsies from the transverse colon obtained, as well as descending colon obtained, bottle #3. There was an area of inflammation in the proximal sigmoid colon, which was biopsied, placed in bottle #4. There was an area of relative sparing, with normal sigmoid lining, placed in bottle #5, randomly biopsied, and then inflammation again in the distal sigmoid colon and rectum biopsied, bottle #6, suggesting that we may be dealing with Crohn disease, given the relative sparing of the sigmoid colon and junk lesion. Retroflexed showed hemorrhoidal disease. Scope was then withdrawn, patient left in good condition. IMPRESSION: Active flare of Inflammatory Bowel Disease, question of Crohn disease. PLAN: I will have the patient follow up with me, will follow up on histology, follow up on the polyps. She will be put on a steroid taper and make an appointment and hopefully steroids alone will do the job. If not, she may be started on immune suppressive medication, such as azathioprine, or similar. All of this has been reviewed with the patient. All questions answered."
).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
result.select("features").show(false)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.00896873, 0.011731416, 0.12154201, 0.1149235, -0.14689414, 0.0103584975, 0.053073216, -0.056412186, -0.05143186, -0.0118978135, -0.12175384, -0.035894137, 0.11812756, 0.094671555, 0.15838866, 0.15260744, -0.004094441, -0.13675772, -0.07472433, -0.035856977, -0.026730005, -0.21840473, 0.029632289, -0.011515695, -0.20407394, -0.07848257, 0.040990185, 0.23028605, 0.077140555, 0.066990435, 0.015219222, -0.10295644, 0.038072545, 0.10786369, 0.121525764, -0.09569349, -0.06309264, 0.2778952, 0.06462455, -0.10851931, -0.14370486, -0.1466352, 0.08354363, -0.078758985, -0.08377953, 0.12384644, -0.23281692, -0.25607574, 0.16399069, -0.07780675, -0.18302177, -0.18325584, -0.12128636, -0.0010129504, 0.0070792097, 0.20506753, 0.034964647, 0.058425985, 0.19572404, -0.103953235, -0.20159312, -0.099047214, -0.07337802, -0.03713124, -0.055443633, 0.11107734, 0.048563413, -0.038048305, -0.020617828, 0.17082842, 0.069010496, 0.08457101, -0.038229663, 0.073144384, -0.092326105, -0.10054428, -4.3286112E-4, -0.046703782, -0.080231875, 0.02524295, 0.01368699, -0.19783853, -0.03501917, 0.13324805, 0.09053264, -0.0958231, -0.0032442473, 0.19218525, -0.027179888, 0.030672349, 0.12848215, -0.014700146, -0.089054875, 0.13839856, -0.15778734, 0.07103226, -0.060303356, 0.20854644, -0.008389737, -0.1473986]}]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained()
.setInputCols(Array("document","token"))
.setOutputCol("word_embeddings")
val sentence_embeddings = new SentenceEmbeddings()
.setInputCols(Array("document", "word_embeddings"))
.setOutputCol("sentence_embeddings")
.setPoolingStrategy("AVERAGE")
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("features")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
word_embeddings,
sentence_embeddings,
features_asm
))
val data = Seq(
"Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."
).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
result.select("features").show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [-0.05989722, 0.10907035, 0.25595385, -0.21656203, 0.20777024, -0.17276664, -0.045803867, -0.14506632, -0.16928527, -0.10008922, 0.18800992, -0.36529806, 0.22592439, -0.118487455, 0.006129823, -0.2674002, 0.37149927, 0.12375746, -0.30488327, 0.2507765, -0.060471725, -0.22705032, 0.39436466, 0.40368417, 0.15569581, 0.083455965, 0.11193783, -0.2783573, -0.23566169, -0.12444999, 0.22503565, 0.43343276, -0.3165808, -0.057086047, 0.050554093, 0.3512633, 0.17572127, 0.19258633, -0.09170296, -0.25344467, 0.018219033, -0.117947415, -0.03234701, -0.1549039, -0.0147800855, 0.076972865, 0.08612865, -0.14120182, -0.18348631, -0.4500436, 0.038739346, 0.12991442, -0.032128494, 0.7483725, -0.09843177, -1.6700389, 0.0060545397, -0.1044135, 1.2469376, 0.32064447, -0.17263599, 0.31999183, 0.0077194544, 0.15370668, 0.59472036, -0.16953614, 0.3042488, 0.25355336, 0.60402286, 0.07441569, -0.12468894, 0.03140718, -0.2630037, -0.37703836, 0.034783553, -0.058904923, 0.022686867, 0.07962498, -0.7945683, -0.21051218, 0.6615892, -0.18747853, -0.25412843, 0.26003888, -1.0803214, -0.026889319, -0.11805089, -0.14200646, -0.019682527, -0.2372327, 0.0090960255, -0.071929, -0.115089305, 0.21781716, -0.3569975, 0.07799677, -0.096894525, -0.34368798, 0.66465, 0.14913023]}]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained()
.setInputCols(Array("document","token"))
.setOutputCol("word_embeddings")
val sentence_embeddings = new SentenceEmbeddings()
.setInputCols(Array("document", "word_embeddings"))
.setOutputCol("sentence_embeddings")
.setPoolingStrategy("AVERAGE")
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("features")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
word_embeddings,
sentence_embeddings,
features_asm
))
val data = Seq(
"This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."
).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
result.select("features").show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{feature_vector, 0, 0, , {sentence -> 0}, [0.02474357, -0.08310143, 0.4801927, -0.070223466, 0.33147717, -0.18737249, -0.048361354, -0.052325998, 0.053252153, -0.0067390013, 0.2836935, -0.25569317, 0.3415577, -0.19251995, 0.051623292, -0.25131556, 0.3472208, -0.036604006, -0.35653928, 0.13225944, 0.18795085, -0.09561886, 0.4695179, 0.22093144, 0.32058474, 0.057281215, 0.082858086, -0.3714214, -0.19219379, -0.26751986, -0.148075, 0.6410107, -0.07821157, -0.06398429, 6.32831E-5, 0.21222909, 0.33145514, 0.2575328, 0.009346781, -0.21482512, -0.22197871, -0.14005142, 0.04592571, -0.2919176, 0.011854073, -0.14047821, 0.22201888, -0.13500921, -0.101019345, -0.31175214, -0.0031539474, 0.07841865, 0.23760447, 0.8622971, -0.21095662, -1.9944092, -0.090888076, -0.45743433, 1.5815442, 0.4848822, -0.12528154, 0.33802572, -0.16203907, -0.09874586, 0.63106954, -0.21860953, 0.39005432, 0.25023165, 0.66769457, -0.13867687, 0.02832079, -0.17432508, -0.05764636, -0.44529453, 0.032839067, -0.2266792, -0.002856281, 0.007823931, -1.0165309, 0.08553613, 0.38090998, 0.011592574, -0.18031952, 0.37968582, -0.77948713, -0.068393, -0.029594865, -0.2165647, 0.1665183, -0.23963346, -0.017649503, -0.24768801, -0.2725593, 0.14533372, -0.36786577, 0.23388086, -0.20129707, -0.33582142, 0.5970527, 0.12596472]}]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
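The FEATURE_VECTOR annotations produced by FeaturesAssembler are typically consumed by a trainable downstream estimator such as GenericClassifierApproach. The following is a minimal sketch, not taken from the examples above: the GenericClassifierApproach stage, its setters, and the "category" label column in train_df are assumptions about the Healthcare library's API rather than a definitive implementation.
from johnsnowlabs import nlp, medical
# Hypothetical downstream stage: train a generic classifier on the "features"
# column produced by FeaturesAssembler, reusing the stages defined in the medical
# example above. Depending on the version, a TensorFlow graph may also need to be
# provided via .setModelFile().
generic_classifier = medical.GenericClassifierApproach()\
    .setInputCols(["features"])\
    .setOutputCol("prediction")\
    .setLabelColumn("category")\
    .setEpochsNumber(10)\
    .setBatchSize(32)
clf_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    features_asm,
    generic_classifier
])
clf_model = clf_pipeline.fit(train_df)  # train_df is assumed to contain "text" and "category" columns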
FewShotAssertionClassifierModel
FewShotAssertionClassifierModel performs assertion classification using large (LLM-based) few-shot classifiers built on the SetFit approach.
Parameters:
- batchSize (Int): Batch size
- caseSensitive (Bool): Whether the classifier is sensitive to text casing
- maxSentenceLength (Int): The maximum length of the input text
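These parameters can be set directly on the pretrained model. Below is a minimal sketch; the setter names are inferred from the parameter names above, the values are illustrative, and the model name is the one used in the full example further down, so treat all of these as assumptions.
from johnsnowlabs import medical
# Hypothetical configuration sketch: batch size and case handling on the
# pretrained few-shot assertion classifier (setter names inferred from the
# documented parameter names).
few_shot_assertion_classifier = medical.FewShotAssertionClassifierModel.pretrained(
    "fewhot_assertion_oncology_e5_base_v2_oncology", "en", "clinical/models")\
    .setInputCols(["assertion_embedding"])\
    .setOutputCol("assertion_fewshot")\
    .setBatchSize(8)\
    .setCaseSensitive(False)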
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: ASSERTION
Python API: FewShotAssertionClassifierModel | Scala API: FewShotAssertionClassifierModel | Notebook: FewShotAssertionClassifierModel |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")\
.setSplitChars(["-", "\/"])
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
# ner_oncology
ner_oncology = medical.NerModel.pretrained("ner_oncology","en","clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner_oncology")
ner_oncology_converter = medical.NerConverterInternal()\
.setInputCols(["sentence","token","ner_oncology"])\
.setOutputCol("ner_chunk")
few_shot_assertion_converter = medical.FewShotAssertionSentenceConverter()\
.setInputCols(["sentence", "token", "ner_chunk"])\
.setOutputCol("assertion_sentence")
e5_embeddings = nlp.E5Embeddings.pretrained("e5_base_v2_embeddings_medical_assertion_oncology", "en", "clinical/models")\
.setInputCols(["assertion_sentence"])\
.setOutputCol("assertion_embedding")
few_shot_assertion_classifier = medical.FewShotAssertionClassifierModel()\
.pretrained("fewhot_assertion_oncology_e5_base_v2_oncology", "en", "clinical/models")\
.setInputCols(["assertion_embedding"])\
.setOutputCol("assertion_fewshot")
assertion_pipeline = nlp.Pipeline(
stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_oncology,
ner_oncology_converter,
few_shot_assertion_converter,
e5_embeddings,
few_shot_assertion_classifier
])
sample_text= [
"""The patient is suspected to have colorectal cancer. Her family history is positive for other cancers. The result of the biopsy was positive. A CT scan was ordered to rule out metastases."""
]
data = spark.createDataFrame([sample_text]).toDF("text")
result = assertion_pipeline.fit(data).transform(data)
## Result
+-----------------+-----+---+----------------+---------+----------+
|ner_chunk |begin|end|ner_label |assertion|confidence|
+-----------------+-----+---+----------------+---------+----------+
|colorectal cancer|33 |49 |Cancer_Dx |Possible |0.5812815 |
|Her |52 |54 |Gender |Present |0.9562998 |
|cancers |93 |99 |Cancer_Dx |Family |0.23465642|
|biopsy |120 |125|Pathology_Test |Past |0.95732147|
|positive |131 |138|Pathology_Result|Present |0.9564386 |
|CT scan |143 |149|Imaging_Test |Past |0.9571699 |
|metastases |175 |184|Metastasis |Possible |0.54986554|
+-----------------+-----+---+----------------+---------+----------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel
.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
.setSplitChars(Array("-", "/"))
val wordEmbeddings = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val nerOncology = MedicalNerModel
.pretrained("ner_oncology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_oncology")
val nerOncologyConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_oncology"))
.setOutputCol("ner_chunk")
val fewShotAssertionConverter = new FewShotAssertionSentenceConverter()
.setInputCols(Array("sentence", "token", "ner_chunk"))
.setOutputCol("assertion_sentence")
val e5Embeddings = E5Embeddings
.pretrained("e5_base_v2_embeddings_medical_assertion_oncology", "en", "clinical/models")
.setInputCols(Array("assertion_sentence"))
.setOutputCol("assertion_embedding")
val fewShotAssertionClassifier = FewShotAssertionClassifierModel
.pretrained("fewhot_assertion_oncology_e5_base_v2_oncology", "en", "clinical/models")
.setInputCols(Array("assertion_embedding"))
.setOutputCol("assertion_fewshot")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
wordEmbeddings,
nerOncology,
nerOncologyConverter,
fewShotAssertionConverter,
e5Embeddings,
fewShotAssertionClassifier
))
val sampleText = Seq("The patient is suspected to have colorectal cancer. Her family history is positive for other cancers. The result of the biopsy was positive. A CT scan was ordered to rule out metastases.")
val data = sampleText.toDF("text")
val result = pipeline.fit(data).transform(data)
result.show(false)
// Result
+-----------------+-----+---+----------------+---------+----------+
|ner_chunk |begin|end|ner_label |assertion|confidence|
+-----------------+-----+---+----------------+---------+----------+
|colorectal cancer|33 |49 |Cancer_Dx |Possible |0.5812815 |
|Her |52 |54 |Gender |Present |0.9562998 |
|cancers |93 |99 |Cancer_Dx |Family |0.23465642|
|biopsy |120 |125|Pathology_Test |Past |0.95732147|
|positive |131 |138|Pathology_Result|Present |0.9564386 |
|CT scan |143 |149|Imaging_Test |Past |0.9571699 |
|metastases |175 |184|Metastasis |Possible |0.54986554|
+-----------------+-----+---+----------------+---------+----------+
Trains a TensorFlow model for few-shot assertion classification.
To train a custom few-shot assertion classifier model, first create a TensorFlow graph using either the TFGraphBuilder annotator or the tf_graph module, then set the path to that graph with the .setModelFile("path/to/tensorflow_graph.pb") method.
Parameters:
- batchSize: (int) Batch size
- dropout: (float) Dropout coefficient
- epochsN: (int) Maximum number of epochs to train
- featureScaling: (str) Feature scaling method. Possible values are 'zscore', 'minmax' or empty (no scaling)
- fixImbalance: (boolean) Fix the imbalance in the training set by replicating examples of under-represented categories
- labelColumn: (str) Column with the label for each document
- learningRate: (float) Learning rate
- modelFile: (str) Location of the file of the model used for classification
- multiClass: (boolean) If multiClass is set, the model will return all labels with their corresponding scores. By default, multiClass is false.
- outputLogsPath: (str) Folder path to save training logs. If no path is specified, the logs are not stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage path (S3).
- validationSplit: (float) The proportion of the training dataset to be used as the validation set. The model will be validated against this dataset on each epoch, and this portion will not be used for training. The value should be between 0.0 and 1.0.
- datasetInfo: (str) Descriptive information about the dataset being used.
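Most of these parameters map to setters on the approach. The sketch below shows a few of the optional ones on top of a basic configuration; the setter names are inferred from the parameter names above rather than taken from the example that follows, so treat them as assumptions.
from johnsnowlabs import medical
# Hypothetical configuration sketch with a few optional training parameters
# (setter names inferred from the documented parameter names, values illustrative).
fewshot_assertion_approach = medical.FewShotAssertionClassifierApproach()\
    .setInputCols("assertion_embedding")\
    .setOutputCol("assertion")\
    .setLabelColumn("label")\
    .setFixImbalance(True)\
    .setFeatureScaling("zscore")\
    .setMultiClass(False)\
    .setOutputLogsPath("./training_logs")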
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: ASSERTION
Python API: FewShotAssertionClassifierApproach | Scala API: FewShotAssertionClassifierApproach | Notebook: FewShotAssertionClassifierNotebook |
Show Example
from johnsnowlabs import nlp, medical
from sparknlp_jsl.annotator import TFGraphBuilder
graph_folder = "./tf_graphs"
graph_name = "assertion_graph.pb"
assertion_graph_builder = TFGraphBuilder()\
.setModelName("fewshot_assertion")\
.setInputCols(["assertion_embedding"]) \
.setLabelColumn("label")\
.setGraphFolder(graph_folder)\
.setGraphFile(graph_name)\
.setHiddenUnitsNumber(100)
fewshot_assertion_approach = medical.FewShotAssertionClassifierApproach()\
.setInputCols("assertion_embedding")\
.setOutputCol("assertion")\
.setLabelColumn("label")\
.setBatchSize(32)\
.setDropout(0.1)\
.setLearningRate(0.001)\
.setEpochsNumber(40)\
.setValidationSplit(0.2)\
.setModelFile(f"{graph_folder}/{graph_name}")
clinical_assertion_pipeline = nlp.Pipeline(
stages = [
assertion_graph_builder,
fewshot_assertion_approach
])
assertion_model = clinical_assertion_pipeline.fit(assertion_train_data)
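Note that the training pipeline above contains only the graph builder and the approach, so assertion_train_data is assumed to already hold an assertion_embedding column (for example, produced by the document, NER, converter, and E5 embedding stages shown for the pretrained model) together with a label column of gold assertion statuses. A quick sanity check, as a sketch under that assumption:
# Hypothetical sanity check on the (assumed) training DataFrame columns.
assertion_train_data.select("assertion_embedding", "label").show(3, truncate=80)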
import spark.implicits._
// Defining pipeline stages to extract entities first
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("text", "span"))
.setOutputCols(Array("document", "span_document"))
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val chunker = new Doc2Chunk()
.setInputCols(Array("span_document"))
.setOutputCol("span_chunk")
val assertionConverter = new FewShotAssertionSentenceConverter()
.setInputCols(Array("document", "span_chunk", "token"))
.setOutputCol("assertion_sentence")
val sentenceEmbeddings = MPNetEmbeddings
.pretrained()
.setInputCols(Array("assertion_sentence"))
.setOutputCol("assertion_embedding")
val fewShotAssertionApproach = new FewShotAssertionClassifierApproach()
.setInputCols(Array("assertion_embedding"))
.setOutputCol("prediction")
.setLabelColumn("label")
.setEpochsNumber(10)
.setBatchSize(1)
.setMultiClass(false)
.setLearningRate(0.01f)
val pipeline = new Pipeline().setStages(
Array(
documentAssembler,
tokenizer,
chunker,
assertionConverter,
sentenceEmbeddings,
fewShotAssertionApproach))
val model = pipeline.fit(trainData)
FewShotAssertionSentenceConverter
A utility annotator that combines sentences, tokens, and NER chunks into assertion sentences, which are used to train or run a FewShotAssertionClassifierModel.
Parameters:
- scopeWindow: The scope window of the assertion expression
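A minimal sketch of setting this parameter is shown below; the two-element [tokens-before, tokens-after] format is an assumption mirroring the scope window used by other assertion annotators, and the values are illustrative.
from johnsnowlabs import medical
# Hypothetical scope-window configuration: limit the assertion sentence to a
# window of 9 tokens before and 15 tokens after the target chunk (assumed format).
few_shot_assertion_converter = medical.FewShotAssertionSentenceConverter()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("assertion_sentence")\
    .setScopeWindow([9, 15])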
Input Annotator Types: DOCUMENT, TOKEN, CHUNK
Output Annotator Type: DOCUMENT
Python API: FewShotAssertionSentenceConverter | Scala API: FewShotAssertionSentenceConverter |
Show Example
from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")\
.setSplitChars(["-", "\/"])
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
.setInputCols(["sentence","token"])\
.setOutputCol("embeddings")
# ner_oncology
ner_oncology = medical.NerModel.pretrained("ner_oncology","en","clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner_oncology")
ner_oncology_converter = medical.NerConverterInternal()\
.setInputCols(["sentence","token","ner_oncology"])\
.setOutputCol("ner_chunk")
few_shot_assertion_converter = medical.FewShotAssertionSentenceConverter()\
.setInputCols(["sentence", "token", "ner_chunk"])\
.setOutputCol("assertion_sentence")
e5_embeddings = nlp.E5Embeddings.pretrained("e5_base_v2_embeddings_medical_assertion_oncology", "en", "clinical/models")\
.setInputCols(["assertion_sentence"])\
.setOutputCol("assertion_embedding")
few_shot_assertion_classifier = medical.FewShotAssertionClassifierModel()\
.pretrained("fewhot_assertion_oncology_e5_base_v2_oncology", "en", "clinical/models")\
.setInputCols(["assertion_embedding"])\
.setOutputCol("assertion_fewshot")
assertion_pipeline = nlp.Pipeline(
stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_oncology,
ner_oncology_converter,
few_shot_assertion_converter,
e5_embeddings,
few_shot_assertion_classifier
])
sample_text= [
"""The patient is suspected to have colorectal cancer. Her family history is positive for other cancers.
The result of the biopsy was positive. A CT scan was ordered to rule out metastases."""
]
data = spark.createDataFrame([sample_text]).toDF("text")
result = assertion_pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.assertion_fewshot.metadata,
result.assertion_fewshot.begin,
result.assertion_fewshot.end,
result.assertion_fewshot.result,)).alias("cols")) \
.select(F.expr("cols['0']['ner_chunk']").alias("ner_chunk"),
F.expr("cols['1']").alias("begin"),
F.expr("cols['2']").alias("end"),
F.expr("cols['0']['ner_label']").alias("ner_label"),
F.expr("cols['3']").alias("assertion"),
F.expr("cols['0']['confidence']").alias("confidence") ).show(truncate=False)
## Result
+-----------------+-----+---+----------------+---------+----------+
|ner_chunk |begin|end|ner_label |assertion|confidence|
+-----------------+-----+---+----------------+---------+----------+
|colorectal cancer|33 |49 |Cancer_Dx |Possible |0.5812815 |
|Her |52 |54 |Gender |Present |0.9562998 |
|cancers |93 |99 |Cancer_Dx |Family |0.23465642|
|biopsy |120 |125|Pathology_Test |Past |0.95732147|
|positive |131 |138|Pathology_Result|Present |0.9564386 |
|CT scan |143 |149|Imaging_Test |Past |0.9571699 |
|metastases |175 |184|Metastasis |Possible |0.54986554|
+-----------------+-----+---+----------------+---------+----------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel
.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
.setSplitChars(Array("-", "/"))
val wordEmbeddings = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val nerOncology = MedicalNerModel
.pretrained("ner_oncology", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_oncology")
val nerOncologyConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_oncology"))
.setOutputCol("ner_chunk")
val fewShotAssertionConverter = new FewShotAssertionSentenceConverter()
.setInputCols(Array("sentence", "token", "ner_chunk"))
.setOutputCol("assertion_sentence")
val e5Embeddings = E5Embeddings
.pretrained("e5_base_v2_embeddings_medical_assertion_oncology", "en", "clinical/models")
.setInputCols(Array("assertion_sentence"))
.setOutputCol("assertion_embedding")
val fewShotAssertionClassifier = FewShotAssertionClassifierModel
.pretrained("fewhot_assertion_oncology_e5_base_v2_oncology", "en", "clinical/models")
.setInputCols(Array("assertion_embedding"))
.setOutputCol("assertion_fewshot")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
wordEmbeddings,
nerOncology,
nerOncologyConverter,
fewShotAssertionConverter,
e5Embeddings,
fewShotAssertionClassifier
))
val sampleText = Seq("The patient is suspected to have colorectal cancer. Her family history is positive for other cancers. The result of the biopsy was positive. A CT scan was ordered to rule out metastases.")
val data = sampleText.toDF("text")
val result = pipeline.fit(data).transform(data)
result.show(false)
// Result
+-----------------+-----+---+----------------+---------+----------+
|ner_chunk |begin|end|ner_label |assertion|confidence|
+-----------------+-----+---+----------------+---------+----------+
|colorectal cancer|33 |49 |Cancer_Dx |Possible |0.5812815 |
|Her |52 |54 |Gender |Present |0.9562998 |
|cancers |93 |99 |Cancer_Dx |Family |0.23465642|
|biopsy |120 |125|Pathology_Test |Past |0.95732147|
|positive |131 |138|Pathology_Result|Present |0.9564386 |
|CT scan |143 |149|Imaging_Test |Past |0.9571699 |
|metastases |175 |184|Metastasis |Possible |0.54986554|
+-----------------+-----+---+----------------+---------+----------+
FewShotClassifier
FewShotClassifier annotators specifically target few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data.
These annotators provide a valuable capability for handling scenarios where labeled data is scarce or expensive to obtain. By effectively utilizing limited labeled examples, the few-shot classification approach enables the creation of models that can generalize and classify new instances accurately, even with minimal training data.
The FewShotClassifier is designed to process sentence embeddings as input. It generates category annotations, providing labels along with confidence scores that range from 0 to 1.
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: CATEGORY
Python API: FewShotClassifierModel | Scala API: FewShotClassifierModel |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
bert_sent = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
few_shot_classifier = medical.FewShotClassifierModel.pretrained("few_shot_classifier_age_group_sbiobert_cased_mli", "en", "clinical/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("prediction")
clf_Pipeline = nlp.Pipeline(stages=[
document_assembler,
bert_sent,
few_shot_classifier
])
data = spark.createDataFrame([
["""A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30 years, and was diagnosed with hypertension two years ago. After a detailed physical examination, the doctor found a noticeable wheeze on lung auscultation and prescribed a spirometry test, which showed irreversible airway obstruction. The patient was diagnosed with Chronic obstructive pulmonary disease (COPD) caused by smoking."""],
["""Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, low iron. 5 months and still no answers from drs. """],
["""Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am on reberprozole and librax.My question is whether chronic gastritis is curable or is it a lifetime condition?I am loosing hope because this dull ache is not going away.Please please reply"""]
]).toDF("text")
result = clf_Pipeline.fit(data).transform(data)
# Show results
result.select('prediction.result','text').show(truncate=150)
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| result| text|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| [Adult]|A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30...|
| [Child]|Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, l...|
|[Unknown]|Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am o...|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
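The confidence scores mentioned above are not shown by prediction.result alone. Below is a minimal sketch of inspecting them from the Python result above, assuming (as with other Spark NLP classifiers) that the scores are kept in the annotation metadata:
# Explode the CATEGORY annotations and show each predicted label next to its
# metadata, which is assumed to hold the confidence scores.
result.selectExpr("explode(prediction) as pred") \
      .selectExpr("pred.result as label", "pred.metadata as metadata") \
      .show(truncate=False)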
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val bert_sent = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val few_shot_classifier = FewShotClassifierModel.pretrained("few_shot_classifier_age_group_sbiobert_cased_mli", "en", "clinical/models")
.setInputCols("sentence_embeddings")
.setOutputCol("prediction")
val clf_Pipeline = new Pipeline().setStages(Array(
document_assembler,
bert_sent,
few_shot_classifier))
val data = Seq(
("""A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30 years, and was diagnosed with hypertension two years ago. After a detailed physical examination, the doctor found a noticeable wheeze on lung auscultation and prescribed a spirometry test, which showed irreversible airway obstruction. The patient was diagnosed with Chronic obstructive pulmonary disease (COPD) caused by smoking."""),
("""Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, low iron. 5 months and still no answers from drs. """),
("""Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am on reberprozole and librax.My question is whether chronic gastritis is curable or is it a lifetime condition?I am loosing hope because this dull ache is not going away.Please please reply""")).toDF("text")
val result = clf_Pipeline.fit(data).transform(data)
// Show results
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| result| text|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
| [Adult]|A patient presented with complaints of chest pain and shortness of breath. The medical history revealed the patient had a smoking habit for over 30...|
| [Child]|Hi, wondering if anyone has had a similar situation. My 1 year old daughter has the following; loose stools/ pale stools, elevated liver enzymes, l...|
|[Unknown]|Hi have chronic gastritis from 4 month(confirmed by endoscopy).I do not have acid reflux.Only dull ache above abdomen and left side of chest.I am o...|
+---------+------------------------------------------------------------------------------------------------------------------------------------------------------+
FewShotClassifier annotators specifically target few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data.
These annotators provide a valuable capability for handling scenarios where labeled data is scarce or expensive to obtain. By effectively utilizing limited labeled examples, the few-shot classification approach enables the creation of models that can generalize and classify new instances accurately, even with minimal training data.
The FewShotClassifier is designed to process sentence embeddings as input. It generates category annotations, providing labels along with confidence scores that range from 0 to 1.
Input Annotator Types: SENTENCE EMBEDDINGS
Output Annotator Type: CATEGORY
Python API: FewShotClassifierApproach | Scala API: FewShotClassifierApproach |
Show Example
from johnsnowlabs import nlp, medical
document_asm = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("sentence")
sentence_embeddings = nlp.BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["sentence"])\
.setOutputCol("sentence_embeddings")
graph_builder = medical.TFGraphBuilder()\
.setModelName("fewshot_classifier")\
.setInputCols(["sentence_embeddings"]) \
.setLabelColumn("label")\
.setGraphFolder("/tmp")\
.setGraphFile("log_reg_graph.pb")\
few_shot_approach = medical.FewShotClassifierApproach()\
.setLabelColumn("label")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("prediction")\
.setModelFile(f"/tmp/log_reg_graph.pb")\
.setEpochsNumber(10)\
.setBatchSize(1)\
.setLearningRate(0.001)
pipeline = nlp.Pipeline(
stages=[
document_asm,
sentence_embeddings,
graph_builder,
few_shot_approach
])
model = pipeline.fit(train_data)
import spark.implicits._
val document_asm = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("sentence")
val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")
val few_shot_approach = new FewShotClassifierApproach()
.setLabelColumn("label")
.setInputCols("sentence_embeddings")
.setOutputCol("prediction")
.setModelFile("tmp/log_reg_graph.pb")
.setEpochsNumber(10)
.setBatchSize(1)
.setLearningRate(0.001)
val pipeline = new Pipeline().setStages(Array(
document_asm,
sentence_embeddings,
few_shot_approach ))
val result = pipeline.fit(train_data).transform(test_data).cache()
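Both the Python and Scala snippets above fit the pipeline on a train_data DataFrame that is not shown. A minimal sketch of what such a DataFrame could look like (the texts and label values below are illustrative assumptions; all that is required is a text column plus the column configured via setLabelColumn):
# Hypothetical training data: a "text" column and the "label" column used above.
train_data = spark.createDataFrame([
    ("My 6 month old baby has a persistent cough and a mild fever.", "Child"),
    ("A 45-year-old man presents with chest pain radiating to the left arm.", "Adult")
]).toDF("text", "label")

model = pipeline.fit(train_data)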
Flattener
The Flattener converts annotation results into a format that is easier to use. This annotator produces a DataFrame with flattened and exploded columns containing annotation results, making it easier to interpret and analyze the information.
It is particularly useful for extracting and organizing the results obtained from Spark NLP Pipelines.
Parameters:
- inputCols: Input annotations.
- cleanAnnotations: Whether to remove annotation columns (default: True).
- explodeSelectedFields: Dict of input columns to their corresponding selected fields.
- flattenExplodedColumns: Whether to flatten exploded columns (default: True).
- orderByColumn: Specify the column by which the DataFrame should be ordered.
- orderDescending: Whether to order the DataFrame in descending order (default: True).
- keepOriginalColumns: Array of column names that should be kept in the DataFrame after the flattening process.
See Spark NLP Workshop for more examples of usage.
Input Annotator Types: ANY
Output Annotator Type: NONE
Python API: Flattener | Scala API: Flattener | Notebook: FlattenerNotebook |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner") \
.setLabelCasing("upper")
ner_converter = medical.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk") \
.setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])
flattener = medical.Flattener()\
.setInputCols("ner_chunk") \
.setExplodeSelectedFields({"ner_chunk": ["result as ner_chunks",
"begin as begins",
"end as ends",
"metadata.entity as entities"]})
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
flattener
])
text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
result.show(truncate=False)
# result
+----------------------------------+------+----+-------------------------+
|ner_chunks |begins|ends|entities |
+----------------------------------+------+----+-------------------------+
|distress |49 |56 |SYMPTOM |
|arcus senilis |196 |208 |DISEASE_SYNDROME_DISORDER|
|jugular venous pressure distention|380 |413 |SYMPTOM |
|adenopathy |428 |437 |SYMPTOM |
|tender |514 |519 |SYMPTOM |
|fullness |540 |547 |SYMPTOM |
|edema |665 |669 |SYMPTOM |
|cyanosis |679 |686 |VS_FINDING |
|clubbing |692 |699 |SYMPTOM |
+----------------------------------+------+----+-------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinicalNer = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
.setLabelCasing("upper")
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("SYMPTOM", "VS_FINDING", "DISEASE_SYNDROME_DISORDER", "ADMISSION_DISCHARGE", "PROCEDURE"))
val flattener = new Flattener()
.setInputCols(Array("ner_chunk"))
.setExplodeSelectedFields(Map("ner_chunk" -> Array("result", "begin", "end", "metadata.entity")))
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
wordEmbeddings,
clinicalNer,
nerConverter,
flattener
))
val text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
"""
val data = Seq(text).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
// result
+----------------------------------+------+----+-------------------------+
|ner_chunks |begins|ends|entities |
+----------------------------------+------+----+-------------------------+
|distress |49 |56 |SYMPTOM |
|arcus senilis |196 |208 |DISEASE_SYNDROME_DISORDER|
|jugular venous pressure distention|380 |413 |SYMPTOM |
|adenopathy |428 |437 |SYMPTOM |
|tender |514 |519 |SYMPTOM |
|fullness |540 |547 |SYMPTOM |
|edema |665 |669 |SYMPTOM |
|cyanosis |679 |686 |VS_FINDING |
|clubbing |692 |699 |SYMPTOM |
+----------------------------------+------+----+-------------------------+
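The examples above only use explodeSelectedFields. Below is a minimal sketch of combining the remaining parameters listed above on the same pipeline; the setter names follow the usual set&lt;ParameterName&gt; convention and should be treated as an assumption rather than a verified signature:
# Hypothetical Flattener configuration exercising ordering and column retention.
flattener_ordered = medical.Flattener() \
    .setInputCols("ner_chunk") \
    .setExplodeSelectedFields({"ner_chunk": ["result as ner_chunks",
                                             "begin as begins",
                                             "metadata.entity as entities"]}) \
    .setOrderByColumn("begins") \
    .setOrderDescending(False) \
    .setKeepOriginalColumns(["text"]) \
    .setCleanAnnotations(False)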
GenericClassifier
Creates a generic single-label classifier which uses pre-generated TensorFlow graphs. The model operates on FEATURE_VECTOR annotations, which must be produced using the FeaturesAssembler.
Parameters:
- multiClass (Boolean): Whether to return all classes or only the one with the highest score (default: False)
- datasetInfo (Str): Descriptive information about the dataset being used.
Input Annotator Types: FEATURE_VECTOR
Output Annotator Type: CATEGORY
Python API: GenericClassifierModel | Scala API: GenericClassifierModel | Notebook: GenericClassifierModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", 'en','clinical/models')\
.setInputCols(["document"])\
.setOutputCol("sentence_embeddings")
features_asm = medical.FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier = medical.GenericClassifierModel.pretrained("genericclassifier_sdoh_economics_binary_sbiobert_cased_mli", 'en', 'clinical/models')\
.setInputCols(["features"])\
.setOutputCol("classes")
pipeline = nlp.Pipeline(
stages=[
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier
])
text = """Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker, 1PPD for 25years. He drinks to beers/night, but has not had any alcohol in past 4 days. No IVDU."""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
result.select("text", "classes.result").show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|text |result|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker, 1PPD for 25years. He drinks to beers/night, but has not had any alcohol in past 4 days. No IVDU.|[True]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence_embeddings")
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("features")
val generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_economics_binary_sbiobert_cased_mli","en","clinical/models")
.setInputCols(Array("features"))
.setOutputCol("classes")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_embeddings,
features_asm,
generic_classifier ))
val text = "Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker,1PPD for 25years. He drinks to beers/night,but has not had any alcohol in past 4 days. No IVDU."
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df).transform(df)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|text |result|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
|Patient works as a building inspector and remodeler. Married with 2 children. He is a current smoker, 1PPD for 25years. He drinks to beers/night, but has not had any alcohol in past 4 days. No IVDU.|[True]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+
Trains a TensorFlow model for generic classification of feature vectors. It takes FEATURE_VECTOR annotations from FeaturesAssembler as input, classifies them, and outputs CATEGORY annotations.
Please see the Parameters section for required training parameters.
Parameters:
- batchSize: (int) Batch size
- dropout: (float) Dropout coefficient
- epochsN: (int) Maximum number of epochs to train
- featureScaling: (str) Feature scaling method. Possible values are 'zscore', 'minmax' or empty (no scaling)
- fixImbalance: (boolean) Fix the imbalance in the training set by replicating examples of under-represented categories
- labelColumn: (str) Column with the label per each document
- learningRate: (float) Learning rate
- modelFile: (str) Location of the file of the model used for classification
- multiClass: (boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.
- outputLogsPath: (str) Folder path to save training logs. If no path is specified, the logs won't be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or cloud storage (S3).
- validationSplit: (float) The proportion of the training dataset to be used as the validation set. The model will be validated against this dataset on each epoch and it will not be used for training. The value should be between 0.0 and 1.0.
- datasetInfo: (str) Descriptive information about the dataset being used.
For a more extensive example please see the Spark NLP Workshop.
Input Annotator Types: FEATURE_VECTOR
Output Annotator Type: CATEGORY
Python API: GenericClassifierApproach | Scala API: GenericClassifierApproach | Notebook: GenericClassifierApproachNotebook |
Show Example
from johnsnowlabs import nlp, medical
features_asm = medical.FeaturesAssembler() \
.setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
.setOutputCol("features")
gen_clf = medical.GenericClassifierApproach() \
.setLabelColumn("target") \
.setInputCols(["features"]) \
.setOutputCol("prediction") \
.setModelFile("/path/to/graph_file.pb") \
.setEpochsNumber(50) \
.setBatchSize(100) \
.setFeatureScaling("zscore") \
.setlearningRate(0.001) \
.setFixImbalance(True) \
.setOutputLogsPath("logs") \
.setValidationSplit(0.2) # keep 20% of the data for validation purposes
pipeline = nlp.Pipeline().setStages([
features_asm,
gen_clf
])
clf_model = pipeline.fit(data)
from johnsnowlabs import nlp, finance
features_asm = finance.FeaturesAssembler() \
.setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
.setOutputCol("features")
gen_clf = finance.GenericClassifierApproach() \
.setLabelColumn("target") \
.setInputCols(["features"]) \
.setOutputCol("prediction") \
.setModelFile("/path/to/graph_file.pb") \
.setEpochsNumber(50) \
.setBatchSize(100) \
.setFeatureScaling("zscore") \
.setlearningRate(0.001) \
.setFixImbalance(True) \
.setOutputLogsPath("logs") \
.setValidationSplit(0.2) # keep 20% of the data for validation purposes
pipeline = nlp.Pipeline().setStages([
features_asm,
gen_clf
])
clf_model = pipeline.fit(data)
from johnsnowlabs import nlp, legal
features_asm = legal.FeaturesAssembler() \
.setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
.setOutputCol("features")
gen_clf = legal.GenericClassifierApproach() \
.setLabelColumn("target") \
.setInputCols(["features"]) \
.setOutputCol("prediction") \
.setModelFile("/path/to/graph_file.pb") \
.setEpochsNumber(50) \
.setBatchSize(100) \
.setFeatureScaling("zscore") \
.setlearningRate(0.001) \
.setFixImbalance(True) \
.setOutputLogsPath("logs") \
.setValidationSplit(0.2) # keep 20% of the data for validation purposes
pipeline = nlp.Pipeline().setStages([
features_asm,
gen_clf
])
clf_model = pipeline.fit(data)
import spark.implicits._
val features_asm = new FeaturesAssembler()
.setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
.setOutputCol("features")
val gen_clf = new GenericClassifierApproach()
.setLabelColumn("target")
.setInputCols("features")
.setOutputCol("prediction")
.setModelFile("/path/to/graph_file.pb")
.setEpochsNumber(50)
.setBatchSize(100)
.setFeatureScaling("zscore")
.setlearningRate(0.001f)
.setFixImbalance(true)
.setOutputLogsPath("logs")
.setValidationSplit(0.2f) // keep 20% of the data for validation purposes
val pipeline = new Pipeline().setStages(Array(
features_asm,
gen_clf
))
val clf_model = pipeline.fit(data)
import spark.implicits._
val features_asm = new FeaturesAssembler()
.setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
.setOutputCol("features")
val gen_clf = new GenericClassifierApproach()
.setLabelColumn("target")
.setInputCols("features")
.setOutputCol("prediction")
.setModelFile("/path/to/graph_file.pb")
.setEpochsNumber(50)
.setBatchSize(100)
.setFeatureScaling("zscore")
.setlearningRate(0.001f)
.setFixImbalance(true)
.setOutputLogsPath("logs")
.setValidationSplit(0.2f) // keep 20% of the data for validation purposes
val pipeline = new Pipeline().setStages(Array(
features_asm,
gen_clf
))
val clf_model = pipeline.fit(data)
import spark.implicits._
val features_asm = new FeaturesAssembler()
.setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
.setOutputCol("features")
val gen_clf = new GenericClassifierApproach()
.setLabelColumn("target")
.setInputCols("features")
.setOutputCol("prediction")
.setModelFile("/path/to/graph_file.pb")
.setEpochsNumber(50)
.setBatchSize(100)
.setFeatureScaling("zscore")
.setlearningRate(0.001f)
.setFixImbalance(true)
.setOutputLogsPath("logs")
.setValidationSplit(0.2f) // keep 20% of the data for validation purposes
val pipeline = new Pipeline().setStages(Array(
features_asm,
gen_clf
))
val clf_model = pipeline.fit(data)
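In all of the snippets above, feature_1 … feature_n are placeholders for whatever columns the FeaturesAssembler turns into a FEATURE_VECTOR; plain numeric columns are one option. A minimal, self-contained sketch under that assumption (the toy feature values and target labels below are made up):
# Toy training data: numeric feature columns plus a string label column ("target").
data = spark.createDataFrame([
    (1.0, 0.2, 3.5, "classA"),
    (0.1, 1.7, 0.4, "classB"),
    (0.9, 0.3, 2.8, "classA")
]).toDF("feature_1", "feature_2", "feature_3", "target")

features_asm = medical.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "feature_3"]) \
    .setOutputCol("features")

# gen_clf configured as in the example above (same label column "target")
clf_model = nlp.Pipeline().setStages([features_asm, gen_clf]).fit(data)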
GenericLogRegClassifier
GenericLogRegClassifier is a derivative of GenericClassifier which implements a multinomial logistic regression. This is a single-layer neural network with the logistic function at the output. The input to the model is FeatureVector and the output is category annotations with labels and corresponding confidence scores varying between 0 and 1.
Parameters:
- LabelColumn: Sets the name of the column in your input data that contains the labels (categories) for the classification task. The classifier will use this column to learn from the data and make predictions.
- ModelFile: The path to the pre-trained model file for the logistic regression classifier. It should be a protobuf file containing the model graph and trained weights.
- EpochsNumber: The number of epochs (iterations) the classifier will go through during the training process. An epoch represents one complete pass through the entire training dataset.
- BatchSize: The batch size used during training. The training data is divided into batches, and the model's weights are updated after processing each batch. A larger batch size may speed up training, but it requires more memory.
- LearningRate: The learning rate for the optimization algorithm used during training. The learning rate determines how much the model's weights are updated based on the computed gradients. A higher learning rate may lead to faster convergence but risks overshooting the optimal solution.
- OutputLogsPath: The path where logs related to the training process will be stored. These logs can include information such as training loss, accuracy, and other metrics.
- Dropout: Dropout is a regularization technique used to prevent overfitting in neural networks. This parameter sets the dropout rate, which determines the probability that each neuron's output will be temporarily ignored during training.
- FixImbalance: Imbalance refers to the situation where some classes have significantly more training examples than others. Setting this parameter to True makes the classifier handle class imbalance during training, which helps to ensure that the model doesn't become biased towards the majority class.
- ValidationSplit: Commented out in the example below; if set to a value between 0 and 1, it specifies the fraction of the training data to be used for validation during the training process. The remaining data is used for actual training.
Input Annotator Types: FEATURE_VECTOR
Output Annotator Type: CATEGORY
Python API: GenericLogRegClassifierApproach | Scala API: GenericLogRegClassifierApproach | Notebook: GenericLogRegClassifierApproachNotebook |
Show Example
from johnsnowlabs import nlp, medical
features_asm = medical.FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("feature_vector")
graph_folder = "gc_graph"
gc_logreg_graph_builder = medical.TFGraphBuilder()\
.setModelName("logreg_classifier")\
.setInputCols(["feature_vector"]) \
.setLabelColumn("category")\
.setGraphFolder(graph_folder)\
.setGraphFile("log_reg_graph.pb")
gen_clf = medical.GenericLogRegClassifierApproach()\
.setLabelColumn("category")\
.setInputCols("feature_vector")\
.setOutputCol("prediction")\
.setModelFile(f"{graph_folder}/log_reg_graph.pb")\
.setEpochsNumber(20)\
.setBatchSize(128)\
.setLearningRate(0.01)\
.setOutputLogsPath("logs")\
.setDropout(0.1)\
.setFixImbalance(True)
# .setValidationSplit(0.1)
clf_Pipeline = nlp.Pipeline(stages=[
features_asm,
gc_logreg_graph_builder,
gen_clf])
import spark.implicits._
val features_asm = new FeaturesAssembler()
.setInputCols("sentence_embeddings")
.setOutputCol("feature_vector")
val gc_logreg_graph_builder = new TFGraphBuilder()
.setModelName("logreg_classifier")
.setInputCols("feature_vector")
.setLabelColumn("category")
.setGraphFolder("gc_graph")
.setGraphFile("log_reg_graph.pb")
val gen_clf = new GenericLogRegClassifierApproach()
.setLabelColumn("category")
.setInputCols("feature_vector")
.setOutputCol("prediction")
.setModelFile("gc_graph/log_reg_graph.pb")
.setEpochsNumber(20)
.setBatchSize(128)
.setLearningRate(0.01)
.setOutputLogsPath("logs")
.setDropout(0.1)
.setFixImbalance(true) // .setValidationSplit(0.1)
val clf_Pipeline = new Pipeline().setStages(Array(features_asm, gc_logreg_graph_builder, gen_clf))
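Neither snippet above shows the training data. Because features_asm reads a sentence_embeddings column, the DataFrame passed to clf_Pipeline.fit must already contain sentence embeddings together with the category label. A minimal sketch of preparing such data (the embedding model mirrors the other examples on this page; the texts and category values are made up):
# Hypothetical preparation step: compute sentence embeddings once, keep them with
# the "category" label, then fit the classifier pipeline defined above.
document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_embeddings = nlp.BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

emb_pipeline = nlp.Pipeline(stages=[document_assembler, sentence_embeddings])

raw_data = spark.createDataFrame([
    ("The patient developed a rash after starting the new medication.", "ADE"),
    ("Routine follow-up visit, no complaints reported.", "noADE")
]).toDF("text", "category")

train_data = emb_pipeline.fit(raw_data).transform(raw_data)\
    .select("text", "category", "sentence_embeddings")

model = clf_Pipeline.fit(train_data)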
GenericREModel
An instantiated RelationExtractionModel for extracting relationships between any entities. This class is not intended to be used directly; instead, load the pretrained generic_re model through RelationExtractionModel, as shown in the example below. Pairs of entities to consider should be specified using setRelationPairs.
Input Annotator Types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY
Output Annotator Type: CATEGORY
Python API: RelationExtractionModel | Scala API: RelationExtractionModel | Notebook: RelationExtractionModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
documenter = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = nlp.WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = nlp.PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
ner_tagger = medical.NerModel()\
.pretrained("ner_posology", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_chunker = medical.NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = nlp.DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
reModel = medical.RelationExtractionModel()\
.pretrained("generic_re")\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setRelationPairs(["problem-test",
"problem-treatment"])\
.setMaxSyntacticDistance(4)
pipeline = nlp.Pipeline(stages=[
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_chunker,
dependency_parser,
reModel
])
text = """
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to
presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis
three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index
( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and
vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a
respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin
and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical
examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was
benign with no tenderness , guarding , or rigidity .
"""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
# Show results
result.select(F.explode(F.arrays_zip(
result.relations.result,
result.relations.metadata)).alias("cols"))\
.select(
F.expr("cols['1']['chunk1']").alias("chunk1"),
F.expr("cols['1']['chunk2']").alias("chunk2"),
F.expr("cols['1']['entity1']").alias("entity1"),
F.expr("cols['1']['entity2']").alias("entity2"),
F.expr("cols['0']").alias("relations"),
F.expr("cols['1']['confidence']").alias("confidence")).show(5, truncate=False)
+-----------------+-------------+-------+-------+------------+----------+
|chunk1 |chunk2 |entity1|entity2|relations |confidence|
+-----------------+-------------+-------+-------+------------+----------+
|obesity |BMI |PROBLEM|TEST |PROBLEM-TEST|1.0 |
|a body mass index|BMI |PROBLEM|TEST |PROBLEM-TEST|1.0 |
|BMI |polyuria |TEST |PROBLEM|TEST-PROBLEM|1.0 |
|BMI |polydipsia |TEST |PROBLEM|TEST-PROBLEM|1.0 |
|BMI |poor appetite|TEST |PROBLEM|TEST-PROBLEM|1.0 |
+-----------------+-------------+-------+-------+------------+----------+
import spark.implicits._
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentences","tokens"))
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel.pretrained("pos_clinical","en","clinical/models")
.setInputCols(Array("sentences","tokens"))
.setOutputCol("pos_tags")
val ner_tagger = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
.setInputCols("sentences","tokens","embeddings")
.setOutputCol("ner_tags")
val ner_chunker = new NerConverterInternal()
.setInputCols(Array("sentences","tokens","ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu","en")
.setInputCols(Array("sentences","pos_tags","tokens"))
.setOutputCol("dependencies")
val reModel = RelationExtractionModel.pretrained("generic_re")
.setInputCols(Array("embeddings","pos_tags","ner_chunks","dependencies"))
.setOutputCol("relations")
.setRelationPairs(Array("problem-test","problem-treatment"))
.setMaxSyntacticDistance(4)
val pipeline = new Pipeline().setStages(Array(
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_chunker,
dependency_parser,
reModel ))
val text = "A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to " +
"presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis " +
"three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index " +
"( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and " +
"vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a " +
"respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin " +
"and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation . Physical " +
"examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was " +
"benign with no tenderness , guarding , or rigidity."
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df) .transform(df)
// Show results
+-----------------+-------------+-------+-------+------------+----------+
|chunk1 |chunk2 |entity1|entity2|relations |confidence|
+-----------------+-------------+-------+-------+------------+----------+
|obesity |BMI |PROBLEM|TEST |PROBLEM-TEST|1.0 |
|a body mass index|BMI |PROBLEM|TEST |PROBLEM-TEST|1.0 |
|BMI |polyuria |TEST |PROBLEM|TEST-PROBLEM|1.0 |
|BMI |polydipsia |TEST |PROBLEM|TEST-PROBLEM|1.0 |
|BMI |poor appetite|TEST |PROBLEM|TEST-PROBLEM|1.0 |
+-----------------+-------------+-------+-------+------------+----------+
GenericSVMClassifier
Creates a generic single-label classifier which uses pre-generated TensorFlow graphs. The model operates on FEATURE_VECTOR annotations, which must be produced using the FeaturesAssembler.
Parameters:
- featureScaling: Feature scaling method. Possible values are 'zscore', 'minmax' or empty (no scaling) (default: '')
- multiClass: Whether to return only the label with the highest confidence score or all labels (default: False)
- inputCols: Previous annotation columns, if renamed (default: ['features'])
- outputCol: Output annotation column; can be left at its default (default: class)
Input Annotator Types: FEATURE_VECTOR
Output Annotator Type: CATEGORY
Python API: GenericSVMClassifier | Scala API: GenericSVMClassifier | Notebook: GenericSVMClassifierNotebook |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
.setInputCols(["document", "word_embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
features_asm = medical.FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("features")
generic_classifier = medical.GenericSVMClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")\
.setInputCols(["features"])\
.setOutputCol("class")
clf_Pipeline = nlp.Pipeline(stages=[
document_assembler,
tokenizer,
word_embeddings,
sentence_embeddings,
features_asm,
generic_classifier])
data = spark.createDataFrame([["""None of the patients required treatment for the overdose."""],
["""I feel a bit drowsy & have a little blurred vision after taking an insulin"""]]).toDF("text")
result = clf_Pipeline.fit(data).transform(df) # "df" is a larger sample DataFrame (not shown); the results below come from it
+----------------------------------------------------------------------------------------------------+-------+
| text| result|
+----------------------------------------------------------------------------------------------------+-------+
| Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False]|
| Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...|[False]|
|However, given the clinically significant result to the interaction between tolazoline and cimeti...| [True]|
| How much do novel antipsychotics benefit the patients?|[False]|
| We hypothesize that during interferon therapy, melanocytes may produce more melanin pigment in t...|[False]|
|They seemed to involve multiple aetiological factors, such as autoimmune thyroid disease, the tox...|[False]|
| Two days after completing this regimen, the patient developed a rash with blistering.| [True]|
|A diagnosis of masked theophylline poisoning should be considered in similar situations involving...|[False]|
| The overall response rate of these 24 refractory lymphomas to gemcitabine-containing regimens wa...|[False]|
|Development of sarcoidosis during interferon alpha 2b and ribavirin combination therapy for chron...| [True]|
|A patient with coccidioidal meningitis was treated with intrathecally administered amphotericin B...|[False]|
| Renal failure associated with the use of dextran-40.|[False]|
| However, with increased experience in applying BCG, the side effects now appear to be less promi...|[False]|
| Hepatotoxicity after high-dose methylprednisolone for demyelinating disease.| [True]|
| Histopathological findings included signs of orthokeratotic hyperkeratosis, moderate follicular ...| [True]|
| Acute spontaneous TLS is rare, and it has been described in leukemia and lymphoma and in some pa...|[False]|
|We present a fatal case of subacute methanol toxicity with associated diffuse brain involvement, ...| [True]|
| The reaction was thought to be triggered by the combination of radiation and epidermal growth fa...|[False]|
+----------------------------------------------------------------------------------------------------+-------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
val sentenceEmbeddings = new SentenceEmbeddings()
.setInputCols(Array("document", "word_embeddings"))
.setOutputCol("sentence_embeddings")
.setPoolingStrategy("AVERAGE")
val featuresAssembler = new FeaturesAssembler()
.setInputCols(Array("sentence_embeddings"))
.setOutputCol("features")
val genericClassifier = GenericSVMClassifierModel.pretrained("generic_svm_classifier_ade", "en", "clinical/models")
.setInputCols("features")
.setOutputCol("class")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
wordEmbeddings,
sentenceEmbeddings,
featuresAssembler,
genericClassifier))
val data = Seq(
("""None of the patients required treatment for the overdose."""),
("""I feel a bit drowsy & have a little blurred vision after taking an insulin""")
)
val df = data.toDF("text")
val result = pipeline.fit(df).transform(df)
+----------------------------------------------------------------------------------------------------+-------+
| text| result|
+----------------------------------------------------------------------------------------------------+-------+
| Multicentric canine lymphoma in a 12-year-old keeshond: chemotherapy options.|[False]|
| Pyomyositis is a rare disease, encountered mainly in tropical climates.|[False]|
| Both patients subsequently developed markedly elevated EBV-DNA titers in association with monocl...|[False]|
|Bortezomib-induced paralytic ileus is a potential gastrointestinal side effect of this first-in-c...|[False]|
|However, given the clinically significant result to the interaction between tolazoline and cimeti...| [True]|
| How much do novel antipsychotics benefit the patients?|[False]|
| We hypothesize that during interferon therapy, melanocytes may produce more melanin pigment in t...|[False]|
|They seemed to involve multiple aetiological factors, such as autoimmune thyroid disease, the tox...|[False]|
| Two days after completing this regimen, the patient developed a rash with blistering.| [True]|
|A diagnosis of masked theophylline poisoning should be considered in similar situations involving...|[False]|
| The overall response rate of these 24 refractory lymphomas to gemcitabine-containing regimens wa...|[False]|
|Development of sarcoidosis during interferon alpha 2b and ribavirin combination therapy for chron...| [True]|
|A patient with coccidioidal meningitis was treated with intrathecally administered amphotericin B...|[False]|
| Renal failure associated with the use of dextran-40.|[False]|
| However, with increased experience in applying BCG, the side effects now appear to be less promi...|[False]|
| Hepatotoxicity after high-dose methylprednisolone for demyelinating disease.| [True]|
| Histopathological findings included signs of orthokeratotic hyperkeratosis, moderate follicular ...| [True]|
| Acute spontaneous TLS is rare, and it has been described in leukemia and lymphoma and in some pa...|[False]|
|We present a fatal case of subacute methanol toxicity with associated diffuse brain involvement, ...| [True]|
| The reaction was thought to be triggered by the combination of radiation and epidermal growth fa...|[False]|
+----------------------------------------------------------------------------------------------------+-------+
GenericSVMClassifier is a derivative of GenericClassifier which implements SVM (Support Vector Machine) classification. The input to the model is FeatureVector and the output is category annotations with labels and corresponding confidence scores. The scores are standardized using the logistic function so that they vary between 0 and 1.
Parameters:
- batchSize: (int) Batch size
- dropout: (float) Dropout coefficient
- epochsNumber: (int) Maximum number of epochs to train
- featureScaling: (str) Feature scaling method. Possible values are 'zscore', 'minmax' or empty (no scaling)
- fixImbalance: (boolean) Fix the imbalance in the training set by replicating examples of under-represented categories
- labelColumn: (str) Column with the label per each document
- learningRate: (float) Learning rate
- modelFile: (str) Location of the file of the model used for classification
- multiClass: (boolean) If multiClass is set, the model will return all the labels with corresponding scores. By default, multiClass is false.
- outputLogsPath: (str) Folder path to save training logs. If no path is specified, the logs won't be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or cloud storage (S3).
- validationSplit: (float) The proportion of the training dataset to be used as the validation set. The model will be validated against this dataset on each epoch and it will not be used for training. The value should be between 0.0 and 1.0.
Input Annotator Types: FEATURE_VECTOR
Output Annotator Type: CATEGORY
Python API: GenericSVMClassifier | Scala API: GenericSVMClassifier | Notebook: GenericSVMClassifierNotebook |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d","en","clinical/models")\
.setInputCols(["document","token"])\
.setOutputCol("word_embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
.setInputCols(["document", "word_embeddings"]) \
.setOutputCol("sentence_embeddings") \
.setPoolingStrategy("AVERAGE")
embeddings_pipeline = nlp.Pipeline(stages = [
document_assembler,
tokenizer,
word_embeddings,
sentence_embeddings,
])
trainingData_with_embeddings = embeddings_pipeline.fit(trainingData).transform(trainingData)
trainingData_with_embeddings = trainingData_with_embeddings.select("text","category","sentence_embeddings")
graph_folder = "graph_folder"
gc_svm_graph_builder = medical.TFGraphBuilder()\
.setModelName("svm_classifier")\
.setInputCols(["feature_vector"]) \
.setLabelColumn("category")\
.setGraphFolder(graph_folder)\
.setGraphFile("svm_graph.pb")
features_asm = medical.FeaturesAssembler()\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("feature_vector")
gen_clf = medical.GenericSVMClassifierApproach()\
.setLabelColumn("category")\
.setInputCols("feature_vector")\
.setOutputCol("prediction")\
.setModelFile(f"{graph_folder}/svm_graph.pb")\
.setEpochsNumber(2)\
.setBatchSize(128)\
.setLearningRate(0.015)\
.setOutputLogsPath("logs")\
.setDropout(0.1)\
.setFixImbalance(True)
# .setValidationSplit(0.1)
clf_Pipeline = nlp.Pipeline(stages=[
features_asm,
gc_svm_graph_builder,
gen_clf])
model = clf_Pipeline.fit(trainingData_with_embeddings)
model.stages[-1].write().overwrite().save('/model_path/model_name')
#sample training data
text category
0 Clioquinol intoxication occurring in the trea... neg
1 "Retinoic acid syndrome" was prevented with s... neg
2 BACKGROUND: External beam radiation therapy o... neg
3 Although the enuresis ceased, she developed t... neg
4 A 42-year-old woman had uneventful bilateral ... neg
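The Python example above saves the trained classifier stage to disk. A minimal sketch of reusing it for inference, assuming standard Spark ML persistence (i.e. that GenericSVMClassifierModel.load is available) and reusing the embedding stages defined above:
# Hypothetical inference pipeline that reloads the saved classifier stage.
loaded_clf = medical.GenericSVMClassifierModel.load("/model_path/model_name")

inference_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings,
    features_asm,
    loaded_clf])

predictions = inference_pipeline.fit(trainingData).transform(trainingData)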
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceEmbeddings = BertSentenceEmbeddings
.pretrained()
.setInputCols(Array("document"))
.setOutputCol("sentence_embedding")
val featuresAssembler = new FeaturesAssembler()
.setInputCols(Array("sentence_embedding"))
.setOutputCol("feature_vector")
val svmClassifier = new GenericSVMClassifierApproach()
.setInputCols("feature_vector")
.setOutputCol("prediction")
.setLabelColumn("label")
.setModelFile("src/test/resources/classification/svm_graph.pb")
.setEpochsNumber(10)
.setBatchSize(1)
.setMultiClass(false)
.setlearningRate(0.01f)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceEmbeddings,
featuresAssembler,
svmClassifier,
))
val model = pipeline.fit(trainingData)
IOBTagger
The IOBTagger merges chunk-based annotations (for example the output of NerConverter or ChunkMerge) with the corresponding tokens and converts them back into token-level Named Entity Recognition (NER) tags in IOB format. This makes it possible to recover a tag for every token from entities identified at the chunk level; for example, the output columns of NerConverter and Tokenizer can be used together as inputs to be merged.
Input Annotator Types: TOKEN, CHUNK
Output Annotator Type: NAMED_ENTITY
Python API: IOBTagger | Scala API: IOBTagger | Notebook: IOBTaggerNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Pipeline stages are defined where NER is done. NER is converted to chunks.
docAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embs")
nerModel = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embs"])\
.setOutputCol("ner")
nerConverter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = medical.IOBTagger()\
.setInputCols(["token", "ner_chunk"])\
.setOutputCol("ner_label")
pipeline = nlp.Pipeline(stages=[docAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
iobTagger])
text = "The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO."
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
# chunk level result
result.selectExpr("explode(ner_chunk) as a") \
.selectExpr("a.begin",
"a.end",
"a.result as ner_chunk",
"a.metadata.entity as ner_label").show(50, False)
+-----+---+-------------------------------------------+---------+
|begin|end|ner_chunk |ner_label|
+-----+---+-------------------------------------------+---------+
|27 |50 |1 capsule of Advil 10 mg |DRUG |
|52 |61 |for 5 days |DURATION |
|67 |109|magnesium hydroxide 100mg/1ml suspension PO|DRUG |
+-----+---+-------------------------------------------+---------+
# token level result
result.selectExpr("explode(ner_label) as a") \
.selectExpr("a.begin",
"a.end",
"a.metadata.word as word",
"a.result as chunk").show(50, False)
+-----+---+----------+----------+
|begin|end|word |chunk |
+-----+---+----------+----------+
|0 |2 |The |O |
|4 |10 |patient |O |
|12 |14 |was |O |
|16 |25 |prescribed|O |
|27 |27 |1 |B-DRUG |
|29 |35 |capsule |I-DRUG |
|37 |38 |of |I-DRUG |
|40 |44 |Advil |I-DRUG |
|46 |47 |10 |I-DRUG |
|49 |50 |mg |I-DRUG |
|52 |54 |for |B-DURATION|
|56 |56 |5 |I-DURATION|
|58 |61 |days |I-DURATION|
|63 |65 |and |O |
|67 |75 |magnesium |B-DRUG |
|77 |85 |hydroxide |I-DRUG |
|87 |95 |100mg/1ml |I-DRUG |
|97 |106|suspension|I-DRUG |
|108 |109|PO |I-DRUG |
|110 |110|. |O |
+-----+---+----------+----------+
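A common follow-up to the token-level output above is exporting it as CoNLL-style "word tag" lines, e.g. to bootstrap training data for an NER model. A minimal sketch using the columns produced by the pipeline above (the output path and the use of plain Spark functions are assumptions, not a dedicated IOBTagger API):
import pyspark.sql.functions as F

# Build CoNLL-style "word tag" lines from the IOBTagger output ("ner_label" column),
# ordered by token position, and write them to a text file.
conll_lines = result.selectExpr("explode(ner_label) as a") \
    .orderBy("a.begin") \
    .select(F.concat_ws(" ", F.expr("a.metadata.word"), F.expr("a.result")).alias("line"))

conll_lines.write.mode("overwrite").text("/tmp/iob_tags")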
from johnsnowlabs import nlp, finance
# Pipeline stages are defined where NER is done. NER is converted to chunks.
docAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"])\
.setOutputCol("embs")
nerModel = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embs"])\
.setOutputCol("ner")
nerConverter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = finance.IOBTagger()\
.setInputCols(["token", "ner_chunk"])\
.setOutputCol("ner_label")
pipeline = nlp.Pipeline(stages=[docAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
iobTagger])
text = """In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security"). More specifically, their Compliance product - Policy Compliance (PC)")."""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
# chunk level result
result.selectExpr("explode(ner_chunk) as a") \
.selectExpr("a.begin",
"a.end",
"a.result as ner_chunk",
"a.metadata.entity as ner_label").show(50, False)
+-----+---+------------------------------+---------+
|begin|end|ner_chunk |ner_label|
+-----+---+------------------------------+---------+
|39 |68 |Spell Security Private Limited|ORG |
|86 |99 |Spell Security |ALIAS |
|129 |138|Compliance |PRODUCT |
|150 |166|Policy Compliance |PRODUCT |
|169 |170|PC |ALIAS |
+-----+---+------------------------------+---------+
# token level result
result.selectExpr("explode(ner_label) as a") \
.selectExpr("a.begin",
"a.end",
"a.metadata.word as word",
"a.result as chunk").show(50, False)
+-----+---+------------+---------+
|begin|end|word |chunk |
+-----+---+------------+---------+
|0 |1 |In |O |
|3 |6 |2020 |O |
|7 |7 |, |O |
|9 |10 |we |O |
|12 |19 |acquired |O |
|21 |27 |certain |O |
|29 |34 |assets |O |
|36 |37 |of |O |
|39 |43 |Spell |B-ORG |
|45 |52 |Security |I-ORG |
|54 |60 |Private |I-ORG |
|62 |68 |Limited |I-ORG |
|70 |70 |( |O |
|71 |74 |also |O |
|76 |80 |known |O |
|82 |83 |as |O |
|85 |85 |" |O |
|86 |90 |Spell |B-ALIAS |
|92 |99 |Security |I-ALIAS |
|100 |102|"). |0 |
|104 |107|More |0 |
|109 |120|specifically|0 |
|121 |121|, |0 |
|123 |127|their |0 |
|129 |138|Compliance |B-PRODUCT|
|140 |146|product |O |
|148 |148|- |O |
|150 |155|Policy |B-PRODUCT|
|157 |166|Compliance |I-PRODUCT|
|168 |168|( |O |
|169 |170|PC |B-ALIAS |
|171 |174|)"). |O |
+-----+---+------------+---------+
from johnsnowlabs import nlp, legal
# Pipeline stages are defined where NER is done. NER is converted to chunks.
docAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embs")
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
.setInputCols(["sentence", "token", "embs"])\
.setOutputCol("ner")
nerConverter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = legal.IOBTagger()\
.setInputCols(["token", "ner_chunk"])\
.setOutputCol("ner_label")
pipeline = nlp.Pipeline(stages=[docAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
nerConverter,
iobTagger])
text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties").
"""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
# chunk level result
result.selectExpr("explode(ner_chunk) as a") \
.selectExpr("a.begin",
"a.end",
"a.result as ner_chunk",
"a.metadata.entity as ner_label").show(50, False)
+-----+---+-----------------------------------+---------+
|begin|end|ner_chunk |ner_label|
+-----+---+-----------------------------------+---------+
|141 |165|Armstrong Flooring, Inc., |ORG |
|192 |197|Seller |ALIAS |
|205 |221|AFI Licensing LLC |ORG |
|263 |271|Licensing |ALIAS |
|292 |297|Seller |ALIAS |
|301 |307|Arizona |ALIAS |
|315 |330|AHF Holding, Inc |ORG |
|399 |403|Buyer |ALIAS |
|411 |445|Armstrong Hardwood Flooring Company|ORG |
|478 |484|Company |ALIAS |
|505 |509|Buyer |ALIAS |
|516 |529|Buyer Entities |ALIAS |
|542 |548|Arizona |ALIAS |
|574 |587|Buyer Entities |ALIAS |
|611 |615|Party |ALIAS |
|641 |647|Parties |ALIAS |
+-----+---+-----------------------------------+---------+
# token level result
result.selectExpr("explode(ner_label) as a") \
.selectExpr("a.begin",
"a.end",
"a.metadata.word as word",
"a.result as chunk").show(50, False)
+-----+---+------------+-------+
|begin|end|word |chunk |
+-----+---+------------+-------+
|0 |3 |This |O |
|5 |16 |INTELLECTUAL|O |
|18 |25 |PROPERTY |O |
|27 |35 |AGREEMENT |O |
|37 |37 |( |O |
|38 |41 |this |O |
|43 |43 |" |O |
|44 |52 |Agreement |O |
|53 |55 |"), |O |
|57 |61 |dated |O |
|63 |64 |as |O |
|66 |67 |of |O |
|69 |76 |December |O |
|78 |79 |31 |O |
|80 |80 |, |O |
|82 |85 |2018 |O |
|87 |87 |( |O |
|88 |90 |the |O |
|92 |92 |" |O |
|93 |101|Effective |O |
|103 |106|Date |O |
|107 |108|") |O |
|110 |111|is |O |
|113 |119|entered |O |
|121 |124|into |O |
|126 |127|by |O |
|129 |131|and |O |
|133 |139|between |O |
|141 |149|Armstrong |B-ORG |
|151 |158|Flooring |I-ORG |
|159 |159|, |I-ORG |
|161 |163|Inc |I-ORG |
|164 |165|., |I-ORG |
|167 |167|a |O |
|169 |176|Delaware |O |
|178 |188|corporation |O |
|190 |191|(" |O |
|192 |197|Seller |B-ALIAS|
|198 |199|") |0 |
|201 |203|and |0 |
|205 |207|AFI |B-ORG |
|209 |217|Licensing |I-ORG |
|219 |221|LLC |I-ORG |
|222 |222|, |O |
|224 |224|a |O |
|226 |233|Delaware |O |
|235 |241|limited |O |
|243 |251|liability |O |
|253 |259|company |O |
|261 |262|(" |O |
+-----+---+------------+-------+
only showing top 50 rows
import spark.implicits._
// Pipeline stages are defined where NER is done. NER is converted to chunks.
val docAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embs")
val nerModel = MedicalNerModel.pretrained("ner_posology_greedy","en","clinical/models")
.setInputCols(Array("sentence","token","embs"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
// Define the IOB tagger,which needs tokens and chunks as input. Show results.
val iobTagger = new IOBTagger()
.setInputCols(Array("token","ner_chunk"))
.setOutputCol("ner_label")
val pipeline = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
iobTagger))
val text = "The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO."
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df) .transform(df)
// chunk level result
+-----+---+-------------------------------------------+---------+
|begin|end|ner_chunk |ner_label|
+-----+---+-------------------------------------------+---------+
|27 |50 |1 capsule of Advil 10 mg |DRUG |
|52 |61 |for 5 days |DURATION |
|67 |109|magnesium hydroxide 100mg/1ml suspension PO|DRUG |
+-----+---+-------------------------------------------+---------+
// token level result
+-----+---+----------+----------+
|begin|end|word |chunk |
+-----+---+----------+----------+
|0 |2 |The |0 |
|4 |10 |patient |0 |
|12 |14 |was |0 |
|16 |25 |prescribed|0 |
|27 |27 |1 |B-DRUG |
|29 |35 |capsule |I-DRUG |
|37 |38 |of |I-DRUG |
|40 |44 |Advil |I-DRUG |
|46 |47 |10 |I-DRUG |
|49 |50 |mg |I-DRUG |
|52 |54 |for |B-DURATION|
|56 |56 |5 |I-DURATION|
|58 |61 |days |I-DURATION|
|63 |65 |and |0 |
|67 |75 |magnesium |B-DRUG |
|77 |85 |hydroxide |I-DRUG |
|87 |95 |100mg/1ml |I-DRUG |
|97 |106|suspension|I-DRUG |
|108 |109|PO |I-DRUG |
|110 |110|. |0 |
+-----+---+----------+----------+
import spark.implicits._
// Pipeline stages are defined where NER is done. NER is converted to chunks.
val docAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence","token"))
.setOutputCol("embs")
val nerModel = FinanceNerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence","token","embs"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
// Define the IOB tagger,which needs tokens and chunks as input. Show results.
val iobTagger = new IOBTagger()
.setInputCols(Array("token","ner_chunk"))
.setOutputCol("ner_label")
val pipeline = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
iobTagger))
val text = """In 2020, we acquired certain assets of Spell Security Private Limited (also known as "Spell Security") . More specifically,their Compliance product - Policy Compliance (PC))."""
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df) .transform(df)
// chunk level result
+-----+---+------------------------------+---------+
|begin|end|ner_chunk |ner_label|
+-----+---+------------------------------+---------+
|39 |68 |Spell Security Private Limited|ORG |
|86 |99 |Spell Security |ALIAS |
|129 |138|Compliance |PRODUCT |
|150 |166|Policy Compliance |PRODUCT |
|169 |170|PC |ALIAS |
+-----+---+------------------------------+---------+
// token level result
+-----+---+------------+---------+
|begin|end|word |chunk |
+-----+---+------------+---------+
|0 |1 |In |0 |
|3 |6 |2020 |0 |
|7 |7 |, |0 |
|9 |10 |we |0 |
|12 |19 |acquired |0 |
|21 |27 |certain |0 |
|29 |34 |assets |0 |
|36 |37 |of |0 |
|39 |43 |Spell |B-ORG |
|45 |52 |Security |I-ORG |
|54 |60 |Private |I-ORG |
|62 |68 |Limited |I-ORG |
|70 |70 |( |0 |
|71 |74 |also |0 |
|76 |80 |known |0 |
|82 |83 |as |0 |
|85 |85 |" |0 |
|86 |90 |Spell |B-ALIAS |
|92 |99 |Security |I-ALIAS |
|100 |102|"). |0 |
|104 |107|More |0 |
|109 |120|specifically|0 |
|121 |121|, |0 |
|123 |127|their |0 |
|129 |138|Compliance |B-PRODUCT|
|140 |146|product |0 |
|148 |148|- |0 |
|150 |155|Policy |B-PRODUCT|
|157 |166|Compliance |I-PRODUCT|
|168 |168|( |0 |
|169 |170|PC |B-ALIAS |
|171 |174|)"). |0 |
+-----+---+------------+---------+
import spark.implicits._
// Pipeline stages are defined where NER is done. NER is converted to chunks.
val docAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence","token"))
.setOutputCol("embs")
val ner_model = LegalNerModel.pretrained("legner_orgs_prods_alias","en","legal/models")
.setInputCols(Array("sentence","token","embs"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
// Define the IOB tagger,which needs tokens and chunks as input. Show results.
val iobTagger = new IOBTagger()
.setInputCols(Array("token","ner_chunk"))
.setOutputCol("ner_label")
val pipeline = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
nerConverter,
iobTagger))
val text = """This
INTELLECTUAL PROPERTY AGREEMENT (this "Agreement") ,dated as of December 31,2018 (the "Effective Date") is entered into by and between Armstrong Flooring,Inc.,a Delaware corporation ("Seller") and AFI Licensing LLC,a Delaware limited liability company ("Licensing" and together with Seller,"Arizona") and AHF Holding,Inc. (formerly known as Tarzan HoldCo,Inc.) ,a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company,a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand,a "Party" and collectively,the "Parties") ."""
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df) .transform(df)
// chunk level result
+-----+---+-----------------------------------+---------+
|begin|end|ner_chunk |ner_label|
+-----+---+-----------------------------------+---------+
|141 |165|Armstrong Flooring, Inc., |ORG |
|192 |197|Seller |ALIAS |
|205 |221|AFI Licensing LLC |ORG |
|263 |271|Licensing |ALIAS |
|292 |297|Seller |ALIAS |
|301 |307|Arizona |ALIAS |
|315 |330|AHF Holding, Inc |ORG |
|399 |403|Buyer |ALIAS |
|411 |445|Armstrong Hardwood Flooring Company|ORG |
|478 |484|Company |ALIAS |
|505 |509|Buyer |ALIAS |
|516 |529|Buyer Entities |ALIAS |
|542 |548|Arizona |ALIAS |
|574 |587|Buyer Entities |ALIAS |
|611 |615|Party |ALIAS |
|641 |647|Parties |ALIAS |
+-----+---+-----------------------------------+---------+
// token level result
+-----+---+------------+-------+
|begin|end|word |chunk |
+-----+---+------------+-------+
|0 |3 |This |0 |
|5 |16 |INTELLECTUAL|0 |
|18 |25 |PROPERTY |0 |
|27 |35 |AGREEMENT |0 |
|37 |37 |( |0 |
|38 |41 |this |0 |
|43 |43 |" |0 |
|44 |52 |Agreement |0 |
|53 |55 |"), |0 |
|57 |61 |dated |0 |
|63 |64 |as |0 |
|66 |67 |of |0 |
|69 |76 |December |0 |
|78 |79 |31 |0 |
|80 |80 |, |0 |
|82 |85 |2018 |0 |
|87 |87 |( |0 |
|88 |90 |the |0 |
|92 |92 |" |0 |
|93 |101|Effective |0 |
|103 |106|Date |0 |
|107 |108|") |0 |
|110 |111|is |0 |
|113 |119|entered |0 |
|121 |124|into |0 |
|126 |127|by |0 |
|129 |131|and |0 |
|133 |139|between |0 |
|141 |149|Armstrong |B-ORG |
|151 |158|Flooring |I-ORG |
|159 |159|, |I-ORG |
|161 |163|Inc |I-ORG |
|164 |165|., |I-ORG |
|167 |167|a |0 |
|169 |176|Delaware |0 |
|178 |188|corporation |0 |
|190 |191|(" |0 |
|192 |197|Seller |B-ALIAS|
|198 |199|") |0 |
|201 |203|and |0 |
|205 |207|AFI |B-ORG |
|209 |217|Licensing |I-ORG |
|219 |221|LLC |I-ORG |
|222 |222|, |0 |
|224 |224|a |0 |
|226 |233|Delaware |0 |
|235 |241|limited |0 |
|243 |251|liability |0 |
|253 |259|company |0 |
|261 |262|(" |0 |
+-----+---+------------+-------+
only showing top 50 rows
InternalDocumentSplitter
InternalDocumentSplitter splits large documents into smaller documents. It provides a setSplitMode method to decide how documents should be split.
If splitMode is recursive, it takes the separators in order and splits subtexts if they are over the chunk length, considering optional overlap of the chunks.
Additionally, you can set
- custom patterns with setSplitPatterns
- whether patterns should be interpreted as regex with setPatternsAreRegex
- whether to keep the separators with setKeepSeparators
- whether to trim whitespaces with setTrimWhitespace
- whether to explode the splits to individual rows with setExplodeSplits
Parameters:
- chunkSize: Size of each chunk of text. This param is applicable only for "recursive" splitMode.
- chunkOverlap: Length of the overlap between text chunks, by default 0. This param is applicable only for "recursive" splitMode.
- splitPatterns: Patterns to split the document.
- patternsAreRegex: Whether to interpret the split patterns as regular expressions, by default True.
- keepSeparators: Whether to keep the separators in the final result, by default True. This param is applicable only for "recursive" splitMode.
- explodeSplits: Whether to explode split chunks to separate rows, by default False.
- trimWhitespace: Whether to trim whitespaces of extracted chunks, by default True.
- splitMode: The split mode that determines how text should be segmented. Default: "regex". It should be one of the following values:
  - "char": Split text based on individual characters.
  - "token": Split text based on tokens. You should supply tokens from inputCols.
  - "sentence": Split text based on sentences. You should supply sentences from inputCols.
  - "recursive": Split text recursively using a specific algorithm.
  - "regex": Split text based on a regular expression pattern.
- sentenceAwareness: Whether to split the document with sentence awareness if possible. If true, the split process can stop before maxLength, and you should supply sentences from inputCols. Default: False. This param is not applicable for the "regex" and "recursive" splitModes.
- maxLength: The maximum length allowed for splitting. How the maximum length is measured depends on splitMode:
  - "char": Maximum length is measured in characters. Default: 512
  - "token": Maximum length is measured in tokens. Default: 128
  - "sentence": Maximum length is measured in sentences. Default: 8
- customBoundsStrategy: The custom bounds strategy for text splitting using regular expressions. This param is applicable only for "regex" splitMode.
- caseSensitive: Whether to match the regex case-sensitively, by default False. This param is applicable only for "regex" splitMode.
- metaDataFields: Metadata fields to add specified data in columns to the metadata of the split documents. You should set column names to read columns.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Notebook: InternalDocumentSplitterNotebook |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
document_splitter = medical.InternalDocumentSplitter()\
.setInputCols("document")\
.setOutputCol("splits")\
.setSplitMode("recursive")\
.setChunkSize(100)\
.setChunkOverlap(3)\
.setExplodeSplits(True)\
.setPatternsAreRegex(False)\
.setSplitPatterns(["\n\n", "\n", " "])\
.setKeepSeparators(False)\
.setTrimWhitespace(True)
pipeline = nlp.Pipeline().setStages([
document_assembler,
document_splitter
])
df = spark.createDataFrame([[(
"The patient is a 28-year-old, who is status post gastric bypass surgery"
" nearly one year ago. \nHe has lost about 200 pounds and was otherwise doing well"
" until yesterday evening around 7:00-8:00 when he developed nausea and right upper quadrant pain,"
" which apparently wrapped around toward his right side and back. He feels like he was on it"
" but has not done so. He has overall malaise and a low-grade temperature of 100.3."
" \n\nHe denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday."
" He denies any outright chills or blood per rectum."
)]]).toDF("text")
pipeline.fit(df).transform(df).select("splits").show(truncate=False)
## Result
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 92, The patient is a 28-year-old, who is status post gastric bypass surgery nearly one year ago., {sentence -> 0, document -> 0}, []}] |
|[{document, 94, 192, He has lost about 200 pounds and was otherwise doing well until yesterday evening around 7:00-8:00, {sentence -> 0, document -> 1}, []}] |
|[{document, 193, 291, when he developed nausea and right upper quadrant pain, which apparently wrapped around toward his, {sentence -> 0, document -> 2}, []}] |
|[{document, 288, 387, his right side and back. He feels like he was on it but has not done so. He has overall malaise and, {sentence -> 0, document -> 3}, []}]|
|[{document, 384, 421, and a low-grade temperature of 100.3., {sentence -> 0, document -> 4}, []}] |
|[{document, 424, 520, He denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday. He, {sentence -> 0, document -> 5}, []}] |
|[{document, 518, 568, He denies any outright chills or blood per rectum., {sentence -> 0, document -> 6}, []}] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val document_splitter = new InternalDocumentSplitter()
.setInputCols("document")
.setOutputCol("splits")
.setSplitMode("recursive")
.setChunkSize(100)
.setChunkOverlap(3)
.setExplodeSplits(true)
.setPatternsAreRegex(false)
.setSplitPatterns(Array("\n\n", "\n", " "))
.setKeepSeparators(false)
.setTrimWhitespace(true)
val pipeline = new Pipeline().setStages(Array(
document_assembler,
document_splitter ))
val test_data = Seq("The patient is a 28-year-old, who is status post gastric bypass surgery" +
  " nearly one year ago. \nHe has lost about 200 pounds and was otherwise doing well" +
  " until yesterday evening around 7:00-8:00 when he developed nausea and right upper quadrant pain," +
  " which apparently wrapped around toward his right side and back. He feels like he was on it" +
  " but has not done so. He has overall malaise and a low-grade temperature of 100.3." +
  " \n\nHe denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday." +
  " He denies any outright chills or blood per rectum.").toDF("text")
val res = pipeline.fit(test_data).transform(test_data)
// Show results
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|splits |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 92, The patient is a 28-year-old, who is status post gastric bypass surgery nearly one year ago., {sentence -> 0, document -> 0}, []}] |
|[{document, 94, 192, He has lost about 200 pounds and was otherwise doing well until yesterday evening around 7:00-8:00, {sentence -> 0, document -> 1}, []}] |
|[{document, 193, 291, when he developed nausea and right upper quadrant pain, which apparently wrapped around toward his, {sentence -> 0, document -> 2}, []}] |
|[{document, 288, 387, his right side and back. He feels like he was on it but has not done so. He has overall malaise and, {sentence -> 0, document -> 3}, []}]|
|[{document, 384, 421, and a low-grade temperature of 100.3., {sentence -> 0, document -> 4}, []}] |
|[{document, 424, 520, He denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday. He, {sentence -> 0, document -> 5}, []}] |
|[{document, 518, 568, He denies any outright chills or blood per rectum., {sentence -> 0, document -> 6}, []}] |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
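The example above only demonstrates the recursive split mode. As a minimal complementary sketch (not part of the official example; it reuses the document_assembler and df defined in the Python example above and assumes only the setters listed in the Parameters section), the same splitter can instead be driven by a regular expression:
from johnsnowlabs import nlp, medical
# Hypothetical regex-mode configuration; split on blank lines instead of recursive chunking
regex_splitter = medical.InternalDocumentSplitter()\
    .setInputCols("document")\
    .setOutputCol("splits")\
    .setSplitMode("regex")\
    .setSplitPatterns(["\n\n"])\
    .setPatternsAreRegex(True)\
    .setExplodeSplits(True)\
    .setTrimWhitespace(True)
regex_pipeline = nlp.Pipeline().setStages([document_assembler, regex_splitter])
regex_pipeline.fit(df).transform(df).select("splits.result").show(truncate=False)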
LLMLoader
LLMLoader is designed to interact with LLMs that have been converted into GGUF format. This module allows using John Snow Labs' licensed LLMs at various sizes that are fine-tuned on medical context for certain tasks. It provides various methods for setting parameters, loading models, generating text, and retrieving metadata. The LLMLoader includes methods for setting parameters such as input prefix, suffix, cache prompt, number of tokens to predict, sampling techniques, temperature, penalties, and more. Overall, the LLMLoader provides a flexible and extensible framework for interacting with language models in Python and Scala environments using PySpark and Java.
Parameters:
- inputPrefix: Prefix for infilling (default: empty)
- inputSuffix: Suffix for infilling (default: empty)
- cachePrompt: Whether to remember the prompt to avoid reprocessing it
- nPredict: Number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
- topK: Top-k sampling (default: 40, 0 = disabled)
- topP: Top-p sampling (default: 0.9, 1.0 = disabled)
- minP: Min-p sampling (default: 0.1, 0.0 = disabled)
- tfsZ: Tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
- typicalP: Locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
- temperature: The temperature (default: 0.8)
- dynatempRange: Dynamic temperature range (default: 0.0, 0.0 = disabled)
- dynatempExponent: Dynamic temperature exponent (default: 1.0)
- repeatLastN: Last n tokens to consider for penalties (default: 64, 0 = disabled, -1 = ctx_size)
- repeatPenalty: Penalty of repeated sequences of tokens (default: 1.0, 1.0 = disabled)
- frequencyPenalty: Repetition alpha frequency penalty (default: 0.0, 0.0 = disabled)
- presencePenalty: Repetition alpha presence penalty (default: 0.0, 0.0 = disabled)
- mirostatTau: MiroStat target entropy, parameter tau (default: 5.0)
- mirostatEta: MiroStat learning rate, parameter eta (default: 0.1)
- penalizeNl: Whether to penalize newline tokens
- nKeep: Number of tokens to keep from the initial prompt (default: 0, -1 = all)
- seed: RNG seed (default: -1, use a random seed for values < 0)
- nProbs: Number of top token probabilities to output, if greater than 0
- minKeep: Minimum number of tokens the samplers should return (0 = disabled)
- grammar: BNF-like grammar to constrain generations (see samples in the grammars/ dir)
- penaltyPrompt: Override which part of the prompt is penalized for repetition. E.g. if the original prompt is "Alice: Hello!" and penaltyPrompt is "Hello!", only the latter will be penalized if repeated. See pull request 3727 for more details.
- penaltyPromptTokens: PenaltyPromptTokens
- ignoreEos: Whether to ignore the end-of-stream token and continue generating (implies --logit-bias 2-inf)
- stopStrings: Strings upon seeing which token generation is stopped
- useChatTemplate: Whether or not generate should apply a chat template (default: false)
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: LLMLoader | Scala API: LLMLoader |
Show Example
from sparknlp_jsl.llm import LLMLoader
llm_loader_pretrained = LLMLoader(spark).pretrained("JSL_MedS_q16_v1", "en", "clinical/models")
llm_loader_pretrained.generate("What is the indication for the drug Methadone?")
## Result
'Methadone is used to treat opioid addiction. It is a long-acting opioid agonist that is used to help individuals who are addicted to short-acting opioids such as heroin or other illicit opioids. It is also used to treat chronic pain in patients who have developed tolerance to other opioids.'
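A slightly more explicit Python sketch follows; it assumes the Python LLMLoader exposes the same setters used in the Scala snippet below (setTemperature, setNPredict, setUseChatTemplate, setStopStrings), so treat it as illustrative rather than the canonical API:
from sparknlp_jsl.llm import LLMLoader
# Hedged sketch: configure generation parameters before calling generate()
llm_loader = LLMLoader(spark).pretrained("JSL_MedS_q16_v1", "en", "clinical/models")
llm_loader.setTemperature(0.0)        # deterministic decoding
llm_loader.setNPredict(100)           # cap the number of generated tokens
llm_loader.setUseChatTemplate(False)  # plain completion, no chat template
llm_loader.setStopStrings([])         # no custom stop strings
print(llm_loader.generate("What is the indication for the drug Methadone?"))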
val llmLoader = new LLMLoader()
.setSparkSession(spark)
.pretrained("JSL_MedS_q16_v1", remoteLoc="clinical/models", lang = "en")
.setTemperature(0.0f)
.setNPredict(20)
val prompt = "What is the indication for the drug Methadone?"
val output = llmLoader.setUseChatTemplate(false).setStopStrings(Array.empty[String]).generate(prompt)
println(output)
// Result
'Methadone is used to treat opioid addiction. It is a long-acting opioid agonist that is used to help individuals who are addicted to short-acting opioids such as heroin or other illicit opioids. It is also used to treat chronic pain in patients who have developed tolerance to other opioids.'
LargeFewShotClassifier
The LargeFewShotClassifierModel annotator is designed to work effectively with minimal labeled data, offering flexibility and adaptability to new, unseen classes. Key parameters include batch size, case sensitivity, and maximum sentence length. The Large Few-Shot Classifier model can achieve impressive performance even with minimal labeled data.
Parameters:
- inputCols: The name of the columns containing the input annotations. It can read either a String column or an Array.
- outputCol: The name of the column of Document type that is generated. We can specify only one column here.
Input Annotator Types: DOCUMENT
Output Annotator Type: CATEGORY
Python API: LargeFewShotClassifier | Scala API: LargeFewShotClassifier | Notebook: Mapper2Chunk |
Show Example
from johnsnowlabs import nlp, medical
from pyspark.sql.types import StringType
from pyspark.sql.functions import col
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
large_few_shot_classifier = medical.LargeFewShotClassifierModel()\
.pretrained("large_fewshot_classifier_ade", "en", "clinical/models")\
.setInputCols("document")\
.setOutputCol("prediction")
pipeline = nlp.Pipeline().setStages([
document_assembler,
large_few_shot_classifier
])
data = spark.createDataFrame(["The patient developed severe liver toxicity after taking the medication for three weeks",
"He experienced no complications during the treatment and reported feeling much better.",
"She experienced a sudden drop in blood pressure after the administration of the new drug.",
"The doctor recommended a daily dosage of the vitamin supplement to improve her health."], StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("text", col("prediction.result").getItem(0).alias("result")).show(truncate=False)
## Result
+-----------------------------------------------------------------------------------------+------+
|text |result|
+-----------------------------------------------------------------------------------------+------+
|The patient developed severe liver toxicity after taking the medication for three weeks |ADE |
|He experienced no complications during the treatment and reported feeling much better. |noADE |
|She experienced a sudden drop in blood pressure after the administration of the new drug.|ADE |
|The doctor recommended a daily dosage of the vitamin supplement to improve her health. |noADE |
+-----------------------------------------------------------------------------------------+------+
import spark.implicits._
import org.apache.spark.sql.functions.col
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val largeFewShotClassifier = LargeFewShotClassifierModel
  .pretrained("large_fewshot_classifier_ade", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
largeFewShotClassifier
))
val textList = Seq(
  "The patient developed severe liver toxicity after taking the medication for three weeks",
  "He experienced no complications during the treatment and reported feeling much better.",
  "She experienced a sudden drop in blood pressure after the administration of the new drug.",
  "The doctor recommended a daily dosage of the vitamin supplement to improve her health."
)
val data = textList.toDF("text")
val result = pipeline.fit(data).transform(data)
result.select(col("text"), col("prediction.result").getItem(0).alias("result")).show(truncate = false)
// Result
+-----------------------------------------------------------------------------------------+------+
|text |result|
+-----------------------------------------------------------------------------------------+------+
|The patient developed severe liver toxicity after taking the medication for three weeks |ADE |
|He experienced no complications during the treatment and reported feeling much better. |noADE |
|She experienced a sudden drop in blood pressure after the administration of the new drug.|ADE |
|The doctor recommended a daily dosage of the vitamin supplement to improve her health. |noADE |
+-----------------------------------------------------------------------------------------+------+
LightDeIdentification
Light DeIdentification is a light version of DeIdentification. It replaces sensitive information in a text with obfuscated (fake) values or masks. It is designed to work with healthcare data and can be used to de-identify patient names, dates, and other sensitive information. It can also be used to obfuscate or mask any other type of sensitive information, such as doctor names, hospital names, and other identifiers. It supports millions of embedded fakers and, if desired, custom external fakers can be set with the setCustomFakers function. It supports multiple languages such as English, Spanish, French, German, and Arabic, and it supports multi-mode de-identification with the setSelectiveObfuscationModes function, applying different modes at the same time.
Parameters:
- mode (str): Mode for the anonymizer ['mask', 'obfuscate'].
- dateEntities (list[str]): List of date entities. Default: ['DATE', 'DOB', 'DOD'].
- obfuscateDate (Bool): When mode == 'obfuscate', whether to obfuscate dates or not. This param helps with consistency to make dateFormats more visible. When setting it to True, make sure the dateFormats param fits the needs. If the value is True and obfuscation fails, the unnormalizedDateMode param will be activated. When setting it to False, the date will be masked. Default: False.
- unnormalizedDateMode (str): The mode to use if the date is not formatted. Options: [mask, obfuscate, skip]. Default: obfuscate.
- days (Int): Number of days to obfuscate the dates by displacement. If not provided, a random integer between 1 and 60 will be used.
- useShiftDays (Bool): Whether to use the random shift day when the document has this in its metadata. Default: False.
- dateFormats (list[str]): List of date formats to automatically displace if parsed.
- region (str): The region to use for date parsing. This property is especially used when obfuscating dates. You can decide whether the first part of 11/11/2023 is a day or the second part is a day when obfuscating dates. Options: 'eu' for the European Union, 'us' for the USA. Default: 'eu'.
- obfuscateRefSource (str): The source used to obfuscate the entities. For date entities, this property does not apply. The values are the following: 'custom' takes the entities from the setCustomFakers function; 'faker' takes the entities from the Faker module; 'both' takes the entities from the setCustomFakers function and the Faker module randomly.
- language (str): The language used to select the regex file and some faker entities. The values are the following: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'.
- seed (Int): The seed used to select the entities in obfuscate mode. With the seed, you can replay an execution several times with the same output.
- maskingPolicy (str): Select the masking policy: 'same_length_chars' replaces the obfuscated entity with a masking sequence composed of asterisks and surrounding square brackets, the total length of the masking sequence being the same as the original sequence (example: Smith -> [***]; if the entity is shorter than 3 chars, like Jo or 5, asterisks without brackets will be returned); 'entity_labels' replaces the values with the corresponding entity labels; 'fixed_length_chars' replaces the obfuscated entity with a masking sequence composed of a fixed number of asterisks.
- fixedMaskLength (Int): The length of the masking sequence in the case of the fixed_length_chars masking policy.
- sameLengthFormattedEntities (list[str]): List of formatted entities for which outputs of the same length as the original ones are generated during obfuscation. The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE.
- genderAwareness (Bool): Whether to use gender-aware names or not during obfuscation. This param affects only names. If the value is true, it might decrease performance. Default: False.
- ageRanges (list[int]): List of integers specifying the limits of the age groups to preserve during obfuscation.
- selectiveObfuscationModes (dict): The dictionary of modes to enable multi-mode de-identification. 'obfuscate' replaces the values with random values; 'mask_same_length_chars' replaces the name with asterisks of the same length minus two, plus brackets on both ends; 'mask_entity_labels' replaces the values with the entity labels; 'mask_fixed_length_chars' replaces the name with asterisks of a fixed length (you can also invoke setFixedMaskLength()); 'skip' leaves the values intact. Entities that are not given in the dictionary will be de-identified according to the mode param.
- customFakers (dict): The dictionary of custom fakers to specify the obfuscation terms for the entities. You can specify the entity and the terms to be used for obfuscation.
- keepYear (Bool): Whether to keep the year intact when obfuscating date entities. If True, the year will remain unchanged during the obfuscation process. If False, the year will be modified along with the month and day. Default: False.
- keepMonth (Bool): Whether to keep the month intact when obfuscating date entities. If True, the month will remain unchanged during the obfuscation process. If False, the month will be modified along with the year and day. Default: False.
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: DOCUMENT
Python API: LightDeIdentification | Scala API: LightDeIdentification | Notebook: LightDeIdentification |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
# Sentence Detector annotator, processes various sentences per line
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets)
ner_subentity = medical.NerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner_subentity")
ner_converter = medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner_subentity"])\
.setOutputCol("ner_chunk")
light_deidentification = medical.LightDeIdentification() \
.setInputCols(["ner_chunk", "sentence"]) \
.setOutputCol("obfuscated") \
.setMode("obfuscate") \
.setObfuscateDate(True)\
.setDateFormats(["MM/dd/yyyy","yyyy-MM-dd", "MM/dd/yy"]) \
.setDays(7) \
.setObfuscateRefSource('custom') \
.setCustomFakers({"Doctor": ["John", "Joe"],
"Patient": ["James", "Michael"],
"Hospital": ["Medical Center"],
"Street" : ["Main Street"],
"Age":["1","10", "20", "40","80"],
"PHONE":["555-555-0000"]}) \
.setAgeRanges([1, 4, 12, 20, 40, 60, 80])\
.setLanguage("en") \
.setSeed(42) \
    .setDateEntities(["DATE", "DOB", "DOD"])
flattener = medical.Flattener()\
.setInputCols("obfuscated","sentence")\
.setExplodeSelectedFields({"obfuscated": ["result"], "sentence": ["result"]})
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner_subentity,
ner_converter,
light_deidentification,
flattener
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
text ='''
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 95 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555.
'''
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.show(truncate=False)
## Result
+----------------------------------------------------------------------+-----------------------------------------------------+
|sentence_result |obfuscated_result |
+----------------------------------------------------------------------+-----------------------------------------------------+
|Record date : 2093-01-13 , David Hale , M.D . |Record date : 2093-01-20 , John , M.D . |
|,\nName : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . |,\nName : Michael , MR # 1478295 Date : 01/20/93 . |
|PCP : Oliveira , 95 years-old , Record date : 2079-11-09 . |PCP : Joe , 95 years-old , Record date : 2079-11-16 .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555.|Medical Center , Main Street , Phone 62-130-8657. |
+----------------------------------------------------------------------+-----------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val nerSubEntity = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_subentity")
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_subentity"))
.setOutputCol("ner_chunk")
val lightDeidentification = new LightDeIdentification()
.setInputCols(Array("ner_chunk", "sentence"))
.setOutputCol("obfuscated")
.setMode("obfuscate")
.setObfuscateDate(true)
.setDateFormats(Array("MM/dd/yyyy", "yyyy-MM-dd", "MM/dd/yy"))
.setDays(7)
.setObfuscateRefSource("custom")
.setCustomFakers(Map("Doctor" -> Array("John", "Joe"),
"Patient" -> Array("James", "Michael"),
"Hospital" -> Array("Medical Center"),
"Street" -> Array("Main Street"),
"Age" -> Array("1", "10", "20", "40", "80"),
"PHONE" -> Array("555-555-0000")))
.setAgeRanges(Array(1, 4, 12, 20, 40, 60, 80))
.setLanguage("en")
.setSeed(42)
.setDateEntities(Array("DATE", "DOB", "DOD"))
val flattener = new Flattener()
.setInputCols(Array("obfuscated", "sentence"))
.setExplodeSelectedFields(Map("obfuscated" -> Array("result"), "sentence" -> Array("result")))
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
wordEmbeddings,
nerSubEntity,
nerConverter,
lightDeidentification,
flattener
))
val emptyData = Seq("").toDF("text")
val model = nlpPipeline.fit(emptyData)
val text = """
Record date : 2093-01-13 , David Hale , M.D . ,
Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 .
PCP : Oliveira , 95 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555.
"""
val result = model.transform(Seq(text).toDF("text"))
// Result
+----------------------------------------------------------------------+-----------------------------------------------------+
|sentence_result |obfuscated_result |
+----------------------------------------------------------------------+-----------------------------------------------------+
|Record date : 2093-01-13 , David Hale , M.D . |Record date : 2093-01-20 , John , M.D . |
|,\nName : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . |,\nName : Michael , MR # 1478295 Date : 01/20/93 . |
|PCP : Oliveira , 95 years-old , Record date : 2079-11-09 . |PCP : Joe , 95 years-old , Record date : 2079-11-16 .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555.|Medical Center , Main Street , Phone 62-130-8657. |
+----------------------------------------------------------------------+-----------------------------------------------------+
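The pipelines above only use the obfuscate mode. Below is a hedged sketch of multi-mode de-identification; it assumes the setters mirror the parameter names listed earlier (setSelectiveObfuscationModes, setFixedMaskLength) and that the dictionary maps each mode to the entity labels it should handle, so verify the exact shape against the current API:
# Hypothetical multi-mode configuration: different de-identification behavior per entity type
multi_mode_deid = medical.LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("deidentified") \
    .setMode("mask") \
    .setSelectiveObfuscationModes({
        "obfuscate": ["DOCTOR", "PATIENT"],       # replace names with fake values
        "mask_entity_labels": ["DATE"],           # replace dates with their entity label
        "mask_fixed_length_chars": ["PHONE"],     # replace phone numbers with a fixed-length mask
        "skip": ["HOSPITAL"]                      # leave hospital names intact
    }) \
    .setFixedMaskLength(4)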
Mapper2Chunk
This annotator converts 'LABELED_DEPENDENCY' type annotations coming from ChunkMapper into 'CHUNK' type to create a new chunk-type column, compatible with annotators that use chunk type as input.
Parameters:
filterNoneValues
: (Bool) Filter ‘NONE’ values
Input Annotator Types: LABELED_DEPENDENCY
Output Annotator Type: CHUNK
Python API: Mapper2Chunk | Scala API: Mapper2Chunk | Notebook: Mapper2Chunk |
Show Example
from johnsnowlabs import nlp, medical
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
    .setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
    .setOutputCol("token")
# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets
clinical_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter_name = medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
chunkMapper = medical.ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models") \
.setInputCols(["ner_chunk"]) \
.setOutputCol("relations") \
.setRels(["action"])
mapper2chunk = medical.Mapper2Chunk() \
.setInputCols(["relations"]) \
.setOutputCol("chunk") \
.setFilterNoneValues(True)
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter_name,
chunkMapper,
mapper2chunk
])
sample_text = "Patient resting in bed. Patient given azithromycin without any difficulty. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating"
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
## Result
+--------------------------+--------------+
|result |annotatorType |
+--------------------------+--------------+
|[bactericidal, antiemetic]|[chunk, chunk]|
+--------------------------+--------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter_name = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val chunkMapper = ChunkMapperModel.pretrained("drug_action_treatment_mapper", "en", "clinical/models")
.setInputCols("ner_chunk")
.setOutputCol("relations")
    .setRels(Array("action"))
val mapper2chunk = new Mapper2Chunk()
.setInputCols("relations")
.setOutputCol("chunk")
    .setFilterNoneValues(true)
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter_name,
chunkMapper,
mapper2chunk))
val test_data = Seq("""Patient resting in bed. Patient given azithromycin without any difficulty. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating""").toDF("text")
val res = nlpPipeline.fit(test_data).transform(test_data)
// Show results
+--------------------------+--------------+
|result |annotatorType |
+--------------------------+--------------+
|[bactericidal, antiemetic]|[chunk, chunk]|
+--------------------------+--------------+
MedicalLLM
MedicalLLM was designed to load and run large language models (LLMs) in GGUF format with scalable performance. Ideal for clinical and healthcare applications, MedicalLLM supports tasks like medical entity extraction, summarization, Q&A, Retrieval Augmented Generation (RAG), and conversational AI. With simple integration into Spark NLP pipelines, it allows for customizable batch sizes, prediction settings, and chat templates. GPU optimization is also available, enhancing its capabilities for high-performance environments. MedicalLLM empowers users to link medical entities and perform complex NLP tasks with efficiency and precision.
To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set the number of GPU layers with the setNGpuLayers method. When using larger models, we recommend adjusting GPU usage with setNCtx and setNGpuLayers according to your hardware to avoid out-of-memory errors.
Parameters:
- inputPrefix: Prefix for infilling (default: empty)
- inputSuffix: Suffix for infilling (default: empty)
- cachePrompt: Whether to remember the prompt to avoid reprocessing it
- nPredict: Number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
- topK: Top-k sampling (default: 40, 0 = disabled)
- topP: Top-p sampling (default: 0.9, 1.0 = disabled)
- minP: Min-p sampling (default: 0.1, 0.0 = disabled)
- tfsZ: Tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
- typicalP: Locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
- temperature: The temperature (default: 0.8)
- dynatempRange: Dynamic temperature range (default: 0.0, 0.0 = disabled)
- dynatempExponent: Dynamic temperature exponent (default: 1.0)
- repeatLastN: Last n tokens to consider for penalties (default: 64, 0 = disabled, -1 = ctx_size)
- repeatPenalty: Penalty of repeated sequences of tokens (default: 1.0, 1.0 = disabled)
- frequencyPenalty: Repetition alpha frequency penalty (default: 0.0, 0.0 = disabled)
- presencePenalty: Repetition alpha presence penalty (default: 0.0, 0.0 = disabled)
- mirostatTau: MiroStat target entropy, parameter tau (default: 5.0)
- mirostatEta: MiroStat learning rate, parameter eta (default: 0.1)
- penalizeNl: Whether to penalize newline tokens
- nKeep: Number of tokens to keep from the initial prompt (default: 0, -1 = all)
- seed: RNG seed (default: -1, use a random seed for values < 0)
- nProbs: Number of top token probabilities to output, if greater than 0
- minKeep: Minimum number of tokens the samplers should return (0 = disabled)
- grammar: BNF-like grammar to constrain generations (see samples in the grammars/ dir)
- penaltyPrompt: Override which part of the prompt is penalized for repetition. E.g. if the original prompt is "Alice: Hello!" and penaltyPrompt is "Hello!", only the latter will be penalized if repeated. See pull request 3727 for more details.
- penaltyPromptTokens: PenaltyPromptTokens
- ignoreEos: Whether to ignore the end-of-stream token and continue generating (implies --logit-bias 2-inf)
- stopStrings: Strings upon seeing which token generation is stopped
- useChatTemplate: Whether or not generate should apply a chat template (default: false)
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: MedicalLLM | Scala API: MedicalLLM | Notebook: MedicalLLMNotebook |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
medical_llm = medical.AutoGGUFModel.pretrained("jsl_meds_ner_q4_v2", "en", "clinical/models")\
.setInputCols("document")\
.setOutputCol("completions")\
.setBatchSize(1)\
.setNPredict(100)\
.setUseChatTemplate(True)\
    .setTemperature(0)
#.setNGpuLayers(100) # if you have GPU
pipeline = nlp.Pipeline(
stages = [
document_assembler,
medical_llm
])
med_ner_prompt = """
### Template:
{
"drugs": [
{
"name": "",
"reactions": []
}
]
}
### Text:
I feel a bit drowsy & have a little blurred vision , and some gastric problems .
I 've been on Arthrotec 50 for over 10 years on and off , only taking it when I needed it .
Due to my arthritis getting progressively worse , to the point where I am in tears with the agony.
Gp 's started me on 75 twice a day and I have to take it every day for the next month to see how I get on , here goes .
So far its been very good , pains almost gone , but I feel a bit weird , did n't have that when on 50.
"""
data = spark.createDataFrame([[med_ner_prompt]]).toDF("text")
results = pipeline.fit(data).transform(data)
results.select("completions.result").show(truncate=False)
## Result
{
"drugs": [
{
"name": "Arthrotec",
"reactions": [
"drowsy",
"blurred vision",
"gastric problems"
]
}
]
}
</s> #### Template:
{"drugs": [{"name": "", "reaction": []}]}
#### Text:
The patient is a 65-year
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.MedicalLLM
import org.apache.spark.ml.Pipeline
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val medicalLLM = MedicalLLM.pretrained("jsl_meds_ner_q4_v2", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("completions")
.setBatchSize(1)
.setNPredict(100)
.setUseChatTemplate(true)
.setTemperature(0)
// .setNGpuLayers(100) if you have GPU
val pipeline = new Pipeline().setStages(
Array(
documentAssembler,
medicalLLM
))
val medPrompt =
"""
|### Template:
|{
|"drugs": [
|{
|"name": "",
|"reactions": []
|}
|]
|}
|### Text:
|I feel a bit drowsy & have a little blurred vision , and some gastric problems .
|I 've been on Arthrotec 50 for over 10 years on and off , only taking it when I needed it .
|Due to my arthritis getting progressively worse , to the point where I am in tears with the agony.
|Gp 's started me on 75 twice a day and I have to take it every day for the next month to see how I get on , here goes .
|So far its been very good , pains almost gone , but I feel a bit weird , did n't have that when on 50.
|""".stripMargin
val data = Seq(medPrompt).toDF("text")
val results = pipeline.fit(data).transform(data)
results.select("completions.result").show(false)
// Result
{
"drugs": [
{
"name": "Arthrotec",
"reactions": [
"drowsy",
"blurred vision",
"gastric problems"
]
}
]
}
</s> #### Template:
{"drugs": [{"name": "", "reaction": []}]}
#### Text:
The patient is a 65-year
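When GPU inference is enabled (see the note above about the Spark NLP GPU package), the relevant tuning calls might look like the following sketch; setNGpuLayers appears in the commented-out line of the Python example, while setNCtx is an assumption based on the parameter name mentioned above, so verify both against the current API:
# Hedged sketch of GPU-oriented settings for the same pretrained model
medical_llm_gpu = medical.AutoGGUFModel.pretrained("jsl_meds_ner_q4_v2", "en", "clinical/models")\
    .setInputCols("document")\
    .setOutputCol("completions")\
    .setNPredict(100)\
    .setUseChatTemplate(True)\
    .setTemperature(0)\
    .setNCtx(4096)\
    .setNGpuLayers(100)   # offload up to 100 layers to the GPU; lower this if you hit out-of-memory errors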
MultiChunk2Doc
MultiChunk2Doc annotator merges given chunks to create a document. During document creation, a specific whitelist or blacklist filter can be applied, and case sensitivity can be adjusted. Additionally, specified prefix and suffix texts can be placed before and after the merged chunks in the resulting document, and a separator can be placed between the chunks.
Parameters:
- separator (str): Separator to add between the chunks.
- prefix (str): Prefix to add to the result.
- suffix (str): Suffix to add to the result.
- blackList (list[str]): If defined, list of entities to ignore. The rest will be processed. Do not include the IOB prefix on labels.
- whiteList (list[str]): If defined, list of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.
- caseSensitive (Bool): Determines whether the definitions of the white-listed entities are case sensitive or not.
Input Annotator Types: CHUNK
Output Annotator Type: DOCUMENT
Python API: MultiChunk2Doc | Scala API: MultiChunk2Doc |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("ner_clinical_large_langtest", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
multi_chunk2_doc = medical.MultiChunk2Doc() \
.setInputCols(["ner_chunk"]) \
.setOutputCol("new_document") \
.setWhiteList(["test"]) \
.setCaseSensitive(False) \
.setPrefix("<") \
.setSeparator("><") \
    .setSuffix(">")
nlpPipeline = nlp.Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
multi_chunk2_doc])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
data = spark.createDataFrame([
["""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation
and subsequent type two diabetes mellitus (T2DM),
one prior episode of HTG-induced pancreatitis three years prior to presentation,
and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG.
She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ;
significantly , her abdominal examination was benign with no tenderness, guarding, or rigidity.
Pertinent laboratory findings on admission were: serum glucose 111 mg/dl, creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27."""]
]).toDF("text")
result = model.transform(data)
## Result
+-------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+-------------------------+----------------------------+
|new_doc_result |new_doc_begin|new_doc_end|new_doc_metadata_document|new_doc_metadata_chunk_count|
+-------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+-------------------------+----------------------------+
|<Physical examination> <her abdominal examination> <serum glucose> <creatinine> <triglycerides> <total cholesterol> <venous pH>|0 |126 |0 |7 |
+-------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+-------------------------+----------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner = MedicalNerModel.pretrained("ner_clinical_large_langtest", "en", "clinical/models")
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val multi_chunk2_doc = new MultiChunk2Doc()
.setInputCols("ner_chunk").setOutputCol("new_doc")
.setWhiteList(Array("test"))
.setCaseSensitive(false)
.setPrefix("<")
.setSuffix(">")
.setSeparator("> <")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner,
ner_converter,
multi_chunk2_doc))
val data = Seq("""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting. She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation. Physical examination on presentation was significant for dry oral mucosa; significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl, creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27.""").toDS().toDF("text")
val result = pipeline.fit(data).transform(data)
// Result
+-------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+-------------------------+----------------------------+
|new_doc_result |new_doc_begin|new_doc_end|new_doc_metadata_document|new_doc_metadata_chunk_count|
+-------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+-------------------------+----------------------------+
|<Physical examination> <her abdominal examination> <serum glucose> <creatinine> <triglycerides> <total cholesterol> <venous pH>|0 |126 |0 |7 |
+-------------------------------------------------------------------------------------------------------------------------------+-------------+-----------+-------------------------+----------------------------+
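Conversely, entities can be excluded rather than selected. A minimal sketch follows, assuming a setBlackList setter matching the blackList parameter above and the lowercase entity labels used with setWhiteList in the official example; adjust the labels to the NER model you actually use:
# Hedged sketch: keep everything except the listed entity types
multi_chunk2_doc_blacklist = medical.MultiChunk2Doc() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("new_document") \
    .setBlackList(["problem", "treatment"]) \
    .setCaseSensitive(False) \
    .setSeparator(", ")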
NameChunkObfuscator
NameChunkObfuscator annotator allows transforming a dataset with an input annotation of type CHUNK into its obfuscated version by obfuscating the given CHUNKS. This module can replace name entities with consistent fakers while leaving other entities unchanged.
Obfuscation refers to the process of de-identifying or removing sensitive patient information from clinical notes or other healthcare documents. The purpose of PHI obfuscation is to protect patient privacy and comply with regulations such as the Health Insurance Portability and Accountability Act (HIPAA).
It is important to note that the obfuscation should be done carefully to ensure that the de-identified data cannot be re-identified. Organizations must follow best practices and adhere to applicable regulations to protect patient privacy and maintain data security.
Parameters:
- seed: The seed used to select the names during obfuscation. With the seed, you can replay an execution several times with the same output.
- obfuscateRefSource: Sets the mode to select the obfuscation source ['both', 'faker', 'file']. Default: 'both'.
- language: The language used to select some faker names. The values are the following: 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French) or 'ro' (Romanian). Default: 'en'.
- sameLength: Whether to select replacement names of the same length as the original ones during obfuscation. Example: 'John' -> 'Mike'. Default: true.
- nameEntities: The name entities to select during obfuscation. The supported name entities are NAME, PATIENT, and DOCTOR. Default: 'NAME'.
- genderAwareness: Whether to use gender-aware names or not during obfuscation. This param affects only names. Default: False.
Input Annotator Types: CHUNK
Output Annotator Type: CHUNK
Python API: NameChunkObfuscator | Scala API: NameChunkObfuscator | Notebook: NameChunkObfuscatorNotebook |
Show Example
from johnsnowlabs import medical, nlp
from pyspark.sql import functions as F
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nameChunkObfuscator = medical.NameChunkObfuscator()\
.setInputCols("ner_chunk")\
.setOutputCol("replacement")\
.setObfuscateRefSource("faker")\
.setNameEntities(["DOCTOR", "PATIENT"])\
.setGenderAwareness(True)
replacer_name = medical.Replacer()\
.setInputCols("replacement","sentence")\
.setOutputCol("obfuscated_sentence_name")\
.setUseReplacement(True)
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
nameChunkObfuscator,
replacer_name])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
## sample data
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Patient name : Michael , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer .
'''
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result,
result.obfuscated_sentence_name.result)).alias("cols")) \
.select(F.expr("cols['0']").alias("sentence"),
F.expr("cols['1']").alias("obfuscated_sentence_name"))
| sentence | obfuscated_sentence_name |
| ------------------------------------------------- | --------------------------------------------------------- |
| Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Richardson , M.D . |
| , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Thaxter , MR # 7194334 Date ... |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Adelaida , 25 years-old , Record date : ... |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| Analyzed by Dr. Jennifer . | Analyzed by Dr. Morganne . |
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val nameChunkObfuscator = new NameChunkObfuscator()
.setInputCols("ner_chunk")
.setOutputCol("replacement")
.setObfuscateRefSource("faker")
.setNameEntities(Array("DOCTOR", "PATIENT"))
.setGenderAwareness(true)
val replacer_name = new Replacer()
.setInputCols(Array("replacement","sentence"))
.setOutputCol("obfuscated_sentence_name")
.setUseReplacement(true)
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
nameChunkObfuscator,
replacer_name))
val data = Seq("Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .").toDF("text")
val result = nlpPipeline.fit(data).transform(data)
| sentence | obfuscated_sentence_name |
| ------------------------------------------------- | --------------------------------------------------------- |
| Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-01-13 , Richardson , M.D . |
| , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Thaxter , MR # 7194334 Date ... |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Adelaida , 25 years-old , Record date : ... |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| Analyzed by Dr. Jennifer . | Analyzed by Dr. Morganne . |
NameChunkObfuscator
annotator can be used in de-identification tasks to replace doctor and patient names with fake names taken from a reference document.
Input Annotator Types: CHUNK
Output Annotator Type: CHUNK
Python API: NameChunkObfuscatorApproach | Scala API: NameChunkObfuscatorApproach | Notebook: NameChunkObfuscatorApproachNotebook |
Show Example
from johnsnowlabs import medical, nlp
import pyspark.sql.functions as F
names = """Mitchell-NAME
Clifford-NAME
Jeremiah-NAME
Lawrence-NAME
Brittany-NAME
Patricia-NAME
Jennifer-NAME
Jackson-NAME
Leonard-NAME
Randall-NAME
Camacho-NAME
Ferrell-NAME
Mueller-NAME
Bowman-NAME
Hansen-NAME
"""
with open('names_test2.txt', 'w') as file:
file.write(names)
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
.setInputCols("ner_chunk")\
.setOutputCol("replacement")\
.setObfuscateRefFile("names_test2.txt")\
.setObfuscateRefSource("file")\
.setRefFileFormat("csv")\
.setRefSep("-")
replacer_name = medical.Replacer()\
.setInputCols("replacement","sentence")\
.setOutputCol("obfuscated_sentence_name")\
.setUseReplacement(True)
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
nameChunkObfuscator,
replacer_name])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
## sample data
text ='''
M.D . , Patient name : Michael , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer .
'''
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
result.select(F.explode(F.arrays_zip(result.sentence.result,
result.obfuscated_sentence_name.result)).alias("cols")) \
.select(F.expr("cols['0']").alias("sentence"), F.expr("cols['1']").alias("obfuscated_sentence_name"))
| sentence | obfuscated_sentence_name |
| ------------------------------------------------- | ------------------------------------------------- |
| M.D . | M.D . |
| , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Ferrell , MR # 7194334 Date ... |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Clifford , 25 years-old , Record date : ... |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| Analyzed by Dr. Jennifer . | Analyzed by Dr. Jennifer . |
val names = """Mitchell-NAME
Clifford-NAME
Jeremiah-NAME
Lawrence-NAME
Brittany-NAME
Patricia-NAME
Jennifer-NAME
Jackson-NAME
Leonard-NAME
Randall-NAME
Camacho-NAME
Ferrell-NAME
Mueller-NAME
Bowman-NAME
Hansen-NAME
"""
// Write the reference names to a local file so the obfuscator can read it.
import java.io.PrintWriter
new PrintWriter("names_test2.txt") { write(names); close() }
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val nameChunkObfuscator = new NameChunkObfuscatorApproach()
.setInputCols("ner_chunk")
.setOutputCol("replacement")
.setObfuscateRefFile("names_test2.txt")
.setObfuscateRefSource("file")
.setRefFileFormat("csv")
.setRefSep("-")
val replacer_name = new Replacer()
.setInputCols(Array("replacement","sentence"))
.setOutputCol("obfuscated_sentence_name")
.setUseReplacement(true)
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
nameChunkObfuscator,
replacer_name))
val data = Seq("M.D . , Patient name : Michael , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555. Analyzed by Dr. Jennifer .").toDF("text")
val res = nlpPipeline.fit(data).transform(data)
| sentence | obfuscated_sentence_name |
| ------------------------------------------------- | ------------------------------------------------- |
| M.D . | M.D . |
| , Patient name : Michael , MR # 7194334 Date ... | , Patient name : Ferrell , MR # 7194334 Date ... |
| PCP : Oliveira , 25 years-old , Record date : ... | PCP : Clifford , 25 years-old , Record date : ... |
| Cocke County Baptist Hospital , 0295 Keats Str... | Cocke County Baptist Hospital , 0295 Keats Str... |
| Analyzed by Dr. Jennifer . | Analyzed by Dr. Jennifer . |
NerChunker
Extracts phrases that fit a known pattern using the NER tags. Useful for entity groups with neighboring tokens when there is no pretrained NER model that addresses them directly. A regex over the NER labels needs to be provided to extract the tokens between entities.
Parameter:
setRegexParsers
: Array of grammar based chunk parsers.
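The regex parsers are written over NER labels rather than raw text. A minimal sketch, using the ner_radiology labels from the example below (multiple parsers can be supplied; any matching span becomes a chunk):
from johnsnowlabs import medical
# Each pattern is a regex over NER labels: "<ImagingFindings>.*<BodyPart>"
# captures everything from an ImagingFindings chunk up to the next BodyPart
# chunk within the same sentence.
chunker = (
    medical.NerChunker()
    .setInputCols(["sentence", "ner"])
    .setOutputCol("ner_chunk")
    .setRegexParsers([
        "<ImagingFindings>.*<BodyPart>",   # finding ... body part
        "<BodyPart>.*<ImagingFindings>",   # body part ... finding
    ])
)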
Input Annotator Types: DOCUMENT, NAMED_ENTITY
Output Annotator Type: CHUNK
Python API: NerChunker | Scala API: NerChunker | Notebook: NerChunkerNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Defining pipeline stages for NER
documentAssembler= nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector= nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(False)
tokenizer= nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence","token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
ner = medical.NerModel.pretrained("ner_radiology", "en", "clinical/models") \
.setInputCols(["sentence","token","embeddings"]) \
.setOutputCol("ner") \
.setIncludeConfidence(True)
# Define the NerChunker to combine the NER labels into chunks
chunker = medical.NerChunker() \
.setInputCols(["sentence","ner"]) \
.setOutputCol("ner_chunk") \
.setRegexParsers(["<ImagingFindings>.*<BodyPart>"])
pipeline= nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner,
chunker
])
data= spark.createDataFrame([["She has cystic cyst on her kidney."]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")\
.selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)
+------+-----------------+
|word |ner |
+------+-----------------+
|She |O |
|has |O |
|cystic|B-ImagingFindings|
|cyst |I-ImagingFindings|
|on |O |
|her |O |
|kidney|B-BodyPart |
|. |O |
+------+-----------------+
result.select("ner_chunk.result").show(truncate=False)
+---------------------------+
|result |
+---------------------------+
|[cystic cyst on her kidney]|
+---------------------------+
from johnsnowlabs import nlp, finance
# Defining pipeline stages for NER
documentAssembler= nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector= nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer= nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")\
.setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'", '%', '&'])
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_model = finance.NerModel.pretrained("finner_responsibility_reports_md", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
# Define the NerChunker to combine the NER labels into chunks
chunker = finance.NerChunker() \
.setInputCols(["sentence","ner"]) \
.setOutputCol("ner_chunk") \
.setRegexParsers(["<ENVIRONMENTAL_KPI>.*<AMOUNT>"])
pipeline= nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
chunker
])
data= spark.createDataFrame([["""The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")\
.selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)
+---------+--------------------+
|word |ner |
+---------+--------------------+
|The |O |
|company |O |
|has |O |
|reduced |O |
|its |O |
|direct |B-ENVIRONMENTAL_KPI |
|GHG |I-ENVIRONMENTAL_KPI |
|emissions|I-ENVIRONMENTAL_KPI |
|from |O |
|12,135 |B-AMOUNT |
|million |I-AMOUNT |
|tonnes |B-ENVIRONMENTAL_UNIT|
|of |I-ENVIRONMENTAL_UNIT|
|CO2e |I-ENVIRONMENTAL_UNIT|
|in |O |
|2017 |B-DATE_PERIOD |
|to |O |
|4 |B-AMOUNT |
|million |I-AMOUNT |
|tonnes |B-ENVIRONMENTAL_UNIT|
+---------+--------------------+
result.select("ner_chunk.result").show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million, indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million, scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
# Defining pipeline stages for NER
documentAssembler= nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector= nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer= nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
# Define the NerChunker to combine the NER labels into chunks
chunker = legal.NerChunker() \
.setInputCols(["sentence","ner"]) \
.setOutputCol("ner_chunk") \
.setRegexParsers(["<PERSON>.*<ROLE>"])
pipeline= nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
chunker
])
data= spark.createDataFrame([["""Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"""]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")\
.selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)
+------------+--------+
|word |ner |
+------------+--------+
|Jeffrey |B-PERSON|
|Preston |I-PERSON|
|Bezos |I-PERSON|
|is |O |
|an |O |
|American |O |
|entrepreneur|O |
|, |O |
|founder |B-ROLE |
|and |O |
|CEO |B-ROLE |
|of |O |
|Amazon |B-ORG |
+------------+--------+
result.select("ner_chunk.result").show(truncate=False)
+--------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------+
|[Jeffrey Preston Bezos is an American entrepreneur, founder and CEO]|
+--------------------------------------------------------------------+
import spark.implicits._
// Defining pipeline stages for NER
val data= Seq("She has cystic cyst on her kidney.").toDF("text")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setUseAbbreviations(false)
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
.setCaseSensitive(false)
val ner = MedicalNerModel.pretrained("ner_radiology", "en", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
.setIncludeConfidence(true)
// Define the NerChunker to combine the NER labels into chunks
val chunker = new NerChunker()
.setInputCols(Array("sentence","ner"))
.setOutputCol("ner_chunk")
.setRegexParsers(Array("<ImagingFindings>.<BodyPart>"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner,
chunker
))
val data = Seq(
"""She has cystic cyst on her kidney."""
).toDF("text")
val result = pipeline.fit(data).transform(data)
// Show results:
+------+-----------------+
|word |ner |
+------+-----------------+
|She |O |
|has |O |
|cystic|B-ImagingFindings|
|cyst |I-ImagingFindings|
|on |O |
|her |O |
|kidney|B-BodyPart |
|. |O |
+------+-----------------+
+---------------------------+
|result |
+---------------------------+
|[cystic cyst on her kidney]|
+---------------------------+
import spark.implicits._
// Defining pipeline stages for NER
val documentAssembler= new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer= new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = FinanceNerModel.pretrained("finner_responsibility_reports_md", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
// Define the NerChunker to combine the NER labels into chunks
val chunker = new NerChunker()
.setInputCols(Array("sentence","ner"))
.setOutputCol("ner_chunk")
.setRegexParsers(Array("<ENVIRONMENTAL_KPI>.*<AMOUNT>"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
chunker
))
val data = Seq(
"""The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021."""
).toDF("text")
val result = pipeline.fit(data).transform(data)
// Show results:
+---------+--------------------+
|word |ner |
+---------+--------------------+
|The |O |
|company |O |
|has |O |
|reduced |O |
|its |O |
|direct |B-ENVIRONMENTAL_KPI |
|GHG |I-ENVIRONMENTAL_KPI |
|emissions|I-ENVIRONMENTAL_KPI |
|from |O |
|12,135 |B-AMOUNT |
|million |I-AMOUNT |
|tonnes |B-ENVIRONMENTAL_UNIT|
|of |I-ENVIRONMENTAL_UNIT|
|CO2e |I-ENVIRONMENTAL_UNIT|
|in |O |
|2017 |B-DATE_PERIOD |
|to |O |
|4 |B-AMOUNT |
|million |I-AMOUNT |
|tonnes |B-ENVIRONMENTAL_UNIT|
+---------+--------------------+
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million, indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million, scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million]|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
// Defining pipeline stages for NER
val documentAssembler= new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer= new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = LegalNerModel.pretrained("legner_org_per_role_date", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
// Define the NerChunker to combine the NER labels into chunks
val chunker = new NerChunker()
.setInputCols(Array("sentence","ner"))
.setOutputCol("ner_chunk")
.setRegexParsers(Array("<PERSON>.*<ROLE>"))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
chunker
))
val data = Seq(
"""Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"""
).toDF("text")
val result = pipeline.fit(data).transform(data)
// Show results:
+------------+--------+
|word |ner |
+------------+--------+
|Jeffrey |B-PERSON|
|Preston |I-PERSON|
|Bezos |I-PERSON|
|is |O |
|an |O |
|American |O |
|entrepreneur|O |
|, |O |
|founder |B-ROLE |
|and |O |
|CEO |B-ROLE |
|of |O |
|Amazon |B-ORG |
+------------+--------+
+--------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------+
|[Jeffrey Preston Bezos is an American entrepreneur, founder and CEO]|
+--------------------------------------------------------------------+
NerConverterInternal
Converts an IOB or IOB2 representation of NER to a user-friendly one by associating the tokens of recognized entities with their labels. Chunks with no associated entity (tagged “O”) are filtered out.
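For illustration, a minimal sketch of the conversion (labels taken from the legal example further below):
# Illustration only: how IOB/IOB2 tags are grouped into chunks.
tokens = ["Jeffrey", "Preston", "Bezos", "is", "CEO"]
tags   = ["B-PERSON", "I-PERSON", "I-PERSON", "O", "B-ROLE"]
# After conversion, two chunks remain:
#   "Jeffrey Preston Bezos" -> PERSON
#   "CEO"                   -> ROLE
# The "O"-tagged token ("is") is filtered out.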
Parameters:
-
setThreshold
: Confidence threshold. -
setWhiteList
: If defined, list of entities to process. -
setBlackList
: If defined, list of entities to ignore. -
setReplaceLabels
: If defined, contains a dictionary for entity replacement. -
setPreservePosition
: Whether to preserve the original position of the tokens in the original document or use the modified tokens. -
setReplaceDictResource
: If defined, path to the file containing a dictionary for entity replacement. -
setIgnoreStopWords
: If defined, list of stop words to ignore. -
setGreedyMode
: (Boolean) Whether to ignore B tags for contiguous tokens of the same entity. -
resetSentenceIndices
: Whether to reset sentence indices to treat the entire output as if it originates from a single document. Default: False.
This licensed annotator adds extra functionality to the open-source version by adding the following parameters: blackList
, greedyMode
, threshold
, and ignoreStopWords
that are not available in the NerConverter annotator.
See also Inside–outside–beginning (tagging) for more information.
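A minimal sketch of these licensed-only parameters (the entity labels and the replacement dictionary format are illustrative assumptions; verify against your installed version):
from johnsnowlabs import medical
# Hedged sketch: labels below are illustrative, and the dict format for
# setReplaceLabels is assumed.
ner_converter = (
    medical.NerConverterInternal()
    .setInputCols(["sentence", "token", "ner"])
    .setOutputCol("ner_chunk")
    .setWhiteList(["DRUG", "DOSAGE"])          # keep only these entities
    .setReplaceLabels({"DRUG": "MEDICATION"})  # rename labels on the fly
    .setThreshold(0.7)                         # drop low-confidence chunks
    .setGreedyMode(True)                       # merge contiguous tokens of the same entity
)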
Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY
Output Annotator Type: CHUNK
Python API: NerConverterInternal | Scala API: NerConverterInternal | Notebook: Notebook |
Show Example
from johnsnowlabs import *
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
# Clinical word embeddings trained on PubMED dataset
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
# NER model
nerModel = medical.NerModel.pretrained("ner_posology", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
# NER Converter
nerConverter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages = [document_assembler,
sentence_detector,
tokenizer,
embeddings,
nerModel,
nerConverter
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
sample_text = """The patient was prescribed 1 capsule of Advil for 5 days.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, metformin 1000 mg two times a day.
"""
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = model.transform(data)
result.select('text', 'ner.result', 'ner_chunk.result').show(truncate = 50)
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| text| result| result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|[O, O, O, O, B-DOSAGE, B-FORM, O, B-DRUG, B-DUR...|[1, capsule, Advil, for 5 days, 40 units, insul...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
from johnsnowlabs import *
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")\
.setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'", '%', '&'])
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
# NER model
nerModel = finance.NerModel.pretrained("finner_responsibility_reports_md", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
# NER Converter
nerConverter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages = [document_assembler,
sentence_detector,
tokenizer,
embeddings,
nerModel,
nerConverter
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
sample_text = """The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy, including electricity, heat, steam, and cooling, and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products, and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021.
"""
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = model.transform(data)
result.select('text', 'ner.result', 'ner_chunk.result').show(truncate = 50)
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| text| result| result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The company has reduced its direct GHG emission...|[O, O, O, O, O, B-ENVIRONMENTAL_KPI, I-ENVIRONM...|[direct GHG emissions, 12,135 million, tonnes o...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
from johnsnowlabs import *
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
# NER model
nerModel = legal.NerModel.pretrained("legner_org_per_role_date", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
# NER Converter
nerConverter = legal.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
nlpPipeline = nlp.Pipeline(stages = [document_assembler,
sentence_detector,
tokenizer,
embeddings,
nerModel,
nerConverter
])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = nlpPipeline.fit(empty_data)
sample_text = """Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon
"""
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = model.transform(data)
result.select('text', 'ner.result', 'ner_chunk.result').show(truncate = 50)
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
| text| result| result|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
|Jeffrey Preston Bezos is an American entreprene...|[B-PERSON, I-PERSON, I-PERSON, O, O, O, O, O, B...|[Jeffrey Preston Bezos, founder, CEO, Amazon]|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
import spark.implicits._
// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
// Clinical word embeddings trained on PubMED dataset
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
// NER model
val nerModel = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
// NER Converter
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings,
nerModel,
nerConverter ))
val empty_data = Seq("") .toDF("text")
val model = nlpPipeline.fit(empty_data)
val sample_text = "The patient was prescribed 1 capsule of Advil for 5 days.He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night,12 units of insulin lispro with meals,metformin 1000 mg two times a day."
val data = Seq(sample_text) .toDF("text")
val result = model.transform(data)
result.select("text", "ner.result", "ner_chunk.result").show(20, 50)
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| text| result| result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The patient was prescribed 1 capsule of Advil f...|[O, O, O, O, B-DOSAGE, B-FORM, O, B-DRUG, B-DUR...|[1, capsule, Advil, for 5 days, 40 units, insul...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
import spark.implicits._
// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
.setContextChars(Array(".", ",", ";", ":", "!", "?", "*", "-", "(", ")", "\"", "'", "%", "&"))
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
// NER model
val nerModel = FinanceNerModel.pretrained("finner_responsibility_reports_md","en","finance/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
// NER Converter
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings,
nerModel,
nerConverter ))
val empty_data = Seq("") .toDF("text")
val model = nlpPipeline.fit(empty_data)
val sample_text = "The company has reduced its direct GHG emissions from 12,135 million tonnes of CO2e in 2017 to 4 million tonnes of CO2e in 2021. The indirect GHG emissions (scope 2) are mainly from imported energy,including electricity,heat,steam,and cooling,and the company has reduced its scope 2 emissions from 3 million tonnes of CO2e in 2017-2018 to 4 million tonnes of CO2e in 2020-2021. The scope 3 emissions are mainly from the use of sold products,and the emissions have increased from 377 million tonnes of CO2e in 2017 to 408 million tonnes of CO2e in 2021."
val data = Seq(sample_text) .toDF("text")
val result = model.transform(data)
result.select("text", "ner.result", "ner_chunk.result").show(20, 50)
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| text| result| result|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|The company has reduced its direct GHG emission...|[O, O, O, O, O, B-ENVIRONMENTAL_KPI, I-ENVIRONM...|[direct GHG emissions, 12,135 million, tonnes o...|
+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
import spark.implicits._
// Annotator that transforms a text column from dataframe into an Annotation ready for NLP
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
// Tokenizer splits words in a relevant format for NLP
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
// NER model
val nerModel = LegalNerModel.pretrained("legner_org_per_role_date","en","legal/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
// NER Converter
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings,
nerModel,
nerConverter ))
val empty_data = Seq("") .toDF("text")
val model = nlpPipeline.fit(empty_data)
val sample_text = "Jeffrey Preston Bezos is an American entrepreneur,founder and CEO of Amazon"
val data = Seq(sample_text) .toDF("text")
val result = model.transform(data)
result.select("text", "ner.result", "ner_chunk.result").show(20, 50)
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
| text| result| result|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
|Jeffrey Preston Bezos is an American entreprene...|[B-PERSON, I-PERSON, I-PERSON, O, O, O, O, O, B...|[Jeffrey Preston Bezos, founder, CEO, Amazon]|
+--------------------------------------------------+--------------------------------------------------+---------------------------------------------+
NerDisambiguator
Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. This is the instantiated / pretrained model of the NerDisambiguator.
Parameters:
-
embeddingTypeParam
: (String) ‘bow’ for word embeddings or ‘sentence’ for sentences. -
numFirstChars
: (Int) number of characters to be considered for initial prefix search in the knowledge base. -
tokenSearch
: (BooleanParam) mechanism of search - by token or by chunk in the knowledge base (token is recommended ==> Default value: True). -
narrowWithApproximateMatching
: (BooleanParam) narrow down the prefix search results with Levenshtein distance based matching (True is recommended). -
levenshteinDistanceThresholdParam
: (Float) value of the Levenshtein distance threshold to narrow results from prefix search (default value: 0.1). -
nearMatchingGapParam
: (Int) allows defining a limit on the string length (by trimming the candidate chunks) during Levenshtein distance-based narrowing, {len(candidate) - len(entity chunk) > nearMatchingGap} (default value: 4). -
predictionsLimit
: (BooleanParam) allows to limit the number of predictions N for top N predictions. -
s3KnowledgeBaseName
: (String) the name of the Knowledge Base in S3.
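A minimal configuration sketch of the candidate-search parameters (setter names are assumed to mirror the parameter names above; verify against your installed version):
from johnsnowlabs import medical
# Hedged sketch: only setS3KnowledgeBaseName, setInputCols, setOutputCol and
# setTokenSearch appear in the example below; the remaining setters are
# assumed from the parameter names above.
disambiguator = (
    medical.NerDisambiguator()
    .setS3KnowledgeBaseName("i-per")
    .setInputCols(["ner_chunk", "sentence_embeddings"])
    .setOutputCol("disambiguation")
    .setEmbeddingTypeParam("sentence")           # compare with sentence embeddings
    .setNumFirstChars(4)                         # prefix length for the KB lookup
    .setTokenSearch(False)                       # search by chunk rather than by token
    .setNarrowWithApproximateMatching(True)      # Levenshtein-based narrowing
    .setLevenshteinDistanceThresholdParam(0.1)   # distance threshold
    .setPredictionsLimit(3)                      # keep only the top candidates
)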
Input Annotator Types: CHUNK, SENTENCE_EMBEDDINGS
Output Annotator Type: DISAMBIGUATION
Python API: NerDisambiguatorModel | Scala API: NerDisambiguatorModel | Notebook: NerDisambiguatorModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
.setInputCols(["sentence","embeddings"]) \
.setOutputCol("sentence_embeddings")
ner_model = medical.NerModel.pretrained() \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk") \
.setWhiteList(["PER"])
disambiguator = medical.NerDisambiguator() \
.setS3KnowledgeBaseName("i-per") \
.setInputCols(["ner_chunk", "sentence_embeddings"]) \
.setOutputCol("disambiguation") \
.setTokenSearch(False)
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
sentence_embeddings,
ner_model,
ner_converter,
disambiguator])
text = """The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
# Result
result.selectExpr("explode(disambiguation)") \
.selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, truncate=False)
+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk |result |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump |http://en.wikipedia.org/?curid=55907961, http://en.wikipedia.org/?curid=31698421, http://en.wikipedia.org/?curid=4848272|
|Christina Aguilera|http://en.wikipedia.org/?curid=6636454, http://en.wikipedia.org/?curid=144171 |
+------------------+------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained()
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val sentence_embeddings = new SentenceEmbeddings()
.setInputCols(Array("sentence","embeddings"))
.setOutputCol("sentence_embeddings")
val ner_model = MedicalNerModel.pretrained()
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("PER"))
val disambiguator = new NerDisambiguator()
.setS3KnowledgeBaseName("i-per")
.setInputCols(Array("ner_chunk","sentence_embeddings"))
.setOutputCol("disambiguation")
.setTokenSearch(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
sentence_embeddings,
ner_model,
ner_converter,
disambiguator))
val text = "The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df) .transform(df)
// Result
+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk |result |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump |http://en.wikipedia.org/?curid=55907961, http://en.wikipedia.org/?curid=31698421, http://en.wikipedia.org/?curid=4848272|
|Christina Aguilera|http://en.wikipedia.org/?curid=6636454, http://en.wikipedia.org/?curid=144171 |
+------------------+------------------------------------------------------------------------------------------------------------------------+
Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. The model needs extracted CHUNKS and SENTENCE_EMBEDDINGS type input from e.g. SentenceEmbeddings and NerConverter.
Input Annotator Types: CHUNK, SENTENCE_EMBEDDINGS
Output Annotator Type: DISAMBIGUATION
Python API: NerDisambiguator | Scala API: NerDisambiguator |
NerModel
NerModel
is the Named Entity Recognition (NER) annotator that allows training a generic NER model based on neural networks. The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art results on most datasets.
Note that some pre-trained models require specific types of embeddings, depending on which they were trained.
Parameters:
-
setBatchSize
: (int) number of samples used in one iteration of training (Default:32
). -
setIncludeConfidence
: (Boolean) whether to include confidence scores in annotation metadata (Default
: False). -
setConfigProtoBytes
: (int) ConfigProto from tensorflow, serialized into byte array. -
setIncludeAllConfidenceScores
: (Boolean) whether to include confidence scores for all tags rather than just for the predicted one. -
setMinProbability
: (Float) defines the minimum probability value.
For available pretrained models please see the Models Hub. Additionally, pretrained pipelines are available for this module; see the Pipelines page. For extended examples of usage, see the Spark NLP Workshop.
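A minimal sketch of the inference-time parameters above, applied to a pretrained model (ner_jsl is the model used in the example that follows):
from johnsnowlabs import medical
# setIncludeConfidence is shown in the examples; setBatchSize and
# setIncludeAllConfidenceScores are assumed to mirror the parameters above.
ner_jsl = (
    medical.NerModel.pretrained("ner_jsl", "en", "clinical/models")
    .setInputCols(["sentence", "token", "embeddings"])
    .setOutputCol("ner")
    .setBatchSize(32)                      # samples per inference batch
    .setIncludeConfidence(True)            # add confidence scores to metadata
    .setIncludeAllConfidenceScores(False)  # keep only the predicted tag's score
)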
Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS
Output Annotator Type: NAMED_ENTITY
Python API: MedicalNerModel | Scala API: MedicalNerModel | Notebook: MedicalNerModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
import pyspark.sql.functions as F
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("jsl_ner")
jsl_ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "jsl_ner"]) \
.setOutputCol("ner_chunk")
jsl_ner_pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
jsl_ner,
jsl_ner_converter])
text = '''
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl, creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27.
'''
data = spark.createDataFrame([[text]]).toDF("text")
result = jsl_ner_pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)
+-----------------------------+----------------------------+
|chunk |ner_label |
+-----------------------------+----------------------------+
|28-year-old |Age |
|female |Gender |
|gestational diabetes mellitus|Diabetes |
|eight years prior |RelativeDate |
|type two diabetes mellitus |Diabetes |
|T2DM |Diabetes |
|HTG-induced pancreatitis |Disease_Syndrome_Disorder |
|three years prior |RelativeDate |
|acute |Modifier |
|hepatitis |Disease_Syndrome_Disorder |
|one-week |Duration |
|polyuria |Symptom |
|poor appetite |Symptom |
|vomiting |Symptom |
|She |Gender |
|metformin |Drug_Ingredient |
|glipizide |Drug_Ingredient |
|dapagliflozin |Drug_Ingredient |
|T2DM |Diabetes |
|atorvastatin |Drug_Ingredient |
|gemfibrozil |Drug_Ingredient |
|HTG |Hyperlipidemia |
|She |Gender |
|dapagliflozin |Drug_Ingredient |
|for six months |Duration |
|dry oral mucosa |Symptom |
|her |Gender |
|abdominal |External_body_part_or_region|
|tenderness |Symptom |
|guarding |Symptom |
|rigidity |Symptom |
|admission |Admission_Discharge |
|serum glucose |Test |
|111 mg/dl |Test_Result |
|creatinine |Test |
|0.4 mg/dL |Test_Result |
|triglycerides |Triglycerides |
|508 mg/dL |Test_Result |
|total cholesterol 122 mg/dL |Total_Cholesterol |
|venous pH |Test |
|7.27 |Test_Result |
+-----------------------------+----------------------------+
from johnsnowlabs import nlp, finance
import pyspark.sql.functions as F
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased","en")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_sec_conll", "en", "finance/models") \
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = finance.NerConverterInternal()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter])
text = '''December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender" '''
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)
+--------------------------+---------+
|chunk |ner_label|
+--------------------------+---------+
|SILICIUM DE PROVENCE S.A.S|ORG |
|France |LOC |
|Usine de Saint Auban |LOC |
|France |LOC |
|Mr.Frank Wouters |PER |
|Borrower |PER |
|EVERGREEN SOLAR INC |ORG |
|Delaware |LOC |
|U.S.A |LOC |
|Bartlett Street |LOC |
|Marlboro |LOC |
|Massachusetts |LOC |
|U.S.A |LOC |
|Richard Chleboski |PER |
|Lender |PER |
+--------------------------+---------+
from johnsnowlabs import nlp, legal
import pyspark.sql.functions as F
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")\
ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
text = """EXCLUSIVE DISTRIBUTOR AGREEMENT (" Agreement ") dated as April 15, 1994 by and between IMRS OPERATIONS INC., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as " Developer ") and Delteq Pte Ltd, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD ) with its principal place of business at 215 Henderson Road , #101-03 Henderson Industrial Park , Singapore 0315 ( hereinafter referred to as " Distributor ")."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label")).show(100, truncate=False)
+-------------------------------+---------+
|chunk |ner_label|
+-------------------------------+---------+
|EXCLUSIVE DISTRIBUTOR AGREEMENT|DOC |
|April 15, 1994 |EFFDATE |
|IMRS OPERATIONS INC |PARTY |
|Developer |ALIAS |
|Delteq Pte Ltd |PARTY |
|Distributor |ALIAS |
+-------------------------------+---------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val jslNer = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("jsl_ner")
val jslNerConverter = new NerConverter()
.setInputCols(Array("sentence", "token", "jsl_ner"))
.setOutputCol("ner_chunk")
val jslNerPipeline = new Pipeline()
.setStages(Array(documentAssembler,
sentenceDetector,
tokenizer,
wordEmbeddings,
jslNer,
jslNerConverter))
val text = "A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl, creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27."""
val data = Seq(text).toDF("text")
val result = jslNerPipeline.fit(data).transform(data)
+-----------------------------+----------------------------+
|chunk |ner_label |
+-----------------------------+----------------------------+
|28-year-old |Age |
|female |Gender |
|gestational diabetes mellitus|Diabetes |
|eight years prior |RelativeDate |
|type two diabetes mellitus |Diabetes |
|T2DM |Diabetes |
|HTG-induced pancreatitis |Disease_Syndrome_Disorder |
|three years prior |RelativeDate |
|acute |Modifier |
|hepatitis |Disease_Syndrome_Disorder |
|one-week |Duration |
|polyuria |Symptom |
|poor appetite |Symptom |
|vomiting |Symptom |
|She |Gender |
|metformin |Drug_Ingredient |
|glipizide |Drug_Ingredient |
|dapagliflozin |Drug_Ingredient |
|T2DM |Diabetes |
|atorvastatin |Drug_Ingredient |
|gemfibrozil |Drug_Ingredient |
|HTG |Hyperlipidemia |
|She |Gender |
|dapagliflozin |Drug_Ingredient |
|for six months |Duration |
|dry oral mucosa |Symptom |
|her |Gender |
|abdominal |External_body_part_or_region|
|tenderness |Symptom |
|guarding |Symptom |
|rigidity |Symptom |
|admission |Admission_Discharge |
|serum glucose |Test |
|111 mg/dl |Test_Result |
|creatinine |Test |
|0.4 mg/dL |Test_Result |
|triglycerides |Triglycerides |
|508 mg/dL |Test_Result |
|total cholesterol 122 mg/dL |Total_Cholesterol |
|venous pH |Test |
|7.27 |Test_Result |
+-----------------------------+----------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_legal_bert_base_uncased", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val nerModel = FinanceNerModel.pretrained("finner_sec_conll", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter))
val text = """December 2007 SUBORDINATED LOAN AGREEMENT. THIS LOAN AGREEMENT is made on 7th December, 2007 BETWEEN: (1) SILICIUM DE PROVENCE S.A.S., a private company with limited liability, incorporated under the laws of France, whose registered office is situated at Usine de Saint Auban, France, represented by Mr.Frank Wouters, hereinafter referred to as the "Borrower", and ( 2 ) EVERGREEN SOLAR INC., a company incorporated in Delaware, U.S.A., with registered number 2426798, whose registered office is situated at Bartlett Street, Marlboro, Massachusetts, U.S.A. represented by Richard Chleboski, hereinafter referred to as "Lender" """
val data = Seq((text)).toDF("text")
val result = pipeline.fit(data).transform(data)
+--------------------------+---------+
|chunk |ner_label|
+--------------------------+---------+
|SILICIUM DE PROVENCE S.A.S|ORG |
|France |LOC |
|Usine de Saint Auban |LOC |
|France |LOC |
|Mr.Frank Wouters |PER |
|Borrower |PER |
|EVERGREEN SOLAR INC |ORG |
|Delaware |LOC |
|U.S.A |LOC |
|Bartlett Street |LOC |
|Marlboro |LOC |
|Massachusetts |LOC |
|U.S.A |LOC |
|Richard Chleboski |PER |
|Lender |PER |
+--------------------------+---------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val nerModel = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter))
val text = """EXCLUSIVE DISTRIBUTOR AGREEMENT ("Agreement") dated as April 15, 1994 by and between IMRS OPERATIONS INC., a Delaware corporation with its principal place of business at 777 Long Ridge Road, Stamford, Connecticut 06902, U.S.A. (hereinafter referred to as "Developer") and Delteq Pte Ltd, a Singapore company (and a subsidiary of Wuthelam Industries (S) Pte LTD) with its principal place of business at 215 Henderson Road, #101-03 Henderson Industrial Park, Singapore 0315 (hereinafter referred to as "Distributor")."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+-------------------------------+---------+
|chunk |ner_label|
+-------------------------------+---------+
|EXCLUSIVE DISTRIBUTOR AGREEMENT|DOC |
|April 15, 1994 |EFFDATE |
|IMRS OPERATIONS INC |PARTY |
|Developer |ALIAS |
|Delteq Pte Ltd |PARTY |
|Distributor |ALIAS |
+-------------------------------+---------+
This Named Entity Recognition annotator allows you to train a generic NER model based on neural networks.
The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art results on most datasets.
For instantiated/pretrained models, see NerDLModel.
The training data should be a labeled Spark Dataset in CoNLL 2003 IOB format with Annotation type columns: the data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY (a short sketch of the file format is shown below).
Excluding the label column, these inputs can be produced with, for example:
- a SentenceDetector,
- a Tokenizer and
- a WordEmbeddingsModel with clinical embeddings (any clinical word embeddings can be chosen).
For extended examples of usage, see the Spark NLP Workshop (sections starting with Training a Clinical NER).
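For orientation, the snippet below is a minimal sketch of what a CoNLL 2003 IOB training file looks like and how it is read into the required columns with the open-source CoNLL reader; the file path and the example labels are hypothetical.
from sparknlp.training import CoNLL

# A CoNLL IOB file holds one token per line ("token POS chunk ner_label"),
# with blank lines separating sentences, e.g.:
#
#   She       PRP  B-NP  O
#   was       VBD  B-VP  O
#   on        IN   B-PP  O
#   metformin NN   B-NP  B-Drug
#
# CoNLL().readDataset produces the DOCUMENT, TOKEN, POS and NAMED_ENTITY ("label")
# columns expected by the NER approaches shown below.
conll = CoNLL()
training_data = conll.readDataset(spark, "path/to/train.conll")  # hypothetical path
training_data.selectExpr("text", "label.result").show(3, truncate=60)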
Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS
Output Annotator Type: NAMED_ENTITY
Python API: MedicalNerApproach | Scala API: MedicalNerApproach | Notebook: MedicalNerApproachNotebook |
Show Example
from johnsnowlabs import nlp, medical
# First extract the prerequisites for the NerDLApproach
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
# Then the training can start
nerTagger = medical.NerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(2)\
.setBatchSize(64)\
.setRandomSeed(0)\
.setVerbose(1)\
.setValidationSplit(0.2)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setOutputLogsPath('ner_logs')\
.setGraphFolder('medical_ner_graphs')\
.setEnableMemoryOptimizer(True) #>> if you have a limited memory and a large conll file, you can set this True to train batch by batch
pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
clinical_embeddings,
nerTagger
])
# We use the text and labels from the CoNLL dataset
from sparknlp.training import CoNLL
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
pipelineModel = pipeline.fit(trainingData)
from johnsnowlabs import nlp, finance
# First extract the prerequisites for the NerDLApproach
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
# Then the training can start
nerTagger = finance.NerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(2)\
.setBatchSize(64)\
.setRandomSeed(0)\
.setVerbose(1)\
.setValidationSplit(0.2)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setOutputLogsPath('ner_logs')\
.setGraphFolder('medical_ner_graphs')\
.setEnableMemoryOptimizer(True) #>> if you have a limited memory and a large conll file, you can set this True to train batch by batch
pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
clinical_embeddings,
nerTagger
])
from johnsnowlabs import nlp, legal
# First extract the prerequisites for the NerDLApproach
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentence = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
clinical_embeddings = nlp.WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
# Then the training can start
nerTagger = legal.NerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(2)\
.setBatchSize(64)\
.setRandomSeed(0)\
.setVerbose(1)\
.setValidationSplit(0.2)\
.setEvaluationLogExtended(True) \
.setEnableOutputLogs(True)\
.setIncludeConfidence(True)\
.setOutputLogsPath('ner_logs')\
.setGraphFolder('medical_ner_graphs')\
.setEnableMemoryOptimizer(True) #>> if you have a limited memory and a large conll file, you can set this True to train batch by batch
pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentence,
tokenizer,
clinical_embeddings,
nerTagger
])
import spark.implicits._
// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
// Then the training can start
val nerTagger = new MedicalNerApproach()
.setInputCols(Array("sentence", "token", "embeddings"))
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(5)
.setLr(0.003f)
.setBatchSize(8)
.setRandomSeed(0)
.setVerbose(1)
.setEvaluationLogExtended(false)
.setEnableOutputLogs(false)
.setIncludeConfidence(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
))
// We use the text and labels from the CoNLL dataset
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
val pipelineModel = pipeline.fit(trainingData)
import spark.implicits._
// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
// Then the training can start
val nerTagger = new FinanceNerApproach()
.setInputCols(Array("sentence", "token", "embeddings"))
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(5)
.setLr(0.003f)
.setBatchSize(8)
.setRandomSeed(0)
.setVerbose(1)
.setEvaluationLogExtended(false)
.setEnableOutputLogs(false)
.setIncludeConfidence(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
))
import spark.implicits._
// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = WordEmbeddingsModel
  .pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
// Then the training can start
val nerTagger = new LegalNerApproach()
.setInputCols(Array("sentence", "token", "embeddings"))
.setLabelColumn("label")
.setOutputCol("ner")
.setMaxEpochs(5)
.setLr(0.003f)
.setBatchSize(8)
.setRandomSeed(0)
.setVerbose(1)
.setEvaluationLogExtended(false)
.setEnableOutputLogs(false)
.setIncludeConfidence(true)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
tokenizer,
embeddings,
nerTagger
))
NerQuestionGenerator
NerQuestionGenerator takes an NER chunk (obtained by, e.g., NerConverterInternal) and generates questions based on two entity types, a pronoun and a strategy.
The question is generated in the form [QUESTIONPRONOUN] [ENTITY1] [ENTITY2] [QUESTIONMARK]. The generated question can be used by the QuestionAnswering or ZeroShotNer annotators to answer the question or find NER entities.
Parameters:
- questionPronoun: Pronoun to be used in the question, e.g. 'When', 'Where', 'Why', 'How', 'Who', 'What'.
- strategyType: Strategy for the process, either Paired (default) or Combined (illustrated in the sketch after this list).
- questionMark: Whether to add a question mark at the end of the question.
- entities1: List with the entity types of entities that appear first in the question.
- entities2: List with the entity types of entities that appear second in the question.
All the parameters can be set using the corresponding set method in camel case, for example .setQuestionMark(True).
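The difference between the Paired and Combined strategies can be illustrated with plain Python; this is only an illustration of the pairing logic, not the annotator's internal code.
# Minimal sketch of how questions are assembled from two entity chunk lists.
# "Paired" matches chunks position by position; "Combined" takes every cross pair.
def build_questions(pronoun, entities1, entities2, strategy="Paired", question_mark=True):
    mark = " ?" if question_mark else ""
    if strategy == "Paired":
        pairs = zip(entities1, entities2)
    else:  # "Combined"
        pairs = [(e1, e2) for e1 in entities1 for e2 in entities2]
    return [f"{pronoun} {e1} {e2}{mark}" for e1, e2 in pairs]

print(build_questions("How is", ["John's"], ["vital signs", "heartbeat"], strategy="Combined"))
# ["How is John's vital signs ?", "How is John's heartbeat ?"]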
Input Annotator Types: CHUNK
Output Annotator Type: DOCUMENT
Python API: NerQuestionGenerator | Scala API: NerQuestionGenerator | Notebook: NerQuestionGeneratorNotebook |
Show Example
from johnsnowlabs import nlp, medical
import json
entities = [
{
"label": "Person",
"patterns": ["Jon", "John", "John's"]
},
{
"label": "Organization",
"patterns": ["St. Mary's Hospital", "St. Mary's"]
},
{
"label": "Condition",
"patterns": ["vital signs", "heartbeat", "oxygen saturation levels"]
}
]
with open('./entities.json', 'w') as jsonfile:
json.dump(entities, jsonfile)
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
entity_ruler = nlp.EntityRulerApproach() \
.setInputCols(["document"]) \
.setOutputCol("entity") \
.setPatternsResource("./entities.json")\
.setCaseSensitive(False)
qagenerator = medical.NerQuestionGenerator()\
.setInputCols(["entity"])\
.setOutputCol("question")\
.setQuestionPronoun("How is")\
.setEntities1(["Person"])\
.setEntities2(["Condition"])\
.setStrategyType("Paired")\
.setQuestionMark(True)
prep_pipeline = nlp.Pipeline(stages=[
document_assembler,
entity_ruler,
qagenerator
])
example_text = """At St. Mary's Hospital, the healthcare team closely monitored John's vital signs with unwavering attention. They recorded his heartbeat and oxygen saturation levels, promptly addressing any deviations from normal. Their dedication and expertise at St. Mary's played a vital role in ensuring John's stability and fostering a swift recovery."""
df = spark.createDataFrame([[example_text]]).toDF("text")
result = prep_pipeline.fit(df).transform(df)
result.select("question").show(truncate=False)
## Result
+--------------------------------------------------------------------------------------------------------------------------------------------+
|question |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 62, 79, How is John's vital signs ?, {sentence -> 0}, []}, {document, 291, 134, How is John's heartbeat ?, {sentence -> 0}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------+
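Since the generated question column is of annotator type DOCUMENT, it can in principle be fed, together with the original document, into a question-answering stage. The sketch below reuses the stages defined above and, as an assumption, the medical_qa_biogpt model described in the QuestionAnswering section; the exact wiring is illustrative, not something prescribed by NerQuestionGenerator.
# Hedged sketch: answer the generated questions against the original document.
# "question" (generated above) and "document" are both DOCUMENT-type columns.
qa = medical.MedicalQuestionAnswering.pretrained("medical_qa_biogpt", "en", "clinical/models")\
    .setInputCols(["question", "document"])\
    .setOutputCol("answer")\
    .setMaxNewTokens(30)\
    .setQuestionType("long")

qa_pipeline = nlp.Pipeline(stages=[
    document_assembler,
    entity_ruler,
    qagenerator,
    qa
])

qa_pipeline.fit(df).transform(df).select("question.result", "answer.result").show(truncate=False)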
import spark.implicits._
/* entities.json file
entities = [
{
"label": "Person",
"patterns": ["Jon", "John", "John's"]
},
{
"label": "Organization",
"patterns": ["St. Mary's Hospital", "St. Mary's"]
},
{
"label": "Condition",
"patterns": ["vital signs", "heartbeat", "oxygen saturation levels"]
}
]
*/
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val entity_ruler = new EntityRulerApproach()
.setInputCols("document")
.setOutputCol("entity")
.setPatternsResource("./entities.json")
.setCaseSensitive(false)
val qagenerator = new NerQuestionGenerator()
.setInputCols("entity")
.setOutputCol("question")
.setQuestionPronoun("How is")
.setEntities1("Person")
.setEntities2("Condition")
.setStrategyType("Paired")
.setQuestionMark(true)
val prep_pipeline = new Pipeline().setStages(Array(
document_assembler,
entity_ruler,
qagenerator ))
val test_data = Seq("""At St. Mary's Hospital, the healthcare team closely monitored John's vital signs with unwavering attention. They recorded his heartbeat and oxygen saturation levels, promptly addressing any deviations from normal. Their dedication and expertise at St. Mary's played a vital role in ensuring John's stability and fostering a swift recovery.""").toDF("text")
val res = prep_pipeline.fit(test_data).transform(test_data)
// Show results
+--------------------------------------------------------------------------------------------------------------------------------------------+
|question |
+--------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 62, 79, How is John's vital signs ?, {sentence -> 0}, []}, {document, 291, 134, How is John's heartbeat ?, {sentence -> 0}, []}]|
+--------------------------------------------------------------------------------------------------------------------------------------------+
PosologyREModel
Instantiated RelationExtractionModel for extracting relationships between recognized drug entities. This class is not intended to be used directly; please use RelationExtractionModel instead. Possible relation values are “DRUG-DOSAGE”, “DRUG-ADE”, “DRUG-FORM”, “DRUG-FREQUENCY”, “DRUG-ROUTE”, “DRUG-REASON”, “DRUG-STRENGTH”, “DRUG-DURATION”.
Input Annotator Types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY
Output Annotator Type: CATEGORY
Python API: RelationExtractionModel | Scala API: RelationExtractionModel | Notebook: RelationExtractionModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
documenter = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = nlp.WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = nlp.PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
ner_tagger = medical.NerModel()\
.pretrained("ner_posology", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_chunker = medical.NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = nlp.DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
reModel = medical.RelationExtractionModel()\
.pretrained("posology_re")\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setMaxSyntacticDistance(4)
pipeline = nlp.Pipeline(stages=[
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_chunker,
dependency_parser,
reModel
])
text = """
The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also
given 1 unit of Metformin daily.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night ,
12 units of insulin lispro with meals , and metformin 1000 mg two times a day.
"""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
# Show results
result.select(F.explode(F.arrays_zip(
result.relations.result,
result.relations.metadata)).alias("cols"))\
.select(
F.expr("cols['1']['chunk1']").alias("chunk1"),
F.expr("cols['1']['chunk2']").alias("chunk2"),
F.expr("cols['1']['entity1']").alias("entity1"),
F.expr("cols['1']['entity2']").alias("entity2"),
F.expr("cols['0']").alias("relations"),
F.expr("cols['1']['confidence']").alias("confidence")).show(5, truncate=False)
+---------+----------------+-------+---------+--------------+----------+
|chunk1 |chunk2 |entity1|entity2 |relations |confidence|
+---------+----------------+-------+---------+--------------+----------+
|1 unit |Advil |DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
|Advil |for 5 days |DRUG |DURATION |DRUG-DURATION |1.0 |
|1 unit |Metformin |DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
|Metformin|daily |DRUG |FREQUENCY|DRUG-FREQUENCY|1.0 |
|40 units |insulin glargine|DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
+---------+----------------+-------+---------+--------------+----------+
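As a follow-up to the Python example above, the relations column can be post-processed with ordinary DataFrame operations, for instance to keep only dosage-to-drug links; a minimal sketch:
import pyspark.sql.functions as F

# Keep only DOSAGE-DRUG relations from the result of the pipeline above.
rels = result.select(F.explode(F.arrays_zip(
        result.relations.result,
        result.relations.metadata)).alias("cols"))\
    .select(
        F.expr("cols['0']").alias("relation"),
        F.expr("cols['1']['chunk1']").alias("chunk1"),
        F.expr("cols['1']['chunk2']").alias("chunk2"))

rels.filter(F.col("relation") == "DOSAGE-DRUG").show(truncate=False)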
import spark.implicits._
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentences","tokens"))
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel.pretrained("pos_clinical","en","clinical/models")
.setInputCols(Array("sentences","tokens"))
.setOutputCol("pos_tags")
val ner_tagger = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
.setInputCols("sentences","tokens","embeddings")
.setOutputCol("ner_tags")
val ner_chunker = new NerConverterInternal()
.setInputCols(Array("sentences","tokens","ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu","en")
.setInputCols(Array("sentences","pos_tags","tokens"))
.setOutputCol("dependencies")
val reModel = RelationExtractionModel.pretrained("posology_re")
.setInputCols(Array("embeddings","pos_tags","ner_chunks","dependencies"))
.setOutputCol("relations")
.setMaxSyntacticDistance(4)
val pipeline = new Pipeline().setStages(Array(
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_chunker,
dependency_parser,
reModel ))
val text = " The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also given 1 unit of Metformin daily. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals ,and metformin 1000 mg two times a day. "
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df) .transform(df)
// Show results
+---------+----------------+-------+---------+--------------+----------+
|chunk1 |chunk2 |entity1|entity2 |relations |confidence|
+---------+----------------+-------+---------+--------------+----------+
|1 unit |Advil |DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
|Advil |for 5 days |DRUG |DURATION |DRUG-DURATION |1.0 |
|1 unit |Metformin |DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
|Metformin|daily |DRUG |FREQUENCY|DRUG-FREQUENCY|1.0 |
|40 units |insulin glargine|DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
+---------+----------------+-------+---------+--------------+----------+
PretrainedZeroShotNER
Pretrained Zero-shot Named Entity Recognition (NER) makes it easy to identify specific entities in text without needing pre-labeled datasets. It uses pre-trained language models to recognize entities in different fields and languages, saving time and effort.
This method is flexible, letting you define your own entity labels instead of relying on a fixed set of examples. For the best results, it's helpful to choose labels similar to the provided examples, as they guide the model's understanding (see the sketch after the parameter list).
Parameters:
- labels: A list of labels describing the entities, for example ["person", "location"].
- predictionThreshold: Minimal confidence score to encode an entity (Default: 0.01f).
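Because the labels are free-form strings, the same pretrained checkpoint can be pointed at a different label set without retraining; how well this works depends on how close the new labels are to what the model has seen. A minimal sketch with an assumed alternative label list (the model name is taken from the example below):
from johnsnowlabs import medical

# Assumed, user-defined label set; only the labels differ from the example below.
alt_labels = ["NAME", "LOCATION", "CONTACT", "ID", "DATE", "PROFESSION"]

zero_shot_ner = medical.PretrainedZeroShotNER()\
    .pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setPredictionThreshold(0.5)\
    .setLabels(alt_labels)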
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: NAMED_ENTITY
Python API: PretrainedZeroShotNER | Scala API: PretrainedZeroShotNER | Notebook: PretrainedZeroShotNER |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
labels = ['DOCTOR', 'PATIENT', 'AGE', 'DATE', 'HOSPITAL', 'CITY', 'STREET', 'STATE', 'COUNTRY', 'PHONE', 'IDNUM', 'EMAIL','ZIP', 'ORGANIZATION', 'PROFESSION', 'USERNAME']
pretrained_zero_shot_ner = medical.PretrainedZeroShotNER().pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models")\
.setInputCols("sentence", "token")\
.setOutputCol("ner")\
.setPredictionThreshold(0.5)\
.setLabels(labels)
ner_converter = medical.NerConverterInternal()\
.setInputCols("sentence", "token", "ner")\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline().setStages([
document_assembler,
sentence_detector,
tokenizer,
pretrained_zero_shot_ner,
ner_converter
])
text = """Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.
The patient’s medical record number is 56467890. The patient, Emma Wilson, is 50 years old, her Contact number: 444-456-7890 .
Dr. John Taylor, ID: 982345, a cardiologist at St. Mary's Hospital in Boston, was contacted on 05/10/2023 regarding a 45-year-old.
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
# result
+--------------------+-----+---+----------+
|chunk |begin|end|ner_label |
+--------------------+-----+---+----------+
|John Lee |4 |11 |DOCTOR |
|Royal Medical Clinic|19 |38 |HOSPITAL |
|Chicago |43 |49 |CITY |
|11/05/2024 |80 |89 |DATE |
|56467890 |131 |138|IDNUM |
|Emma Wilson |154 |164|PATIENT |
|50 |170 |171|AGE |
|444-456-7890 |205 |216|PHONE |
|John Taylor |224 |234|DOCTOR |
|982345 |241 |246|IDNUM |
|cardiologist |251 |262|PROFESSION|
|St. Mary's Hospital |267 |285|HOSPITAL |
|Boston |290 |295|CITY |
|05/10/2023 |315 |324|DATE |
|45-year-old |338 |348|AGE |
+--------------------+-----+---+----------+
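For quick, single-string inference the fitted pipeline from the example above can be wrapped in a LightPipeline; this is standard Spark NLP usage rather than anything specific to this annotator.
# Annotate one string without building a DataFrame.
light_model = nlp.LightPipeline(pipeline.fit(data))
annotations = light_model.annotate(
    "Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024."
)
print(annotations["ner_chunk"])
# Expected, based on the table above: ['John Lee', 'Royal Medical Clinic', 'Chicago', '11/05/2024']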
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val labels = Array(
"DOCTOR", "PATIENT", "AGE", "DATE", "HOSPITAL", "CITY", "STREET",
"STATE", "COUNTRY", "PHONE", "IDNUM", "EMAIL", "ZIP",
"ORGANIZATION", "PROFESSION", "USERNAME"
)
val pretrainedZeroShotNer = PretrainedZeroShotNER
.pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
.setPredictionThreshold(0.5.toFloat)
.setLabels(labels)
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
pretrainedZeroShotNer,
nerConverter
))
val text = """Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.
The patient’s medical record number is 56467890. The patient, Emma Wilson, is 50 years old, her Contact number: 444-456-7890.
Dr. John Taylor, ID: 982345, a cardiologist at St. Mary's Hospital in Boston, was contacted on 05/10/2023 regarding a 45-year-old.
"""
val data = Seq(text).toDF("text")
val pipelineModel = pipeline.fit(data)
val result = pipelineModel.transform(data)
// result
+--------------------+-----+---+----------+
|chunk |begin|end|ner_label |
+--------------------+-----+---+----------+
|John Lee |4 |11 |DOCTOR |
|Royal Medical Clinic|19 |38 |HOSPITAL |
|Chicago |43 |49 |CITY |
|11/05/2024 |80 |89 |DATE |
|56467890 |131 |138|IDNUM |
|Emma Wilson |154 |164|PATIENT |
|50 |170 |171|AGE |
|444-456-7890 |205 |216|PHONE |
|John Taylor |224 |234|DOCTOR |
|982345 |241 |246|IDNUM |
|cardiologist |251 |262|PROFESSION|
|St. Mary's Hospital |267 |285|HOSPITAL |
|Boston |290 |295|CITY |
|05/10/2023 |315 |324|DATE |
|45-year-old |338 |348|AGE |
+--------------------+-----+---+----------+
QuestionAnswering
QuestionAnswering is a GPT-based model for answering questions given a context. Unlike span-based models, it generates the answers to the questions rather than selecting phrases from the given context. The model is capable of answering various types of questions, including yes/no and full-text ones. Two question types are supported: "short" (producing yes/no/maybe answers) and "long" (producing full answers).
Parameters:
- questionType: Question type, e.g. "short" or "long". The available question types depend on the model.
- maxNewTokens: Maximum number of new tokens to generate, by default 30.
- maxContextLength: Maximum length of the context text.
- configProtoBytes: ConfigProto from TensorFlow, serialized into a byte array.
- doSample: Whether or not to use sampling; use greedy decoding otherwise, by default False.
- topK: The number of highest-probability vocabulary tokens to consider, by default 1.
- noRepeatNgramSize: The number of tokens that can't be repeated in the same order. Useful for preventing loops. The default is 0.
- ignoreTokenIds: A list of token ids which are ignored in the decoder's output, by default [].
- randomSeed: Set to a positive integer to get reproducible results, by default None.
- customPrompt: Custom prompt template. Available variables: {QUESTION} and {CONTEXT}.
Available models can be found at the Models Hub.
For more extended examples on document pre-processing, see the Spark NLP Workshop.
Input Annotator Types: DOCUMENT, DOCUMENT
Output Annotator Type: CHUNK
Python API: MedicalQuestionAnswering | Scala API: MedicalQuestionAnswering |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.MultiDocumentAssembler()\
.setInputCols("question", "context")\
.setOutputCols("document_question", "document_context")
med_qa = medical.MedicalQuestionAnswering.pretrained("medical_qa_biogpt","en","clinical/models")\
.setInputCols(["document_question", "document_context"])\
.setOutputCol("answer")\
.setMaxNewTokens(30)\
.setTopK(1)\
.setQuestionType("long") # "short"
pipeline = nlp.Pipeline(stages=[document_assembler, med_qa])
paper_abstract = [
"In patients with Los Angeles (LA) grade C or D oesophagitis, a positive relationship has been established between the duration of intragastric acid suppression and healing.AIM: To determine whether there is an apparent optimal time of intragastric acid suppression for maximal healing of reflux oesophagitis. Post hoc analysis of data from a proof-of-concept, double-blind, randomized study of 134 adult patients treated with esomeprazole (10 or 40 mg od for 4 weeks) for LA grade C or D oesophagitis. A curve was fitted to pooled 24-h intragastric pH (day 5) and endoscopically assessed healing (4 weeks) data using piecewise quadratic logistic regression. Maximal reflux oesophagitis healing rates were achieved when intragastric pH>4 was achieved for approximately 50-70% (12-17 h) of the 24-h period. Acid suppression above this threshold did not yield further increases in healing rates."
]
question = ["Is there an optimal time of acid suppression for maximal healing?"]
data = spark.createDataFrame([ [paper_abstract[0], question[0]] ]).toDF("context","question")
data.show(truncate = 60)
+------------------------------------------------------------+------------------------------------------------------------+
| context| question|
+------------------------------------------------------------+------------------------------------------------------------+
|In patients with Los Angeles (LA) grade C or D oesophagit...|Is there an optimal time of acid suppression for maximal ...|
+------------------------------------------------------------+------------------------------------------------------------+
result = pipeline.fit(data).transform(data)
result.selectExpr("document_question.result as Question", "answer.result as Long_Answer").show(truncate=False)
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|Question |Long_Answer |
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|[Is there an optimal time of acid suppression for maximal healing?]|[in patients with reflux oesophagitis, maximal healing rates are obtained when intragastric pH is achieved for approximately 50 - 70 % ( 12 - 17 h )]|
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
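As a follow-up, the same model can be switched to the "short" question type, which returns yes/no/maybe style answers; a minimal sketch reusing the document assembler and data defined above (the answer text itself is not shown here):
# Hedged sketch: "short" mode yields yes/no/maybe style answers for the same question.
med_qa_short = medical.MedicalQuestionAnswering.pretrained("medical_qa_biogpt", "en", "clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setOutputCol("answer")\
    .setMaxNewTokens(30)\
    .setTopK(1)\
    .setQuestionType("short")

short_pipeline = nlp.Pipeline(stages=[document_assembler, med_qa_short])
short_pipeline.fit(data).transform(data)\
    .selectExpr("document_question.result as Question", "answer.result as Short_Answer")\
    .show(truncate=False)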
from johnsnowlabs import nlp, finance
document_assembler = nlp.MultiDocumentAssembler()\
.setInputCols("question", "context")\
.setOutputCols("document_question", "document_context")
fin_qa = finance.QuestionAnswering.pretrained("finqa_flant5_finetuned","en","finance/models")\
.setInputCols(["document_question", "document_context"])\
.setCustomPrompt("question: {QUESTION} context: {CONTEXT}")\
.setMaxNewTokens(100)\
.setOutputCol("answer")
pipeline = nlp.Pipeline(stages=[document_assembler, fin_qa])
context = """EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""
questions = ["""Which company is referred to as 'Principal' in the Distributor Agreement?""",
"""What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?""",
"""What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"""]
data = spark.createDataFrame(
[
[context, questions[0]],
[context, questions[1]],
[context, questions[2]],
]
).toDF("context","question")
data.show(truncate = 80)
+------------------------------------------------------------------+--------------------------------------------------------------------------------+
| question| context|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+
| What are the key components of the business strategy described?|Our business strategy has been to develop data processing and product technol...|
|What is the immediate strategy for scaling the IntentKey platform?|Our business strategy has been to develop data processing and product technol...|
|How does the company aim to provide differentiation in the market?|Our business strategy has been to develop data processing and product technol...|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+
result = pipeline.fit(data).transform(data)
result.select('question', 'answer.result').show(truncate=False)
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question |result |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|What are the key components of the business strategy described? |[The key components of the business strategy described are proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. . . ]|
|What is the immediate strategy for scaling the IntentKey platform?|[The immediate strategy for scaling the IntentKey platform is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. ] |
|How does the company aim to provide differentiation in the market?|[The company aims to provide differentiation through the AI analytics and data products they own and protect through patents. ] |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
context = ["""EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""]
questions = ["""Which company is referred to as 'Principal' in the Distributor Agreement?""",
"""What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?""",
"""What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"""]
data = spark.createDataFrame(
[
[context[0], questions[0]],
[context[0], questions[1]],
[context[0], questions[2]],
]
).toDF("context","question")
data.show(truncate = 80)
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
| context| question|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...| Which company is referred to as 'Principal' in the Distributor Agreement?|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the date of the distributor agreement between Co-Diagnostics, Inc. an...|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the Territory in which the Distributor has non-exclusive rights to se...|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
document_assembler = nlp.MultiDocumentAssembler()\
.setInputCols("question", "context")\
.setOutputCols("document_question", "document_context")
leg_qa = legal.QuestionAnswering.pretrained("legqa_flant5_finetuned","en","legal/models")\
.setInputCols(["document_question", "document_context"])\
.setCustomPrompt("question: {QUESTION} context: {CONTEXT}")\
.setMaxNewTokens(40)\
.setTopK(3)\
.setOutputCol("answer")
pipeline = nlp.Pipeline(stages=[document_assembler, leg_qa])
result = pipeline.fit(data).transform(data)
result.selectExpr("document_question.result as Question", "answer.result as Answer").show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|Question |Answer |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|[Which company is referred to as 'Principal' in the Distributor Agreement?] |[Co-Diagnostics, Inc. is referred to as 'Principal' in the Distributor Agreement. ] |
|[What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?] |[The date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc. is the 19th day of March, 2020. ]|
|[What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?]|[The Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement is Romania. ] |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val medQA = MedicalQuestionAnswering.pretrained("medical_qa_biogpt", "en", "clinical/models")
.setInputCols(Array("document_question", "document_context"))
.setOutputCol("answer")
.setMaxNewTokens(30)
.setTopK(1)
.setQuestionType("long") // "short"
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
medQA))
val paperAbstract = "In patients with Los Angeles (LA) grade C or D oesophagitis, a positive relationship has been established between the duration of intragastric acid suppression and healing.AIM: To determine whether there is an apparent optimal time of intragastric acid suppression for maximal healing of reflux oesophagitis. Post hoc analysis of data from a proof-of-concept, double-blind, randomized study of 134 adult patients treated with esomeprazole (10 or 40 mg od for 4 weeks) for LA grade C or D oesophagitis. A curve was fitted to pooled 24-h intragastric pH (day 5) and endoscopically assessed healing (4 weeks) data using piecewise quadratic logistic regression. Maximal reflux oesophagitis healing rates were achieved when intragastric pH>4 was achieved for approximately 50-70% (12-17 h) of the 24-h period. Acid suppression above this threshold did not yield further increases in healing rates."
val question = "Is there an optimal time of acid suppression for maximal healing?"
val data = Seq((paperAbstract, question)).toDF("context", "question")
+------------------------------------------------------------+------------------------------------------------------------+
| context| question|
+------------------------------------------------------------+------------------------------------------------------------+
|In patients with Los Angeles (LA) grade C or D oesophagit...|Is there an optimal time of acid suppression for maximal ...|
+------------------------------------------------------------+------------------------------------------------------------+
val result = pipeline.fit(data).transform(data)
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|Question |Long_Answer |
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|[Is there an optimal time of acid suppression for maximal healing?]|[in patients with reflux oesophagitis, maximal healing rates are obtained when intragastric pH is achieved for approximately 50 - 70 % ( 12 - 17 h )]|
+-------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val finQa = FinanceQuestionAnswering
.pretrained("finqa_flant5_finetuned", "en", "finance/models")
.setInputCols(Array("document_question", "document_context"))
.setCustomPrompt("question: {QUESTION} context: {CONTEXT}")
.setMaxNewTokens(100)
.setOutputCol("answer")
val pipeline = new Pipeline().setStages(Array(documentAssembler, finQa))
val context = "EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."
val questions = Seq(
"Which company is referred to as 'Principal' in the Distributor Agreement?",
"What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?",
"What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"
)
val data = questions.map(q => (context, q)).toDF("context", "question")
+------------------------------------------------------------------+--------------------------------------------------------------------------------+
| question| context|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+
| What are the key components of the business strategy described?|Our business strategy has been to develop data processing and product technol...|
|What is the immediate strategy for scaling the IntentKey platform?|Our business strategy has been to develop data processing and product technol...|
|How does the company aim to provide differentiation in the market?|Our business strategy has been to develop data processing and product technol...|
+------------------------------------------------------------------+--------------------------------------------------------------------------------+
val result = pipeline.fit(data).transform(data)
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|question |result |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|What are the key components of the business strategy described? |[The key components of the business strategy described are proprietary demand (media spend) and supply side (media inventory) technologies, targeting technologies, on-page or in-app ad-unit technologies, proprietary data and data management technologies, and advertising fraud detection technologies. . . ]|
|What is the immediate strategy for scaling the IntentKey platform?|[The immediate strategy for scaling the IntentKey platform is to scale through the hiring of additional sales professionals, growing existing accounts and expanding the market size by concurrently selling the SaaS version of the IntentKey beginning in 2021. ] |
|How does the company aim to provide differentiation in the market?|[The company aims to provide differentiation through the AI analytics and data products they own and protect through patents. ] |
+------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
val context = Seq("""EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day of March, 2020 Between: Co-Diagnostics, Inc. (herein referred to as "Principal") And PreCheck Health Services, Inc. (herein referred to as "Distributor"). In consideration of the mutual terms, conditions and covenants hereinafter set forth, Principal and Distributor acknowledge and agree to the following descriptions and conditions: DESCRIPTION OF PRINCIPAL The Principal is a company located in Utah, United States and is in the business of research and development of reagents. The Principal markets and sells it products globally through direct sales and distributors. DESCRIPTION OF DISTRIBUTOR The Distributor is a company operating or planning to operate in the United States of America, Latin America, Europe and Russia. The Distributor represents that the Distributor or a subsidiary of the Distributor is or will be fully licensed and registered in the Territory and will provide professional distribution services for the products of the Principal. CONDITIONS: 1. The Principal appoints the Distributor as a non-exclusive distributor, to sell Principal's qPCR infectious disease kits, Logix Smart COVID-19 PCR diagnostic test and Co-Dx Box™ instrument (the "Products"). The Products are described on Exhibit A to this Agreement. 2. The Principal grants Distributor non- exclusive rights to sell these products within the countries of Romania (the "Territory"), which may be amended by mutual written agreement."""
)
val questions = Seq(
"""Which company is referred to as 'Principal' in the Distributor Agreement?""",
"""What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?""",
"""What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?"""
)
val data = context.flatMap(c => questions.map(q => (c, q))).toDF("context", "question")
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
| context| question|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...| Which company is referred to as 'Principal' in the Distributor Agreement?|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the date of the distributor agreement between Co-Diagnostics, Inc. an...|
|EXHIBIT 99.2 Page 1 of 3 DISTRIBUTOR AGREEMENT Agreement made this 19th day o...|What is the Territory in which the Distributor has non-exclusive rights to se...|
+--------------------------------------------------------------------------------+--------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new MultiDocumentAssembler()
.setInputCols(Array("question", "context"))
.setOutputCols(Array("document_question", "document_context"))
val legQA = LegalQuestionAnswering.pretrained("legqa_flant5_finetuned", "en", "legal/models")
.setInputCols(Array("document_question", "document_context"))
.setCustomPrompt("question: {QUESTION} context: {CONTEXT}")
.setMaxNewTokens(40)
.setTopK(3)
.setOutputCol("answer")
val pipeline = new Pipeline().setStages(Array(documentAssembler, legQA))
val result = pipeline.fit(data).transform(data)
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|Question |Answer |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|[Which company is referred to as 'Principal' in the Distributor Agreement?] |[Co-Diagnostics, Inc. is referred to as 'Principal' in the Distributor Agreement. ] |
|[What is the date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc.?] |[The date of the distributor agreement between Co-Diagnostics, Inc. and PreCheck Health Services, Inc. is the 19th day of March, 2020. ]|
|[What is the Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement?]|[The Territory in which the Distributor has non-exclusive rights to sell Principal's products according to the Agreement is Romania. ] |
+----------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
REChunkMerger
REChunkMerger
annotator merges relation chunks to create a new chunk.
Parameters:
separator
: Separator to add between the chunks. Default: “ “.
Input Annotator Types: CATEGORY
Output Annotator Type: CHUNK
Python API: REChunkMerger | Scala API: REChunkMerger | Notebook: REChunkMergerNotebook |
Show Example
from johnsnowlabs import nlp, medical
documenter = nlp.DocumentAssembler() \
.setInputCol("sentence") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("tokens") \
words_embedder = nlp.WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "tokens"]) \
.setOutputCol("embeddings")
pos_tagger = nlp.PerceptronModel() \
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["document", "tokens"]) \
.setOutputCol("pos_tags")
ner_tagger = medical.NerModel() \
.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["document", "tokens", "embeddings"]) \
.setOutputCol("ner_tags")
ner_converter = medical.NerConverter() \
.setInputCols(["document", "tokens", "ner_tags"]) \
.setOutputCol("ner_chunks")
depency_parser = nlp.DependencyParserModel() \
.pretrained("dependency_conllu", "en") \
.setInputCols(["document", "pos_tags", "tokens"]) \
.setOutputCol("dependencies")
re_model = medical.RelationExtractionModel \
.pretrained("re_clinical", "en", "clinical/models") \
.setCustomLabels({"TeRP": "CustomLabel_TeRP", "TrWP": "CustomLabel_TeWP"}) \
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"]) \
.setOutputCol("re_chunk")
re_chunk_merger = medical.REChunkMerger() \
.setInputCols(["re_chunk"]) \
.setOutputCol("relation_chunks") \
.setSeparator(" && ")
nlpPipeline = nlp.Pipeline(
stages=[
documenter,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_converter,
depency_parser,
re_model,
re_chunk_merger
])
empty_data = spark.createDataFrame([[""]]).toDF("sentence")
model = nlpPipeline.fit(empty_data)
text = '''28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ). '''
result = model.transform(spark.createDataFrame([[text]]).toDF("sentence"))
# result
+----------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------+
|gestational diabetes mellitus && subsequent type two diabetes mellitus|
|gestational diabetes mellitus && T2DM |
|subsequent type two diabetes mellitus && T2DM |
+----------------------------------------------------------------------+
import spark.implicits._
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols("document", "tokens")
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols("document", "tokens")
.setOutputCol("pos_tags")
val ner_tagger = MedicalNerModel
.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols("document", "tokens", "embeddings")
.setOutputCol("ner_tags")
val ner_converter = new NerConverter()
.setInputCols("document", "tokens", "ner_tags")
.setOutputCol("ner_chunks")
val depency_parser = DependencyParserModel
.pretrained("dependency_conllu", "en")
.setInputCols("document", "pos_tags", "tokens")
.setOutputCol("dependencies")
val re_model = RelationExtractionModel
.pretrained("re_clinical", "en", "clinical/models")
.setInputCols("embeddings", "pos_tags", "ner_chunks", "dependencies")
.setOutputCol("re_chunk")
val re_chunk_merger = new REChunkMerger()
.setInputCols("re_chunk")
.setOutputCol("relation_chunks")
.setSeparator(" && ")
val pipeline = new Pipeline()
.setStages(Array(
documenter,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_converter,
depency_parser,
re_model,
re_chunk_merger
))
val text = "28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to " +
"presentation and subsequent type two diabetes mellitus ( T2DM ). "
val empty_data = Seq("").toDF("text")
val model = pipeline.fit(empty_data).transform(Seq(text).toDF("text"))
// result
+----------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------+
|gestational diabetes mellitus && subsequent type two diabetes mellitus|
|gestational diabetes mellitus && T2DM |
|subsequent type two diabetes mellitus && T2DM |
+----------------------------------------------------------------------+
RENerChunksFilter
The RENerChunksFilter
annotator filters desired relation pairs (defined by the parameter relationPairs) and stores them in the output column. Filtering the possible relations can be useful to perform additional analysis for a specific use case (e.g., checking adverse drug reactions and drug relations), which can be the input for further analysis using a pretrained RelationExtractionDLModel
.
Parameters:
- maxSyntacticDistance (Int): Maximum syntactic distance between a pair of named entities to consider them as a relation. Increasing this value increases recall, but also increases the number of false positives.
- relationPairs (List[Str]): List of dash-separated pairs of named entities. For example, [“Biomarker-RelativeDay”] will process all relations between entities of type “Biomarker” and “RelativeDay”.
- relationPairsCaseSensitive (Boolean): Determines whether relation pairs are case sensitive.
- directionSensitive (Boolean): Specifies direction sensitivity when processing relation pairs. Default: False.
- filterByTokenDistance (Int): Filtering criterion based on the number of tokens between entities (Default: 0).
For example, the ner_clinical NER model can identify PROBLEM, TEST, and TREATMENT entities. Using the RENerChunksFilter, one can keep only the relations between PROBLEM and TREATMENT entities, dropping relations between any other entity types, to further analyze the associations between clinical problems and treatments.
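As a minimal sketch of that configuration (assuming upstream stages already produce ner_chunks and dependencies columns, as in the full pipeline below; the column names and distance threshold are illustrative):
from johnsnowlabs import medical

# Keep only PROBLEM-TREATMENT candidate pairs (in either direction) whose
# entities are at most 5 syntactic hops apart; all other entity combinations
# are dropped before relation extraction.
problem_treatment_filter = medical.RENerChunksFilter() \
    .setInputCols(["ner_chunks", "dependencies"]) \
    .setOutputCol("re_ner_chunks") \
    .setMaxSyntacticDistance(5) \
    .setRelationPairs(["problem-treatment", "treatment-problem"])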
Input Annotator Types: CHUNK, DEPENDENCY
Output Annotator Type: CHUNK
Python API: RENerChunksFilter | Scala API: RENerChunksFilter | Notebook: RENerChunksFilter |
Show Example
from johnsnowlabs import nlp, medical
documenter = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
words_embedder = nlp.WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
pos_tagger = nlp.PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"])\
.setOutputCol("pos_tags")
ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
.setInputCols("sentence", "token", "embeddings")\
.setOutputCol("ner_tags")
ner_chunker = medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = nlp.DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentence", "pos_tags", "token"])\
.setOutputCol("dependencies")
ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setOutputCol("re_ner_chunks")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["drug-ade", "ade-drug"])
ade_re_model = medical.RelationExtractionDLModel()\
.pretrained('redl_ade_biobert', 'en', "clinical/models") \
.setInputCols(["re_ner_chunks", "sentences"]) \
.setPredictionThreshold(0.5)\
.setOutputCol("relations")
pipeline = nlp.Pipeline(stages=[
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_chunker,
dependency_parser,
ade_re_ner_chunk_filter,
ade_re_model
])
text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
from pyspark.sql import functions as F
result.select(
F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
F.expr("cols['0']['sentence']").alias("sentence"),
F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
F.expr("cols['0']['entity1_end']").alias("entity1_end"),
F.expr("cols['0']['chunk1']").alias("chunk1"),
F.expr("cols['0']['entity1']").alias("entity1"),
F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
F.expr("cols['0']['entity2_end']").alias("entity2_end"),
F.expr("cols['0']['chunk2']").alias("chunk2"),
F.expr("cols['0']['entity2']").alias("entity2"),
F.expr("cols['1']").alias("relation"),
F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
| 0| 25| 32| naproxen| DRUG| 137| 148| tense bullae| ADE| 1| 0.9989047|
| 0| 25| 32| naproxen| DRUG| 154| 210|cutaneous fragility on the face and the back of the hands| ADE| 1| 0.9989704|
| 0| 87| 95|oxaprozin| DRUG| 137| 148| tense bullae| ADE| 1|0.99895453|
| 0| 87| 95|oxaprozin| DRUG| 154| 210|cutaneous fragility on the face and the back of the hands| ADE| 1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
from johnsnowlabs import nlp, finance
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
text_splitter = finance.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner_dates")
ner_converter_date = finance.NerConverterInternal()\
.setInputCols(["sentence","token","ner_dates"])\
.setOutputCol("ner_chunk_date")
ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner_orgs")
ner_converter_org = finance.NerConverterInternal()\
.setInputCols(["sentence","token","ner_orgs"])\
.setOutputCol("ner_chunk_org")\
chunk_merger = finance.ChunkMergeApproach()\
.setInputCols('ner_chunk_org', "ner_chunk_date")\
.setOutputCol('ner_chunk')
pos = nlp.PerceptronModel.pretrained()\
.setInputCols(["sentence", "token"])\
.setOutputCol("pos")
dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
.setInputCols(["sentence", "pos", "token"])\
.setOutputCol("dependencies")
re_filter = finance.RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setRelationPairs(["ORG-ORG", "ORG-DATE"])\
.setMaxSyntacticDistance(10)
reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relation")\
.setPredictionThreshold(0.1)
pipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model_date,
ner_converter_date,
ner_model_org,
ner_converter_org,
chunk_merger,
pos,
dependency_parser,
re_filter,
reDL])
text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
from pyspark.sql import functions as F
result.select(
F.explode(F.arrays_zip(result.relation.metadata, result.relation.result)).alias("cols")).select(
F.expr("cols['0']['sentence']").alias("sentence"),
F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
F.expr("cols['0']['entity1_end']").alias("entity1_end"),
F.expr("cols['0']['chunk1']").alias("chunk1"),
F.expr("cols['0']['entity1']").alias("entity1"),
F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
F.expr("cols['0']['entity2_end']").alias("entity2_end"),
F.expr("cols['0']['chunk2']").alias("chunk2"),
F.expr("cols['0']['entity2']").alias("entity2"),
F.expr("cols['1']").alias("relation"),
F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
| 0| 16| 22| Cadence| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date|0.99687237|
| 0| 66| 80| AWR Corporation| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date| 0.993112|
| 0| 94| 116|Integrand Software, Inc| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date| 0.9741451|
| 0| 66| 80| AWR Corporation| ORG| 16| 22| Cadence| ORG| was_acquired_by| 0.997124|
| 0| 94| 116|Integrand Software, Inc| ORG| 16| 22| Cadence| ORG| was_acquired_by|0.99910504|
| 0| 94| 116|Integrand Software, Inc| ORG| 66| 80|AWR Corporation| ORG| was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
text_splitter = legal.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
pos_tagger = nlp.PerceptronModel().pretrained() \
.setInputCols(["sentence", "token"])\
.setOutputCol("pos_tags")
dependency_parser = nlp.DependencyParserModel() \
.pretrained("dependency_conllu", "en") \
.setInputCols(["sentence", "pos_tags", "token"]) \
.setOutputCol("dependencies")
re_filter = legal.RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunks")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(['PARTY-ALIAS', 'DOC-PARTY', 'DOC-EFFDATE'])
re_model = legal.RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")\
.setPredictionThreshold(0.1)\
.setInputCols(["re_ner_chunks", "sentence"])\
.setOutputCol("relations")
pipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model,
ner_converter,
pos_tagger,
dependency_parser,
re_filter,
re_model
])
text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
from pyspark.sql import functions as F
result.select(
F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
F.expr("cols['0']['sentence']").alias("sentence"),
F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
F.expr("cols['0']['entity1_end']").alias("entity1_end"),
F.expr("cols['0']['chunk1']").alias("chunk1"),
F.expr("cols['0']['entity1']").alias("entity1"),
F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
F.expr("cols['0']['entity2_end']").alias("entity2_end"),
F.expr("cols['0']['chunk2']").alias("chunk2"),
F.expr("cols['0']['entity2']").alias("entity2"),
F.expr("cols['1']").alias("relation"),
F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
| 0| 5| 35|INTELLECTUAL PROPERTY AGREEMENT| DOC| 69| 85|December 31, 2018|EFFDATE| dated_as| 0.9856822|
| 0| 141| 163| Armstrong Flooring, Inc| PARTY| 192| 197| Seller| ALIAS|has_alias|0.89620054|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
import spark.implicits._
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val wordsEmbedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val posTagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos_tags")
val nerTagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_tags")
val nerChunker = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_tags"))
.setOutputCol("ner_chunks")
val dependencyParser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos_tags", "token"))
.setOutputCol("dependencies")
val adeReNerChunkFilter = new RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setOutputCol("re_ner_chunks")
.setMaxSyntacticDistance(10)
.setRelationPairs(Array("drug-ade", "ade-drug"))
val adeReModel = RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models")
.setInputCols(Array("re_ner_chunks", "sentences"))
.setPredictionThreshold(0.5)
.setOutputCol("relations")
val pipeline = new Pipeline()
.setStages(Array(
documenter,
sentencer,
tokenizer,
wordsEmbedder,
posTagger,
nerTagger,
nerChunker,
dependencyParser,
adeReNerChunkFilter,
adeReModel
))
val text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
| 0| 25| 32| naproxen| DRUG| 137| 148| tense bullae| ADE| 1| 0.9989047|
| 0| 25| 32| naproxen| DRUG| 154| 210|cutaneous fragility on the face and the back of the hands| ADE| 1| 0.9989704|
| 0| 87| 95|oxaprozin| DRUG| 137| 148| tense bullae| ADE| 1|0.99895453|
| 0| 87| 95|oxaprozin| DRUG| 154| 210|cutaneous fragility on the face and the back of the hands| ADE| 1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val text_splitter = new TextSplitter()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model_date = FinanceNerModel.pretrained("finner_sec_dates", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_dates")
val ner_converter_date = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_dates"))
.setOutputCol("ner_chunk_date")
val ner_model_org = FinanceNerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_orgs")
val ner_converter_org = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_orgs"))
.setOutputCol("ner_chunk_org")
val chunk_merger = new ChunkMergeApproach()
.setInputCols(Array("ner_chunk_org", "ner_chunk_date"))
.setOutputCol("ner_chunk")
val pos = PerceptronModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos", "token"))
.setOutputCol("dependencies")
val re_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunk", "dependencies"))
.setOutputCol("re_ner_chunk")
.setRelationPairs(Array("ORG-ORG", "ORG-DATE"))
.setMaxSyntacticDistance(10)
val reDL = RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")
.setInputCols(Array("re_ner_chunk", "sentence"))
.setOutputCol("relation")
.setPredictionThreshold(0.1)
val pipeline = new Pipeline().setStages(Array(
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model_date,
ner_converter_date,
ner_model_org,
ner_converter_org,
chunk_merger,
pos,
dependency_parser,
re_filter,
reDL
))
text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""
val data = Seq(text).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
| 0| 16| 22| Cadence| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date|0.99687237|
| 0| 66| 80| AWR Corporation| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date| 0.993112|
| 0| 94| 116|Integrand Software, Inc| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date| 0.9741451|
| 0| 66| 80| AWR Corporation| ORG| 16| 22| Cadence| ORG| was_acquired_by| 0.997124|
| 0| 94| 116|Integrand Software, Inc| ORG| 16| 22| Cadence| ORG| was_acquired_by|0.99910504|
| 0| 94| 116|Integrand Software, Inc| ORG| 66| 80|AWR Corporation| ORG| was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val text_splitter = new TextSplitter()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
.setMaxSentenceLength(512)
val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pos_tagger = PerceptronModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos_tags")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos_tags", "token"))
.setOutputCol("dependencies")
val re_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunk", "dependencies"))
.setOutputCol("re_ner_chunks")
.setMaxSyntacticDistance(10)
.setRelationPairs(Array("PARTY-ALIAS", "DOC-PARTY", "DOC-EFFDATE"))
val re_model = RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")
.setPredictionThreshold(0.1)
.setInputCols(Array("re_ner_chunks", "sentence"))
.setOutputCol("relations")
val pipeline = new Pipeline()
.setStages(Array(
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model,
ner_converter,
pos_tagger,
dependency_parser,
re_filter,
re_model
))
text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
| 0| 5| 35|INTELLECTUAL PROPERTY AGREEMENT| DOC| 69| 85|December 31, 2018|EFFDATE| dated_as| 0.9856822|
| 0| 141| 163| Armstrong Flooring, Inc| PARTY| 192| 197| Seller| ALIAS|has_alias|0.89620054|
+--------+-------------+-----------+-------------------------------+-------+-------------+-----------+-----------------+-------+---------+----------+
ReIdentification
This annotator reidentifies entities that were obfuscated or masked by DeIdentification. It requires the outputs of the deidentification step as input: the deidentified document and the deidentification mappings column set with DeIdentification.setMappingsColumn
.
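As a minimal sketch of the wiring (the mappings column name below is illustrative; complete pipelines are shown in the examples that follow):
from johnsnowlabs import medical

# DeIdentification must be asked to return the entity mappings; here the
# mappings column is renamed from the default 'aux' to 'deid_mappings'.
deidentification = medical.DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask") \
    .setReturnEntityMappings(True) \
    .setMappingsColumn("deid_mappings")

# ReIdentification reads the mappings column and the deidentified document
# and restores the original text.
reidentification = medical.ReIdentification() \
    .setInputCols(["deid_mappings", "deidentified"]) \
    .setOutputCol("original")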
Input Annotator Types: DOCUMENT,CHUNK
Output Annotator Type: DOCUMENT
Python API: ReIdentification | Scala API: ReIdentification | Notebook: ReIdentificationNotebook |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
deidentification = medical.DeIdentification() \
.setInputCols(["sentence", "token", "ner_chunk"]) \
.setOutputCol("deidentified") \
.setMode("mask")\
.setReturnEntityMappings(True) # return a new column to save the mappings between the mask/obfuscated entities and original entities.
#.setMappingsColumn("MappingCol") # change the name of the column, 'aux' is default
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
deidentification])
text = """
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora ,
MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(result.sentence.result,
result.deidentified.result)).alias("cols"))\
.select(F.expr("cols['0']").alias("sentence"),
F.expr("cols['1']").alias("deidentified")).show(truncate = False)
+-----------------------------------------------------------------------+-------------------------------------------------------+
|sentence |deidentified |
+-----------------------------------------------------------------------+-------------------------------------------------------+
|Record date : 2093-01-13 , David Hale , M.D . |Record date : <DATE> , <NAME> , M.D . |
|, Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . |, Name : <NAME> , MR # <ID> Date : <DATE> . |
|PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . |PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .|<LOCATION> , <LOCATION> , Phone <CONTACT> . |
+-----------------------------------------------------------------------+-------------------------------------------------------+
reIdentification = medical.ReIdentification()\
.setInputCols(["aux","deidentified"])\
.setOutputCol("original")
reid_result = reIdentification.transform(result)
reid_result.select('original.result').show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Record date : 2093-01-13 , David Hale , M.D ., , Name : Hendrickson Ora ,MR # 7194334 Date : 01/13/93 ., PCP : Oliveira , 25 years-old , Record date : 2079-11-09 ., Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained('finner_sec_10k_summary', 'en', 'finance/models')\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
deidentification = finance.DeIdentification() \
.setInputCols(["sentence", "token", "ner_chunk"]) \
.setOutputCol("deidentified") \
.setMode("mask")\
.setReturnEntityMappings(True) # return a new column to save the mappings between the mask/obfuscated entities and original entities. Required for "ReIdentification"
#.setMappingsColumn("MappingCol") # change the name of the column, 'aux' is default
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter,
deidentification])
text= """
Commission file number 000-15867
_____________________________________
CADENCE DESIGN SYSTEMS, INC.
(Exact name of registrant as specified in its charter)
____________________________________
Delaware
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
95134
(Address of Principal Executive Offices)
(Zip Code)
(408)
-943-1234
(Registrant’s Telephone Number, including Area Code)
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("deidentified.result").show(truncate = False)
+-------------------------------------------------------------------------------------------------------------+
|result |
+-------------------------------------------------------------------------------------------------------------+
|[Commission file number <CFN>
_____________________________________
<ORG>., (Exact name of registrant as specified in its charter)
____________________________________
<STATE>
<IRS>
(State or Other Jurisdiction ofIncorporation or Organization)
(I.R.S., EmployerIdentification No., )
<ADDRESS>
95134
(Address of Principal Executive Offices)
(Zip Code)
<PHONE>
(Registrant’s Telephone Number, including Area Code)
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
<TITLE_CLASS>, <TITLE_CLASS_VALUE> par value per share
<TICKER>
<STOCK_EXCHANGE>
Securities registered pursuant to Section 12, (g) of the Act:]|
+-------------------------------------------------------------------------------------------------------------+
reIdentification = finance.ReIdentification()\
.setInputCols(["aux","deidentified"])\
.setOutputCol("original")
reid_result = reIdentification.transform(result)
reid_result.select('original.result').show(truncate=False)
+---------------------------------------------------------------------------------------------------+
|result |
+---------------------------------------------------------------------------------------------------+
|[Commission file number 000-15867
_____________________________________
CADENCE DESIGN SYSTEMS, INC., (Exact name of registrant as specified in its charter)
____________________________________
Delaware
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
(I.R.S., EmployerIdentification No., )
2655 Seely Avenue, Building 5,
San Jose,
California
95134
(Address of Principal Executive Offices)
(Zip Code)<(408)
-943-1234
(Registrant’s Telephone Number, including Area Code)
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12, (g) of the Act:]|
+---------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties_lg", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = legal.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ALIAS": "PARTY"}) # "ALIAS" are secondary names of companies, so let's extract them also as PARTY
deidentification = legal.DeIdentification() \
.setInputCols(["sentence", "token", "ner_chunk"]) \
.setOutputCol("deidentified") \
.setMode("mask")\
.setReturnEntityMappings(True) # return a new column to save the mappings between the mask/obfuscated entities and original entities. Required for "ReIdentification"
#.setMappingsColumn("MappingCol") # change the name of the column, 'aux' is default
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legal_ner,
ner_converter,
deidentification])
text = """THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware"""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("deidentified.result").show(truncate = False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS <DOC> ("Agreement") is made and entered into as of <EFFDATE> , by and between <PARTY>. a limited liability company organized and existing under the laws of the State of Delaware]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
reIdentification = legal.ReIdentification()\
.setInputCols(["aux","deidentified"])\
.setOutputCol("original")
reid_result = reIdentification.transform(result)
reid_result.select('original.result').show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val clinicalNer = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val deidentification = new DeIdentification()
.setInputCols(Array("sentence", "token", "ner_chunk"))
.setOutputCol("deidentified")
.setMode("mask")
.setReturnEntityMappings(true)
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
wordEmbeddings,
clinicalNer,
nerConverter,
deidentification
))
val text = """
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora ,
MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 .
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .
"""
val data = Seq((text)).toDF("text")
val result = pipeline.fit(data).transform(data)
+-----------------------------------------------------------------------+-------------------------------------------------------+
|sentence |deidentified |
+-----------------------------------------------------------------------+-------------------------------------------------------+
|Record date : 2093-01-13 , David Hale , M.D . |Record date : <DATE> , <NAME> , M.D . |
|, Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . |, Name : <NAME> , MR # <ID> Date : <DATE> . |
|PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . |PCP : <NAME> , <AGE> years-old , Record date : <DATE> .|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .|<LOCATION> , <LOCATION> , Phone <CONTACT> . |
+-----------------------------------------------------------------------+-------------------------------------------------------+
val reIdentification = new ReIdentification()
.setInputCols(Array("aux", "deidentified"))
.setOutputCol("original")
val reidResult = reIdentification.transform(result)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Record date : 2093-01-13 , David Hale , M.D ., , Name : Hendrickson Ora ,MR # 7194334 Date : 01/13/93 ., PCP : Oliveira , 25 years-old , Record date : 2079-11-09 ., Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .] |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val nerModel = FinanceNerModel.pretrained("finner_sec_10k_summary", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val deidentification = new DeIdentification()
.setInputCols(Array("sentence", "token", "ner_chunk"))
.setOutputCol("deidentified")
.setMode("mask")
.setReturnEntityMappings(true)
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
deidentification
))
val text = "Commission file number 000-15867
_____________________________________
CADENCE DESIGN SYSTEMS, INC.
(Exact name of registrant as specified in its charter)
____________________________________
Delaware
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
(I.R.S. EmployerIdentification No.)
2655 Seely Avenue, Building 5,
San Jose,
California
95134
(Address of Principal Executive Offices)
(Zip Code)
(408)
-943-1234
(Registrant’s Telephone Number, including Area Code)
Securities registered pursuant to Section 12(b) of the Act:
Title of Each Class
Trading Symbol(s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12(g) of the Act:"""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+-------------------------------------------------------------------------------------------------------------+
|result |
+-------------------------------------------------------------------------------------------------------------+
|[Commission file number <CFN>
_____________________________________
<ORG>., (Exact name of registrant as specified in its charter)
____________________________________
<STATE>
<IRS>
(State or Other Jurisdiction ofIncorporation or Organization)
(I.R.S., EmployerIdentification No., )
<ADDRESS>
95134
(Address of Principal Executive Offices)
(Zip Code)
<PHONE>
(Registrant’s Telephone Number, including Area Code)
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
<TITLE_CLASS>, <TITLE_CLASS_VALUE> par value per share
<TICKER>
<STOCK_EXCHANGE>
Securities registered pursuant to Section 12, (g) of the Act:]|
+-------------------------------------------------------------------------------------------------------------+
val reIdentification = new ReIdentification()
.setInputCols(Array("aux", "deidentified"))
.setOutputCol("original")
val reidResult = reIdentification.transform(result)
+---------------------------------------------------------------------------------------------------+
|result |
+---------------------------------------------------------------------------------------------------+
|[Commission file number 000-15867
_____________________________________
CADENCE DESIGN SYSTEMS, INC., (Exact name of registrant as specified in its charter)
____________________________________
Delaware
00-0000000
(State or Other Jurisdiction ofIncorporation or Organization)
(I.R.S., EmployerIdentification No., )
2655 Seely Avenue, Building 5,
San Jose,
California
95134
(Address of Principal Executive Offices)
(Zip Code)<(408)
-943-1234
(Registrant’s Telephone Number, including Area Code)
Securities registered pursuant to Section 12, (b) of the Act:
Title of Each Class
Trading Symbol, (s)
Names of Each Exchange on which Registered
Common Stock, $0.01 par value per share
CDNS
Nasdaq Global Select Market
Securities registered pursuant to Section 12, (g) of the Act:]|
+---------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val legalNer = LegalNerModel.pretrained("legner_contract_doc_parties_lg", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setReplaceLabels(Map("ALIAS" -> "PARTY"))
val deidentification = new DeIdentification()
.setInputCols(Array("sentence", "token", "ner_chunk"))
.setOutputCol("deidentified")
.setMode("mask")
.setReturnEntityMappings(true)
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legalNer,
nerConverter,
deidentification
))
val text = "THIS STRATEGIC ALLIANCE AGREEMENT (\"Agreement\") is made and entered into as of December 14, 2016, by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware"
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS <DOC> ("Agreement") is made and entered into as of <EFFDATE> , by and between <PARTY>. a limited liability company organized and existing under the laws of the State of Delaware]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
val reIdentification = new ReIdentification()
.setInputCols(Array("aux", "deidentified"))
.setOutputCol("original")
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[THIS STRATEGIC ALLIANCE AGREEMENT ("Agreement") is made and entered into as of December 14, 2016 , by and between Hyatt Franchising Latin America, L.L.C. a limited liability company organized and existing under the laws of the State of Delaware]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
RegexMatcherInternal
The RegexMatcherInternal
class implements an internal annotator approach to match a set of regular expressions with a provided entity. This approach is utilized for associating specific patterns within text data with predetermined entities, such as dates, mentioned within the text.
The class allows users to define rules using regular expressions paired with entities, offering flexibility in customization. These rules can either be directly set using the setRules
method, with a specified delimiter, or loaded from an external file using the setExternalRules
method.
Additionally, users can specify parameters such as the matching strategy (MATCH_FIRST
, MATCH_ALL
, or MATCH_COMPLETE
) to control how matches are handled. The output annotation type is CHUNK
, with input annotation types supporting DOCUMENT
. This class provides a versatile tool for implementing entity recognition based on user-defined patterns within text data.
A rule consists of a regex pattern and an identifier, delimited by a character of choice. An example could be "\\d{4}\\/\\d\\d\\/\\d\\d,date"
which will match strings like "1970/01/01"
to the identifier "date"
.
Parameters:
- strategy: Can be either MATCH_FIRST, MATCH_ALL, or MATCH_COMPLETE. Default: MATCH_ALL.
- rules: Regex rules to match the identifier with.
- delimiter: Delimiter for rules provided with setRules.
- externalRules: External resource containing the rules; needs delimiter in options.
See Spark NLP Workshop for more examples of usage.
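For instance, a minimal sketch of setting the rules inline rather than from an external file (the rule reuses the date pattern described above; the output column name is illustrative):
from johnsnowlabs import nlp, medical

document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Each rule is "<regex><delimiter><identifier>"; strings such as 1970/01/01
# are matched and labelled with the identifier "date".
date_matcher = medical.RegexMatcherInternal() \
    .setInputCols(["document"]) \
    .setOutputCol("date_chunks") \
    .setStrategy("MATCH_ALL") \
    .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date"]) \
    .setDelimiter(",")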
Input Annotator Types: DOCUMENT
Output Annotator Type: CHUNK
Python API: RegexMatcherInternal | Scala API: RegexMatcherInternal |
Show Example
from johnsnowlabs import nlp, medical
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""
data = spark.createDataFrame([[text]]).toDF("text")
rules = '''
(\d{1,3}\.){3}\d{1,3}~IPADDR
\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}|\d{2}/\d{2}/\d{2}~DATE
'''
import os
os.makedirs('./rules', exist_ok=True)

with open('./rules/regex_rules.txt', 'w') as f:
f.write(rules)
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
regex_matcher_internal = medical.RegexMatcherInternal()\
.setInputCols('document')\
.setStrategy("MATCH_ALL")\
.setOutputCol("regex_matches")\
.setExternalRules(path='./rules/regex_rules.txt', delimiter='~')
nlpPipeline = nlp.Pipeline(
stages=[
document_assembler,
regex_matcher_internal
])
result = nlpPipeline.fit(data).transform(data)
# result
+--------------+-----+---+---------+
| regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
| 2093-01-13| 38| 47| DATE|
|203.120.223.13| 97|110| IPADDR|
| 01/13/93| 188|195| DATE|
+--------------+-----+---+---------+
//rules = '''
//(\d{1,3}\.){3}\d{1,3}~IPADDR
//\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}|\d{2}/\d{2}/\d{2}~DATE
//'''
//
//with open('./rules/regex_rules.txt', 'w') as f:
// f.write(rules)
val text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
|Dr. John Green, ID: 1231511863, IP 203.120.223.13
|He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
|Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
|Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com.""".stripMargin
import spark.implicits._
val data = Seq(text).toDF("text")
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val regexMatcher = new RegexMatcherInternal()
.setInputCols("document")
.setStrategy("MATCH_ALL")
.setOutputCol("regex_matches")
.setExternalRulesPath("./rules/regex_rules.txt")
.setDelimiter("~")
val nlpPipeline = new Pipeline()
.setStages(Array(documentAssembler, regexMatcher))
val result = nlpPipeline.fit(data).transform(data)
// result
+--------------+-----+---+---------+
| regex_result|begin|end|ner_label|
+--------------+-----+---+---------+
| 2093-01-13| 38| 47| DATE|
|203.120.223.13| 97|110| IPADDR|
| 01/13/93| 188|195| DATE|
+--------------+-----+---+---------+
RelationExtraction
Extracts and classifies instances of relations between named entities.
Parameters:
- predictionThreshold (Float): Sets the minimal activation of the target unit required to encode a new relation instance.
- relationPairs (List[Str]): List of dash-separated pairs of named entities. For example, ["Biomarker-RelativeDay"] will process all relations between entities of type "Biomarker" and "RelativeDay" (see the sketch after this list).
- relationPairsCaseSensitive (Bool): Determines whether relation pairs are case sensitive.
- relationTypePerPair (dict[str, list[str]]): Entity pairs allowed per relation type, limiting which entities can form a relation. For example, {"CAUSE": ["PROBLEM", "SYMPTOM"]} only lets a "CAUSE" relation hold between a problem ("PROBLEM") and a symptom ("SYMPTOM").
- maxSyntacticDistance (Int): Maximal syntactic distance, used as a threshold (Default: 0). Determines how far the "from" entity can be from the "to" entity in the text. Increasing this value increases recall but also increases the number of false positives.
- customLabels (dict[str, str]): Custom relation labels.
- multiClass (Bool): If multiClass is set, the model returns all the labels with corresponding scores (Default: False).
- DoExceptionHandling (Bool): If set to True, the annotator tries to process as usual and, if exception-causing data (e.g. a corrupted record or document) is passed, an exception warning is emitted containing the exception message.
- scopeWindow (IntArrayParam): The scope window of feature generation for relations. The given scope window list must contain two integers.
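As an illustration of how several of these parameters work together, the following is a minimal sketch (the pretrained model name, entity pairs, and relabeling are only examples, not a recommended configuration) that restricts a posology relation model to selected pairs and renames one output label:

from johnsnowlabs import medical

# Illustrative configuration; assumes upstream columns named "embeddings",
# "pos_tags", "ner_chunks" and "dependencies", as in the full example below.
re_model = medical.RelationExtractionModel.pretrained("posology_re")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)\
    .setPredictionThreshold(0.9)\
    .setRelationPairs(["DOSAGE-DRUG", "DRUG-DURATION"])\
    .setRelationPairsCaseSensitive(False)\
    .setCustomLabels({"DOSAGE-DRUG": "dosage_of"})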
For pretrained models please see the Models Hub for available models.
Input Annotator Types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY
Output Annotator Type: CATEGORY
Python API: RelationExtractionModel | Scala API: RelationExtractionModel | Notebook: RelationExtractionModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
documenter = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = nlp.WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = nlp.PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
ner_tagger = medical.NerModel()\
.pretrained("ner_posology", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_tags")
ner_chunker = medical.NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = nlp.DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
reModel = medical.RelationExtractionModel()\
.pretrained("posology_re")\
.setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
.setOutputCol("relations")\
.setMaxSyntacticDistance(4)
pipeline = nlp.Pipeline(stages=[
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_chunker,
dependency_parser,
reModel
])
text = """
The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also
given 1 unit of Metformin daily.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night ,
12 units of insulin lispro with meals , and metformin 1000 mg two times a day.
"""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
# Show results
from pyspark.sql import functions as F

result.select(F.explode(F.arrays_zip(
result.relations.result,
result.relations.metadata)).alias("cols"))\
.select(
F.expr("cols['1']['chunk1']").alias("chunk1"),
F.expr("cols['1']['chunk2']").alias("chunk2"),
F.expr("cols['1']['entity1']").alias("entity1"),
F.expr("cols['1']['entity2']").alias("entity2"),
F.expr("cols['0']").alias("relations"),
F.expr("cols['1']['confidence']").alias("confidence")).show(5, truncate=False)
+---------+----------------+-------+---------+--------------+----------+
|chunk1 |chunk2 |entity1|entity2 |relations |confidence|
+---------+----------------+-------+---------+--------------+----------+
|1 unit |Advil |DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
|Advil |for 5 days |DRUG |DURATION |DRUG-DURATION |1.0 |
|1 unit |Metformin |DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
|Metformin|daily |DRUG |FREQUENCY|DRUG-FREQUENCY|1.0 |
|40 units |insulin glargine|DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
+---------+----------------+-------+---------+--------------+----------+
import spark.implicits._
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val words_embedder = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentences","tokens"))
.setOutputCol("embeddings")
val pos_tagger = PerceptronModel.pretrained("pos_clinical","en","clinical/models")
.setInputCols(Array("sentences","tokens"))
.setOutputCol("pos_tags")
val ner_tagger = MedicalNerModel.pretrained("ner_posology","en","clinical/models")
.setInputCols("sentences","tokens","embeddings")
.setOutputCol("ner_tags")
val ner_chunker = new NerConverterInternal()
.setInputCols(Array("sentences","tokens","ner_tags"))
.setOutputCol("ner_chunks")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu","en")
.setInputCols(Array("sentences","pos_tags","tokens"))
.setOutputCol("dependencies")
val reModel = RelationExtractionModel.pretrained("posology_re")
.setInputCols(Array("embeddings","pos_tags","ner_chunks","dependencies"))
.setOutputCol("relations")
.setMaxSyntacticDistance(4)
val pipeline = new Pipeline().setStages(Array(
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_chunker,
dependency_parser,
reModel ))
val text = " The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also given 1 unit of Metformin daily. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals ,and metformin 1000 mg two times a day. "
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df) .transform(df)
// Show results
+---------+----------------+-------+---------+--------------+----------+
|chunk1 |chunk2 |entity1|entity2 |relations |confidence|
+---------+----------------+-------+---------+--------------+----------+
|1 unit |Advil |DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
|Advil |for 5 days |DRUG |DURATION |DRUG-DURATION |1.0 |
|1 unit |Metformin |DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
|Metformin|daily |DRUG |FREQUENCY|DRUG-FREQUENCY|1.0 |
|40 units |insulin glargine|DOSAGE |DRUG |DOSAGE-DRUG |1.0 |
+---------+----------------+-------+---------+--------------+----------+
Trains a TensorFlow model for relation extraction.
To train a custom relation extraction model, you first need to create a TensorFlow graph using either the TFGraphBuilder annotator or the tf_graph module. Then, set the path to the TensorFlow graph using the method .setModelFile("path/to/tensorflow_graph.pb").
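A minimal sketch of graph creation with the tf_graph module is shown below. The import path, build parameters, and file locations are assumptions to adapt to your environment: input_dim must match the feature size of your training data and output_dim the number of relation labels.

from sparknlp_jsl.training import tf_graph

# Illustrative build parameters for a relation extraction graph.
tf_graph.build(
    "relation_extraction",
    build_params={
        "input_dim": 6000,          # size of the input feature vector (assumption)
        "output_dim": 3,            # number of relation labels (assumption)
        "batch_norm": 1,
        "hidden_layers": [300, 200],
        "hidden_act": "relu",
    },
    model_location="/tmp/re_graphs",        # where the .pb file is written (example path)
    model_filename="rel_e.in6000.out3.pb",  # illustrative file name
)

The resulting .pb file is what you then pass to .setModelFile() when configuring RelationExtractionApproach.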
If the parameter relationDirectionCol is set, the model is trained using the direction information (see the parameter description for details). Otherwise, the model will not have direction between the relations of the entities.
After training a model (using the .fit() method), the resulting object is of class RelationExtractionModel.
Parameters:
- FromEntity (begin_col: str, end_col: str, label_col: str): Sets the "from" entity.
  - begin_col: Column that references where the chunk begins.
  - end_col: Column that references where the chunk ends.
  - label_col: Column that references the type of the chunk.
- ToEntity (begin_col: str, end_col: str, label_col: str): Sets the "to" entity.
  - begin_col: Column that references where the chunk begins.
  - end_col: Column that references where the chunk ends.
  - label_col: Column that references the type of the chunk.
- CustomLabels (labels: dict[str, str]): Sets custom relation labels.
  - labels: Dictionary which maps old labels to new labels.
- RelationDirectionCol (col: str): Relation direction column (possible values are "none", "left" or "right"). If this parameter is not set, the model will not have direction between the relations of the entities.
  - col: Column that contains the relation direction values.
- PretrainedModelPath (value: str): Path to an already trained model saved to disk, used as a starting point for training the new model.
- OverrideExistingLabels (Boolean): Whether to override already learned labels when using a pretrained model to initialize the new model. A value of true overrides existing labels. Default is true.
- batchSize (Int): Size of each batch in the optimization process.
- EpochsNumber (Int): Maximum number of epochs to train.
- Dropout (Float): Dropout at the output of each layer.
- LearningRate (Float): Learning rate for the optimization process.
- OutputLogsPath (Str): Folder path to save training logs. If no path is specified, the logs are not stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or cloud storage (S3).
- ModelFile (Str): The path to the TensorFlow graph used for classification.
- FixImbalance (Boolean): Fix the imbalance in the training set by replicating examples of underrepresented categories.
- ValidationSplit (Float): The proportion of the training dataset to be used as the validation set.
- MultiClass (Boolean): If multiClass is set, the model returns all the labels with corresponding scores. By default, multiClass is false.
- MaxSyntacticDistance (Int): Maximal syntactic distance, used as a threshold (Default: 0).
- directionSensitive (Boolean): Specifies direction sensitivity in processing relation pairs.
- filterByTokenDistance (Int): Filtering criterion based on the number of tokens between entities.
- scopeWindow (IntArrayParam): The scope window of feature generation for relations. The given scope window list must contain two integers.
Input Annotator Types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY
Output Annotator Type: NONE
Python API: RelationExtractionApproach | Scala API: RelationExtractionApproach | Notebook: RelationExtractionApproachNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Defining pipeline stages to extract entities first
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("tokens")
embedder = nlp.WordEmbeddingsModel \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "tokens"]) \
.setOutputCol("embeddings")
posTagger = nlp.PerceptronModel \
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["document", "tokens"]) \
.setOutputCol("posTags")
nerTagger = medical.NerModel \
.pretrained("ner_events_clinical", "en", "clinical/models") \
.setInputCols(["document", "tokens", "embeddings"]) \
.setOutputCol("ner_tags")
nerConverter = nlp.NerConverter() \
.setInputCols(["document", "tokens", "ner_tags"]) \
.setOutputCol("nerChunks")
dependencyParser = nlp.DependencyParserModel \
.pretrained("dependency_conllu", "en") \
.setInputCols(["document", "posTags", "tokens"]) \
.setOutputCol("dependencies")
# Then define `RelationExtractionApproach` and training parameters
re = medical.RelationExtractionApproach() \
.setInputCols(["embeddings", "posTags", "train_ner_chunks", "dependencies"]) \
.setOutputCol("relations_t") \
.setLabelColumn("target_rel") \
.setEpochsNumber(300) \
.setBatchSize(200) \
.setLearningRate(0.001) \
.setModelFile("path/to/graph_file.pb") \
.setFixImbalance(True) \
.setValidationSplit(0.05) \
.setFromEntity("from_begin", "from_end", "from_label") \
.setToEntity("to_begin", "to_end", "to_label")
finisher = nlp.Finisher() \
.setInputCols(["relations_t"]) \
.setOutputCols(["relations"]) \
.setCleanAnnotations(False) \
.setValueSplitSymbol(",") \
.setAnnotationSplitSymbol(",") \
.setOutputAsArray(False)
# Define complete pipeline and start training
pipeline = nlp.Pipeline(stages=[
documentAssembler,
tokenizer,
embedder,
posTagger,
nerTagger,
nerConverter,
dependencyParser,
re,
finisher])
model = pipeline.fit(trainData)
import spark.implicits._
// Defining pipeline stages to extract entities first
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("tokens")
val embedder = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "tokens"))
.setOutputCol("embeddings")
val posTagger = PerceptronModel
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("document", "tokens"))
.setOutputCol("posTags")
val nerTagger = MedicalNerModel
.pretrained("ner_events_clinical", "en", "clinical/models")
.setInputCols(Array("document", "tokens", "embeddings"))
.setOutputCol("ner_tags")
val nerConverter = new NerConverter()
.setInputCols(Array("document", "tokens", "ner_tags"))
.setOutputCol("nerChunks")
val dependencyParser = DependencyParserModel
.pretrained("dependency_conllu", "en")
.setInputCols(Array("document", "posTags", "tokens"))
.setOutputCol("dependencies")
// Then define `RelationExtractionApproach` and training parameters
val re = new RelationExtractionApproach()
.setInputCols(Array("embeddings", "posTags", "train_ner_chunks", "dependencies"))
.setOutputCol("relations_t")
.setLabelColumn("target_rel")
.setEpochsNumber(300)
.setBatchSize(200)
.setLearningRate(0.001f)
.setModelFile("path/to/graph_file.pb")
.setFixImbalance(true)
.setValidationSplit(0.05f)
.setFromEntity("from_begin", "from_end", "from_label")
.setToEntity("to_begin", "to_end", "to_label")
val finisher = new Finisher()
.setInputCols(Array("relations_t"))
.setOutputCols(Array("relations"))
.setCleanAnnotations(false)
.setValueSplitSymbol(",")
.setAnnotationSplitSymbol(",")
.setOutputAsArray(false)
// Define complete pipeline and start training
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embedder,
posTagger,
nerTagger,
nerConverter,
dependencyParser,
re,
finisher))
val model = pipeline.fit(trainData)
RelationExtractionDL
This Relation Extraction annotator extracts and classifies instances of relations between named entities. In contrast with RelationExtractionModel, RelationExtractionDLModel is based on BERT.
Parameters:
- predictionThreshold (Float): Sets the minimal activation of the target unit required to encode a new relation instance.
- customLabels (dict[str, str]): Custom relation labels (see the sketch after this list).
- DoExceptionHandling (Bool): If set to True, the annotator tries to process as usual and, if exception-causing data (e.g. a corrupted record or document) is passed, an exception warning is emitted containing the exception message.
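For instance, models that emit generic labels (such as the binary "1"/"0" output of redl_ade_biobert) can be relabeled through customLabels. The snippet below is a minimal sketch; the replacement label names are arbitrary examples, and the input columns are assumed to come from a RENerChunksFilter stage as in the full example further down.

from johnsnowlabs import medical

# Illustrative relabeling of a binary relation model's output.
re_ade = medical.RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models")\
    .setInputCols(["re_ner_chunks", "sentence"])\
    .setOutputCol("relations")\
    .setPredictionThreshold(0.5)\
    .setCustomLabels({"1": "is_ade_of", "0": "not_related"})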
Available models can be found at the Models Hub.
For more extended examples on document pre-processing see the Spark NLP Workshop
Input Annotator Types: CHUNK, DOCUMENT
Output Annotator Type: CATEGORY
Python API: RelationExtractionDLModel | Scala API: RelationExtractionDLModel | Notebook: RelationExtractionDLModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
documenter = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
words_embedder = nlp.WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
pos_tagger = nlp.PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"])\
.setOutputCol("pos_tags")
ner_tagger = medical.NerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
.setInputCols("sentence", "token", "embeddings")\
.setOutputCol("ner_tags")
ner_chunker = medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner_tags"])\
.setOutputCol("ner_chunks")
dependency_parser = nlp.DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentence", "pos_tags", "token"])\
.setOutputCol("dependencies")
ade_re_ner_chunk_filter = medical.RENerChunksFilter() \
.setInputCols(["ner_chunks", "dependencies"])\
.setOutputCol("re_ner_chunks")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["drug-ade, ade-drug"])
ade_re_model = medical.RelationExtractionDLModel()\
.pretrained('redl_ade_biobert', 'en', "clinical/models") \
.setInputCols(["re_ner_chunks", "sentences"]) \
.setPredictionThreshold(0.5)\
.setOutputCol("relations")
pipeline = nlp.Pipeline(stages=[
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
ner_tagger,
ner_chunker,
dependency_parser,
ade_re_ner_chunk_filter,
ade_re_model
])
text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
from pyspark.sql import functions as F
result.select(
F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
F.expr("cols['0']['sentence']").alias("sentence"),
F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
F.expr("cols['0']['entity1_end']").alias("entity1_end"),
F.expr("cols['0']['chunk1']").alias("chunk1"),
F.expr("cols['0']['entity1']").alias("entity1"),
F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
F.expr("cols['0']['entity2_end']").alias("entity2_end"),
F.expr("cols['0']['chunk2']").alias("chunk2"),
F.expr("cols['0']['entity2']").alias("entity2"),
F.expr("cols['1']").alias("relation"),
F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
| 0| 25| 32| naproxen| DRUG| 137| 148| tense bullae| ADE| 1| 0.9989047|
| 0| 25| 32| naproxen| DRUG| 154| 210|cutaneous fragility on the face and the back of the hands| ADE| 1| 0.9989704|
| 0| 87| 95|oxaprozin| DRUG| 137| 148| tense bullae| ADE| 1|0.99895453|
| 0| 87| 95|oxaprozin| DRUG| 154| 210|cutaneous fragility on the face and the back of the hands| ADE| 1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
from johnsnowlabs import nlp, finance
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
text_splitter = finance.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model_date = finance.NerModel.pretrained("finner_sec_dates", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner_dates")
ner_converter_date = finance.NerConverterInternal()\
.setInputCols(["sentence","token","ner_dates"])\
.setOutputCol("ner_chunk_date")
ner_model_org= finance.NerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner_orgs")
ner_converter_org = finance.NerConverterInternal()\
.setInputCols(["sentence","token","ner_orgs"])\
.setOutputCol("ner_chunk_org")\
chunk_merger = finance.ChunkMergeApproach()\
.setInputCols('ner_chunk_org', "ner_chunk_date")\
.setOutputCol('ner_chunk')
pos = nlp.PerceptronModel.pretrained()\
.setInputCols(["sentence", "token"])\
.setOutputCol("pos")
dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
.setInputCols(["sentence", "pos", "token"])\
.setOutputCol("dependencies")
re_filter = finance.RENerChunksFilter()\
.setInputCols(["ner_chunk", "dependencies"])\
.setOutputCol("re_ner_chunk")\
.setRelationPairs(["ORG-ORG", "ORG-DATE"])\
.setMaxSyntacticDistance(10)
reDL = finance.RelationExtractionDLModel().pretrained('finre_acquisitions_subsidiaries_md', 'en', 'finance/models')\
.setInputCols(["re_ner_chunk", "sentence"])\
.setOutputCol("relation")\
.setPredictionThreshold(0.1)
pipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model_date,
ner_converter_date,
ner_model_org,
ner_converter_org,
chunk_merger,
pos,
dependency_parser,
re_filter,
reDL])
text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
from pyspark.sql import functions as F
result.select(
F.explode(F.arrays_zip(result.relation.metadata, result.relation.result)).alias("cols")).select(
F.expr("cols['0']['sentence']").alias("sentence"),
F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
F.expr("cols['0']['entity1_end']").alias("entity1_end"),
F.expr("cols['0']['chunk1']").alias("chunk1"),
F.expr("cols['0']['entity1']").alias("entity1"),
F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
F.expr("cols['0']['entity2_end']").alias("entity2_end"),
F.expr("cols['0']['chunk2']").alias("chunk2"),
F.expr("cols['0']['entity2']").alias("entity2"),
F.expr("cols['1']").alias("relation"),
F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
| 0| 16| 22| Cadence| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date|0.99687237|
| 0| 66| 80| AWR Corporation| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date| 0.993112|
| 0| 94| 116|Integrand Software, Inc| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date| 0.9741451|
| 0| 66| 80| AWR Corporation| ORG| 16| 22| Cadence| ORG| was_acquired_by| 0.997124|
| 0| 94| 116|Integrand Software, Inc| ORG| 16| 22| Cadence| ORG| was_acquired_by|0.99910504|
| 0| 94| 116|Integrand Software, Inc| ORG| 66| 80|AWR Corporation| ORG| was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
text_splitter = legal.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = legal.NerConverterInternal()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
re_model = legal.RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")\
.setPredictionThreshold(0.1)\
.setInputCols(["ner_chunk", "sentence"])\
.setOutputCol("relation")
pipeline = nlp.Pipeline(stages=[
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model,
ner_converter,
re_model
])
text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
from pyspark.sql import functions as F
result.select(
F.explode(F.arrays_zip(result.relation.metadata, result.relation.result)).alias("cols")).select(
F.expr("cols['0']['sentence']").alias("sentence"),
F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
F.expr("cols['0']['entity1_end']").alias("entity1_end"),
F.expr("cols['0']['chunk1']").alias("chunk1"),
F.expr("cols['0']['entity1']").alias("entity1"),
F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
F.expr("cols['0']['entity2_end']").alias("entity2_end"),
F.expr("cols['0']['chunk2']").alias("chunk2"),
F.expr("cols['0']['entity2']").alias("entity2"),
F.expr("cols['1']").alias("relation"),
F.expr("cols['0']['confidence']").alias("confidence"),
).filter("relation != 'no_rel'").show(truncate=70)
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
| 0| 5| 35| INTELLECTUAL PROPERTY AGREEMENT| DOC| 69| 85| December 31, 2018|EFFDATE| dated_as| 0.9856822|
| 0| 5| 35| INTELLECTUAL PROPERTY AGREEMENT| DOC| 141| 163|Armstrong Flooring, Inc| PARTY| signed_by| 0.7816506|
| 0| 5| 35| INTELLECTUAL PROPERTY AGREEMENT| DOC| 205| 221| AFI Licensing LLC| PARTY| signed_by|0.53521496|
| 0| 141| 163| Armstrong Flooring, Inc| PARTY| 192| 197| Seller| ALIAS| has_alias| 0.8962001|
| 0| 205| 221| AFI Licensing LLC| PARTY| 263| 271| Licensing| ALIAS| has_alias|0.95189077|
| 0| 292| 297| Seller| ALIAS| 301| 307| Arizona| ALIAS|has_collective_alias| 0.8934925|
| 1| 411| 445|Armstrong Hardwood Flooring Company| PARTY| 478| 484| Company| ALIAS| has_alias|0.98353034|
| 1| 505| 509| Buyer| ALIAS| 516| 529| Buyer Entities| ALIAS|has_collective_alias| 0.7217146|
| 1| 611| 615| Party| ALIAS| 641| 647| Parties| ALIAS|has_collective_alias| 0.5040909|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
import spark.implicits._
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val wordsEmbedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val posTagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos_tags")
val nerTagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_tags")
val nerChunker = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_tags"))
.setOutputCol("ner_chunks")
val dependencyParser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos_tags", "token"))
.setOutputCol("dependencies")
val adeReNerChunkFilter = new RENerChunksFilter()
.setInputCols(Array("ner_chunks", "dependencies"))
.setOutputCol("re_ner_chunks")
.setMaxSyntacticDistance(10)
.setRelationPairs(Array("drug-ade", "ade-drug"))
val adeReModel = RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models")
.setInputCols(Array("re_ner_chunks", "sentences"))
.setPredictionThreshold(0.5)
.setOutputCol("relations")
val pipeline = new Pipeline()
.setStages(Array(
documenter,
sentencer,
tokenizer,
wordsEmbedder,
posTagger,
nerTagger,
nerChunker,
dependencyParser,
adeReNerChunkFilter,
adeReModel
))
val text = """A 44-year-old man taking naproxen for chronic low back pain and a 20-year-old woman on oxaprozin for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2|relation|confidence|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
| 0| 25| 32| naproxen| DRUG| 137| 148| tense bullae| ADE| 1| 0.9989047|
| 0| 25| 32| naproxen| DRUG| 154| 210|cutaneous fragility on the face and the back of the hands| ADE| 1| 0.9989704|
| 0| 87| 95|oxaprozin| DRUG| 137| 148| tense bullae| ADE| 1|0.99895453|
| 0| 87| 95|oxaprozin| DRUG| 154| 210|cutaneous fragility on the face and the back of the hands| ADE| 1|0.99900633|
+--------+-------------+-----------+---------+-------+-------------+-----------+---------------------------------------------------------+-------+--------+----------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val text_splitter = new TextSplitter()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en", "finance/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model_date = NerModel.pretrained("finner_sec_dates", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_dates")
val ner_converter_date = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_dates"))
.setOutputCol("ner_chunk_date")
val ner_model_org = FinanceNerModel.pretrained("finner_orgs_prods_alias", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_orgs")
val ner_converter_org = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner_orgs"))
.setOutputCol("ner_chunk_org")
val chunk_merger = new ChunkMergeApproach()
.setInputCols(Array("ner_chunk_org", "ner_chunk_date"))
.setOutputCol("ner_chunk")
val pos = PerceptronModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentence", "pos", "token"))
.setOutputCol("dependencies")
val re_filter = new RENerChunksFilter()
.setInputCols(Array("ner_chunk", "dependencies"))
.setOutputCol("re_ner_chunk")
.setRelationPairs(Array("ORG-ORG", "ORG-DATE"))
.setMaxSyntacticDistance(10)
val reDL = RelationExtractionDLModel.pretrained("finre_acquisitions_subsidiaries_md", "en", "finance/models")
.setInputCols(Array("re_ner_chunk", "sentence"))
.setOutputCol("relation")
.setPredictionThreshold(0.1)
val pipeline = new Pipeline().setStages(Array(
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model_date,
ner_converter_date,
ner_model_org,
ner_converter_org,
chunk_merger,
pos,
dependency_parser,
re_filter,
reDL
))
text = """In fiscal 2020, Cadence acquired all of the outstanding equity of AWR Corporation (“AWR”) and Integrand Software, Inc. (“Integrand”)."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
| 0| 16| 22| Cadence| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date|0.99687237|
| 0| 66| 80| AWR Corporation| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date| 0.993112|
| 0| 94| 116|Integrand Software, Inc| ORG| 3| 13| fiscal 2020| DATE|has_acquisition_date| 0.9741451|
| 0| 66| 80| AWR Corporation| ORG| 16| 22| Cadence| ORG| was_acquired_by| 0.997124|
| 0| 94| 116|Integrand Software, Inc| ORG| 16| 22| Cadence| ORG| was_acquired_by|0.99910504|
| 0| 94| 116|Integrand Software, Inc| ORG| 66| 80|AWR Corporation| ORG| was_acquired_by|0.93245244|
+--------+-------------+-----------+-----------------------+-------+-------------+-----------+---------------+-------+--------------------+----------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val text_splitter = new TextSplitter()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
.setMaxSentenceLength(512)
val ner_model = LegalNerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val re_model = RelationExtractionDLModel.pretrained("legre_contract_doc_parties", "en", "legal/models")
.setPredictionThreshold(0.1)
.setInputCols(Array("ner_chunk", "sentence"))
.setOutputCol("relation")
val pipeline = new Pipeline()
.setStages(Array(
document_assembler,
text_splitter,
tokenizer,
embeddings,
ner_model,
ner_converter,
re_model
))
text = """This INTELLECTUAL PROPERTY AGREEMENT (this "Agreement"), dated as of December 31, 2018 (the "Effective Date") is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ("Seller") and AFI Licensing LLC, a Delaware limited liability company ("Licensing" and together with Seller, "Arizona") and AHF Holding, Inc. (formerly known as Tarzan HoldCo, Inc.), a Delaware corporation ("Buyer") and Armstrong Hardwood Flooring Company, a Tennessee corporation (the "Company" and together with Buyer the "Buyer Entities") (each of Arizona on the one hand and the Buyer Entities on the other hand, a "Party" and collectively, the "Parties")."""
val data = Seq(text).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2| relation|confidence|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
| 0| 5| 35| INTELLECTUAL PROPERTY AGREEMENT| DOC| 69| 85| December 31, 2018|EFFDATE| dated_as| 0.9856822|
| 0| 5| 35| INTELLECTUAL PROPERTY AGREEMENT| DOC| 141| 163|Armstrong Flooring, Inc| PARTY| signed_by| 0.7816506|
| 0| 5| 35| INTELLECTUAL PROPERTY AGREEMENT| DOC| 205| 221| AFI Licensing LLC| PARTY| signed_by|0.53521496|
| 0| 141| 163| Armstrong Flooring, Inc| PARTY| 192| 197| Seller| ALIAS| has_alias| 0.8962001|
| 0| 205| 221| AFI Licensing LLC| PARTY| 263| 271| Licensing| ALIAS| has_alias|0.95189077|
| 0| 292| 297| Seller| ALIAS| 301| 307| Arizona| ALIAS|has_collective_alias| 0.8934925|
| 1| 411| 445|Armstrong Hardwood Flooring Company| PARTY| 478| 484| Company| ALIAS| has_alias|0.98353034|
| 1| 505| 509| Buyer| ALIAS| 516| 529| Buyer Entities| ALIAS|has_collective_alias| 0.7217146|
| 1| 611| 615| Party| ALIAS| 641| 647| Parties| ALIAS|has_collective_alias| 0.5040909|
+--------+-------------+-----------+-----------------------------------+-------+-------------+-----------+-----------------------+-------+--------------------+----------+
Replacer
Replacer allows replacing entities in the original text with the ones extracted by the annotators NameChunkObfuscatorApproach or DateNormalizer.
Replacer is most often used in conjunction with the DateNormalizer annotator or in deidentification pipelines. For dates, the Replacer annotator replaces specific tokens in a text with another token or string, while the DateNormalizer annotator normalizes dates and times to a standardized format.
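The following is a minimal sketch of that combination, assuming the date chunks are produced by a simple RegexMatcherInternal rule; the rule pattern, file path, and column names are illustrative only.

from johnsnowlabs import nlp, medical

# Write an illustrative regex rule file that tags dates (the path is an example).
with open("date_rules.txt", "w") as f:
    f.write("\\d{2}/\\d{2}/\\d{4}~DATE\n")

document_assembler = nlp.DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

date_matcher = medical.RegexMatcherInternal()\
    .setInputCols("document")\
    .setOutputCol("date_chunk")\
    .setExternalRules(path="date_rules.txt", delimiter="~")

# Normalize the matched date chunks to a standardized format.
date_normalizer = medical.DateNormalizer()\
    .setInputCols("date_chunk")\
    .setOutputCol("normalized_date")

# Replace the original date mentions in the document with the normalized ones.
replacer = medical.Replacer()\
    .setInputCols("normalized_date", "document")\
    .setOutputCol("document_normalized")

pipeline = nlp.Pipeline(stages=[document_assembler, date_matcher, date_normalizer, replacer])

data = spark.createDataFrame([["She was admitted on 11/29/2022 and discharged on 12/02/2022."]]).toDF("text")
result = pipeline.fit(data).transform(data)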
Obfuscation in healthcare is the act of making healthcare data difficult to understand or use without authorization. This can be done by replacing or removing identifying information, such as names, dates of birth, and Social Security numbers. Obfuscation can also be used to hide the contents of healthcare records, such as diagnoses, medications, and treatment plans.
In the deidentification process, the Replacer annotator is used to replace certain tokens or patterns in the text with specified values. For example, it can be used to replace all instances of a person's name with a placeholder like "PERSON".
The NameChunkObfuscatorApproach annotator is used to identify and obfuscate sensitive named entities in the text, such as people's names, addresses, dates of birth, SSNs, etc.
Parameters:
- setUseReplacement: (Boolean) Select what output format should be used. By default it will use the current day.
- setNoneValuesTo: (String) Determines the action to take when encountering a value of 'NONE' in the annotation. This parameter can take one of the following three string values:
  - "entity": Replaces 'NONE' values with the entity field extracted from the annotation, if available. If the entity field is not available, it uses the string "NONE" wrapped by the specified delimiters.
  - "place_holder": Replaces 'NONE' values with a placeholder string wrapped by the specified delimiters.
  - "skip": Retains the original target_text from the annotation's metadata if available. If not available, it retains the original annotation result.
- setPlaceHolder: (String) Sets the placeholder string to use when noneValuesTo is set to "place_holder". This placeholder string will be wrapped by the delimiters defined in placeHolderDelimiters.
- setPlaceHolderDelimiters: (List[String]) Sets the delimiters that wrap the placeholder string.
- mappingsColumn: (String) Column name for mapping. This column maps the annotations to their corresponding chunks before the entities are replaced.
- returnEntityMappings: (Boolean) With this property you select whether to return the mapping column.
- staticEntityMappingsFallback: (String) Fallback option for static entity mappings. Allowed values: 'entity', 'place_holder', 'skip', 'error'.
- staticEntityMappings: (dict) Static entity mappings. A dictionary with entity types as keys and replacement values as values.
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: DOCUMENT
Python API: Replacer | Scala API: Replacer | Notebook: ReplacerNotebook |
Show Example
from johnsnowlabs import nlp, medical
names = """Mitchell#NAME
Clifford#NAME
Jeremiah#NAME
Lawrence#NAME
Brittany#NAME
Patricia#NAME
Samantha#NAME
Jennifer#NAME
Jackson#NAME
Leonard#NAME
Randall#NAME
Camacho#NAME
Ferrell#NAME
Mueller#NAME
Bowman#NAME
Hansen#NAME
Acosta#NAME
Gillespie#NAME
Zimmerman#NAME
Gillespie#NAME
Chandler#NAME
Bradshaw#NAME
Ferguson#NAME
Jacobson#NAME
Figueroa#NAME
Chandler#NAME
Schaefer#NAME
Matthews#NAME
Ferguson#NAME
Bradshaw#NAME
Figueroa#NAME
Delacruz#NAME
Gallegos#NAME
Villarreal#NAME
Williamson#NAME
Montgomery#NAME
Mclaughlin#NAME
Blankenship#NAME
Fitzpatrick#NAME
"""
with open('names_test.txt', 'w') as file:
file.write(names)
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("sentence")\
# Tokenizer splits words in a relevant format for NLP
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")\
# Clinical word embeddings trained on PubMED dataset
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
# NER model trained on n2c2 (de-identification and Heart Disease Risk Factors Challenge) datasets)
clinical_ner = medical.NerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter_name = medical.NerConverterInternal()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nameChunkObfuscator = medical.NameChunkObfuscatorApproach()\
.setInputCols("ner_chunk")\
.setOutputCol("replacement")\
.setRefFileFormat("csv")\
.setObfuscateRefFile("names_test.txt")\
.setRefSep("#")\
replacer_name = medical.Replacer()\
.setInputCols("replacement","sentence")\
.setOutputCol("obfuscated_document_name")\
.setUseReplacement(False) \
.setNoneValuesTo("entity") \
.setPlaceHolder("******") \
.setPlaceHolderDelimiters(["<", ">"])
nlpPipeline = nlp.Pipeline(stages=[
documentAssembler,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter_name,
nameChunkObfuscator,
replacer_name
])
sample_text = "John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment."
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
## Result
Original text. : John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.
Obfuscated text : Joseeduardo is a 62 y.o. patient admitted. Mr. Teigan was seen by attending physician Dr. Mayson and was scheduled for emergency assessment.
import spark.implicits._
/* names.txt file
names = """Mitchell#NAME
Clifford#NAME
Jeremiah#NAME
Lawrence#NAME
Brittany#NAME
Patricia#NAME
Samantha#NAME
Jennifer#NAME
Jackson#NAME
Leonard#NAME
Randall#NAME
Camacho#NAME
Ferrell#NAME
Mueller#NAME
Bowman#NAME
Hansen#NAME
Acosta#NAME
Gillespie#NAME
Zimmerman#NAME
Gillespie#NAME
Chandler#NAME
Bradshaw#NAME
Ferguson#NAME
Jacobson#NAME
Figueroa#NAME
Chandler#NAME
Schaefer#NAME
Matthews#NAME
Ferguson#NAME
Bradshaw#NAME
Figueroa#NAME
Delacruz#NAME
Gallegos#NAME
Villarreal#NAME
Williamson#NAME
Montgomery#NAME
Mclaughlin#NAME
Blankenship#NAME
Fitzpatrick#NAME
"""
*/
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented","en","clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter_name = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
val nameChunkObfuscator = new NameChunkObfuscatorApproach()
.setInputCols("ner_chunk")
.setOutputCol("replacement")
.setRefFileFormat("csv")
.setObfuscateRefFile("names_test.txt")
.setRefSep("//")
val replacer_name = new Replacer()
.setInputCols("replacement","sentence")
.setOutputCol("obfuscated_document_name")
.setUseReplacement(true)
.setNoneValuesTo("entity")
.setPlaceHolder("******")
.setPlaceHolderDelimiters(["<", ">"])
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter_name,
nameChunkObfuscator,
replacer_name))
val test_data = Seq("""John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.""").toDF("text")
val res = nlpPipeline.fit(test_data).transform(test_data)
// Result
Original text. : John Davies is a 62 y.o. patient admitted. Mr. Davies was seen by attending physician Dr. Lorand and was scheduled for emergency assessment.
Obfuscated text : Joseeduardo is a 62 y.o. patient admitted. Mr. Teigan was seen by attending physician Dr. Mayson and was scheduled for emergency assessment.
Resolution2Chunk
This annotator is responsible for converting the annotations generated by entity resolver models (typically labeled as ENTITY) into a format compatible with subsequent stages of the pipeline, such as the ChunkMapperModel. It transforms these annotations into CHUNK annotations, allowing for seamless integration and processing of clinical terminologies and entities in the pipeline.
Input Annotator Types: Resolution
Output Annotator Type: CHUNK
Python API: Resolution2Chunk | Scala API: Resolution2Chunk | Notebook: Resolution2ChunkNotebook |
Show Example
from johnsnowlabs import medical, nlp
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
sbert_embedder = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en","clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("sentence_embeddings")\
.setCaseSensitive(False)
rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
resolver2chunk = medical.Resolution2Chunk()\
.setInputCols(["rxnorm_code"]) \
.setOutputCol("resolver2chunk")
chunkerMapper_action = medical.ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper", "en", "clinical/models")\
.setInputCols(["resolver2chunk"])\
.setOutputCol("action_mapping")\
.setRels(["action"]) #for treatment
pipeline = nlp.Pipeline().setStages([document_assembler,
sbert_embedder,
rxnorm_resolver,
resolver2chunk,
chunkerMapper_action
])
data= spark.createDataFrame([['Zonalon 50 mg']]).toDF('text')
res= pipeline.fit(data).transform(data)
# Example results
from pyspark.sql import functions as F

res.select(F.explode(F.arrays_zip(res.ner_chunk.result,
res.rxnorm_code.result,
res.action_mapping.result)).alias("col"))\
.select(F.expr("col['0']").alias("document"),
F.expr("col['1']").alias("rxnorm_code"),
F.expr("col['2']").alias("Action Mapping")).show(truncate=False)
+-------------+-----------+--------------+
|document |rxnorm_code|Action Mapping|
+-------------+-----------+--------------+
|Zonalon 50 mg|103971 |Analgesic |
+-------------+-----------+--------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("ner_chunk")
.setOutputCol("sentence_embeddings")
.setCaseSensitive(false)
val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en","clinical/models")
.setInputCols("sentence_embeddings")
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")
val resolver2chunk = new Resolution2Chunk()
.setInputCols("rxnorm_code")
.setOutputCol("resolver2chunk")
val chunkerMapper_action = ChunkMapperModel.pretrained("rxnorm_action_treatment_mapper","en","clinical/models")
.setInputCols("resolver2chunk")
.setOutputCol("action_mapping")
.setRels("action")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sbert_embedder,
rxnorm_resolver,
resolver2chunk,
chunkerMapper_action ))
val data = Seq("Zonalon 50 mg").toDF("text")
val res = pipeline.fit(data).transform(data)
// Example results
+-------------+-----------+--------------+
|document |rxnorm_code|Action Mapping|
+-------------+-----------+--------------+
|Zonalon 50 mg|103971 |Analgesic |
+-------------+-----------+--------------+
ResolverMerger
ResolverMerger provides the ability to merge sentence entity resolver and chunk mapper model output columns.
To convert a sentence or document into a vector for tasks like semantic search or recommendation systems, a common approach is to utilize transformer models like BERT. These models provide embeddings for each token in the text. One option is to extract the embedding vector of the CLS token, which represents the overall meaning of the text. Another option is to average the embeddings of all tokens.
Alternatively, we can use fine-tuned Siamese network variants like SBERT, which are specifically designed to generate embeddings that bring similar sentences or documents closer together in the embedding space while separating dissimilar ones. These embeddings can be applied in “Sentence Entity Resolver Models” to perform entity mapping.
However, for a more straightforward approach, we can use a chunk mapper method to extract entities from the text. In addition, by combining resolver models and mapper models using the ResolverMerger
annotator, we can further enhance the performance and accuracy of the resolver system.
Parameters:
- inputCols: The name of the columns containing the input annotations. It can read an Array of strings.
- outputCol: The name of the column in Document type that is generated. We can specify only one column here.

All the parameters can be set using the corresponding set method in camel case. For example, .setInputCols().
Input Annotator Types: ENTITY, LABEL_DEPENDENCY
Output Annotator Type: ENTITY
Python API: ResolverMerger | Scala API: ResolverMerger | Notebook: ResolverMergerNotebook |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = medical.NerModel.pretrained("ner_posology_greedy", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal()\
.setInputCols("sentence", "token", "ner")\
.setOutputCol("chunk")
chunkerMapper = medical.ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
.setInputCols(["chunk"])\
.setOutputCol("RxNorm_Mapper")\
.setRel("rxnorm_code")
cfModel = medical.ChunkMapperFilterer() \
.setInputCols(["chunk", "RxNorm_Mapper"]) \
.setOutputCol("chunks_fail") \
.setReturnCriteria("fail")
chunk2doc = nlp.Chunk2Doc() \
.setInputCols("chunks_fail") \
.setOutputCol("doc_chunk")
sbert_embedder = nlp.BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
.setInputCols(["doc_chunk"])\
.setOutputCol("sentence_embeddings")\
.setCaseSensitive(False)
resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("resolver_code") \
.setDistanceFunction("EUCLIDEAN")
resolverMerger = medical.ResolverMerger()\
.setInputCols(["resolver_code","RxNorm_Mapper"])\
.setOutputCol("RxNorm")
mapper_pipeline = nlp.Pipeline(
stages = [
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
chunkerMapper,
cfModel,
chunk2doc,
sbert_embedder,
resolver,
resolverMerger
])
sample_text = [
["The patient was given Adapin 10 MG, coumadn 5 mg"],
["The patient was given Avandia 4 mg, Tegretol, zitiga"],
]
data = spark.createDataFrame(sample_text).toDF("text")
result = mapper_pipeline.fit(data).transform(data)
result.selectExpr(
"chunk.result as chunk",
"RxNorm_Mapper.result as RxNorm_Mapper",
"chunks_fail.result as chunks_fail",
"resolver_code.result as resolver_code",
"RxNorm.result as RxNorm",
).show(truncate=False)
## Result
+--------------------------------+----------------------+--------------+-------------+------------------------+
|chunk |RxNorm_Mapper |chunks_fail |resolver_code|RxNorm |
+--------------------------------+----------------------+--------------+-------------+------------------------+
|[Adapin 10 MG, coumadn 5 mg] |[1000049, NONE] |[coumadn 5 mg]|[200883] |[1000049, 200883] |
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga] |[220989] |[261242, 203029, 220989]|
+--------------------------------+----------------------+--------------+-------------+------------------------+
import spark.implicits._
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_posology_greedy","en","clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("chunk")
val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper","en","clinical/models")
.setInputCols("chunk")
.setOutputCol("RxNorm_Mapper")
.setRel("rxnorm_code")
val cfModel = new ChunkMapperFilterer()
.setInputCols(Array("chunk","RxNorm_Mapper"))
.setOutputCol("chunks_fail")
.setReturnCriteria("fail")
val chunk2doc = new Chunk2Doc()
.setInputCols("chunks_fail")
.setOutputCol("doc_chunk")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("doc_chunk")
.setOutputCol("sentence_embeddings")
.setCaseSensitive(false)
val resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en","clinical/models")
.setInputCols("sentence_embeddings")
.setOutputCol("resolver_code")
.setDistanceFunction("EUCLIDEAN")
val resolverMerger = new ResolverMerger()
.setInputCols(Array("resolver_code","RxNorm_Mapper"))
.setOutputCol("RxNorm")
val mapper_pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
chunkerMapper,
cfModel,
chunk2doc,
sbert_embedder,
resolver,
resolverMerger))
val data = Seq(("""The patient was given Adapin 10 MG, coumadn 5 mg"""),("""The patient was given Avandia 4 mg, Tegretol, zitiga""")).toDF("text")
val res = mapper_pipeline.fit(data).transform(data)
// Show results
+--------------------------------+----------------------+--------------+-------------+------------------------+
|chunk |RxNorm_Mapper |chunks_fail |resolver_code|RxNorm |
+--------------------------------+----------------------+--------------+-------------+------------------------+
|[Adapin 10 MG, coumadn 5 mg] |[1000049, NONE] |[coumadn 5 mg]|[200883] |[1000049, 200883] |
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga] |[220989] |[261242, 203029, 220989]|
+--------------------------------+----------------------+--------------+-------------+------------------------+
Router
The Router annotator provides the ability to split the output of an annotator based on a selected metadata field and the values of that field.
Previously, when several sentence entity resolver models were needed in the same pipeline, the BertSentenceEmbeddings annotator had to be run once per resolver model, so the heavy process of generating BERT sentence embeddings was repeated several times.
To address this, the Spark NLP Healthcare library introduced the Router annotator. With this approach, all named entity recognition (NER) chunks are provided to the BertSentenceEmbeddings annotator at once, the sentence embeddings are generated for all chunks together, and the resulting embeddings are then routed to the specific resolver models that need them.
This eliminates the need to run BertSentenceEmbeddings multiple times, reducing computational overhead and improving the efficiency of the pipeline.
Parameters:
- `inputCols`: The name of the columns containing the input annotations. It can read an Array of strings.
- `outputCol`: The name of the column that is generated. We can specify only one column here.
- `inputType`: The type of annotation that you want to filter (by default `sentence_embeddings`). Possible values: `document|token|wordpiece|word_embeddings|sentence_embeddings|category|date|sentiment|pos|chunk|named_entity|regex|dependency|labeled_dependency|language|keyword`.
- `metadataField`: The key in the metadata dictionary that you want to filter (by default `entity`).
- `filterFieldsElements`: The allowed values for the metadata field that is being used.
All the parameters can be set using the corresponding set method in camel case. For example, `.setInputCols()`.
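As a minimal sketch, assuming the upstream NER converter writes the entity label into each chunk's metadata under the key `entity` (the default `metadataField`), a `Router` can be configured explicitly; the setter names below are derived from the parameter names via the camel-case convention stated above:

from johnsnowlabs import medical

# Route only the embeddings of PROBLEM chunks to a dedicated column,
# so the downstream resolver receives just the embeddings it needs.
problem_router = medical.Router() \
    .setInputCols(["sbert_embeddings"]) \
    .setInputType("sentence_embeddings") \
    .setMetadataField("entity") \
    .setFilterFieldsElements(["PROBLEM"]) \
    .setOutputCol("problem_embeddings")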
Input Annotator Types: ENTITY, LABEL_DEPENDENCY
Output Annotator Type: ENTITY
Python API: Router | Scala API: Router | Notebook: RouterNotebook |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols("document")\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols("sentence", "token")\
.setOutputCol("word_embeddings")
# to get PROBLEM entities
clinical_ner = medical.NerModel().pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "word_embeddings"]) \
.setOutputCol("clinical_ner")
clinical_ner_chunk = medical.NerConverterInternal()\
.setInputCols("sentence","token","clinical_ner")\
.setOutputCol("clinical_ner_chunk")\
.setWhiteList(["PROBLEM"])
# to get DRUG entities
posology_ner = medical.NerModel().pretrained("ner_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "word_embeddings"]) \
.setOutputCol("posology_ner")
posology_ner_chunk = medical.NerConverterInternal()\
.setInputCols("sentence","token","posology_ner")\
.setOutputCol("posology_ner_chunk")\
.setWhiteList(["DRUG"])
# merge the chunks into a single ner_chunk
chunk_merger = medical.ChunkMergeApproach()\
.setInputCols("clinical_ner_chunk","posology_ner_chunk")\
.setOutputCol("final_ner_chunk")\
.setMergeOverlapping(False)
# convert chunks to doc to get sentence embeddings of them
chunk2doc = nlp.Chunk2Doc().setInputCols("final_ner_chunk").setOutputCol("doc_final_chunk")
sbiobert_embeddings = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["doc_final_chunk"])\
.setOutputCol("sbert_embeddings")\
.setCaseSensitive(False)
# filter PROBLEM entity embeddings
router_sentence_icd10 = medical.Router() \
.setInputCols("sbert_embeddings") \
.setFilterFieldsElements(["PROBLEM"]) \
.setOutputCol("problem_embeddings")
# filter DRUG entity embeddings
router_sentence_rxnorm = medical.Router() \
.setInputCols("sbert_embeddings") \
.setFilterFieldsElements(["DRUG"]) \
.setOutputCol("drug_embeddings")
# use problem_embeddings only
icd_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc","en", "clinical/models") \
.setInputCols(["problem_embeddings"]) \
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN")
# use drug_embeddings only
rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented","en", "clinical/models") \
.setInputCols(["drug_embeddings"]) \
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
clinical_ner_chunk,
posology_ner,
posology_ner_chunk,
chunk_merger,
chunk2doc,
sbiobert_embeddings,
router_sentence_icd10,
router_sentence_rxnorm,
icd_resolver,
rxnorm_resolver
])
clinical_note = """The patient is a 41-year-old Vietnamese female with a cough that started last week.
She has had right-sided chest pain radiating to her back with fever starting yesterday.
She has a history of pericarditis in May 2006 and developed cough with right-sided chest pain.
MEDICATIONS
1. Coumadin 1 mg daily. Last INR was on Tuesday, August 14, 2007, and her INR was 2.3.
2. Amiodarone 100 mg p.o. daily.
"""
data = spark.createDataFrame([[clinical_note]]).toDF("text")
result = pipeline.fit(data).transform(data)
## Result
result.selectExpr(
"final_ner_chunk.result as chunk",
"posology_ner_chunk.result as posology_chunk",
"rxnorm_code.result as rxnorm_code",
"clinical_ner_chunk.result as clinical_chunk",
"icd10cm_code.result as icd10cm_code",
).show(truncate=False)
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|chunk |posology_chunk |rxnorm_code |clinical_chunk |icd10cm_code |
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain, Coumadin, Amiodarone]|[Coumadin, Amiodarone]|[202421, 703]|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain]|[R05, R10.11, A68, I30.1, R05, R10.11]|
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("word_embeddings")
// to get PROBLEM entities
val clinical_ner = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("clinical_ner")
val clinical_ner_chunk = new NerConverterInternal()
.setInputCols("sentence","token","clinical_ner")
.setOutputCol("clinical_ner_chunk")
.setWhiteList("PROBLEM")
// to get DRUG entities
val posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
.setInputCols(Array("sentence","token","word_embeddings"))
.setOutputCol("posology_ner")
val posology_ner_chunk = new NerConverterInternal()
.setInputCols("sentence","token","posology_ner")
.setOutputCol("posology_ner_chunk")
.setWhiteList("DRUG")
// merge the chunks into a single ner_chunk
val chunk_merger = new ChunkMergeApproach()
.setInputCols(Array("clinical_ner_chunk","posology_ner_chunk"))
.setOutputCol("final_ner_chunk")
.setMergeOverlapping(false)
// convert chunks to doc to get sentence embeddings of them
val chunk2doc = new Chunk2Doc()
.setInputCols("final_ner_chunk")
.setOutputCol("doc_final_chunk")
val sbiobert_embeddings = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols("doc_final_chunk")
.setOutputCol("sbert_embeddings")
.setCaseSensitive(false)
// filter PROBLEM entity embeddings
val router_sentence_icd10 = new Router()
.setInputCols("sbert_embeddings")
.setFilterFieldsElements("PROBLEM")
.setOutputCol("problem_embeddings")
// filter DRUG entity embeddings
val router_sentence_rxnorm = new Router()
.setInputCols("sbert_embeddings")
.setFilterFieldsElements("DRUG")
.setOutputCol("drug_embeddings")
// use problem_embeddings only
val icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_slim_billable_hcc", "en", "clinical/models")
.setInputCols("problem_embeddings")
.setOutputCol("icd10cm_code")
.setDistanceFunction("EUCLIDEAN")
// use drug_embeddings only
val rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")
.setInputCols("drug_embeddings")
.setOutputCol("rxnorm_code")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
clinical_ner_chunk,
posology_ner,
posology_ner_chunk,
chunk_merger,
chunk2doc,
sbiobert_embeddings,
router_sentence_icd10,
router_sentence_rxnorm,
icd_resolver,
rxnorm_resolver))
val data = Seq("""The patient is a 41-year-old Vietnamese female with a cough that started last week.
She has had right-sided chest pain radiating to her back with fever starting yesterday.
She has a history of pericarditis in May 2006 and developed cough with right-sided chest pain.
MEDICATIONS
1. Coumadin 1 mg daily. Last INR was on Tuesday, August 14, 2007, and her INR was 2.3.
2. Amiodarone 100 mg p.o. daily.""").toDF("text")
val res = pipeline.fit(data).transform(data)
// Show results
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|chunk |posology_chunk |rxnorm_code |clinical_chunk |icd10cm_code |
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain, Coumadin, Amiodarone]|[Coumadin, Amiodarone]|[202421, 703]|[a cough, right-sided chest pain, fever, pericarditis, cough, right-sided chest pain]|[R05, R10.11, A68, I30.1, R05, R10.11]|
+-----------------------------------------------------------------------------------------------------------+----------------------+-------------+-------------------------------------------------------------------------------------+--------------------------------------+
SentenceEntityResolver
The model transforms a dataset with Input Annotation type SENTENCE_EMBEDDINGS, coming from e.g. BertSentenceEmbeddings and returns the normalized entity for a particular trained ontology / curated dataset (e.g. ICD-10, RxNorm, SNOMED etc.).
Parameters:
- `distanceFunction`: Determines how the distance between different entities will be calculated. Either `COSINE` or `EUCLIDEAN`.
- `neighbours`: The number of neighbours to consider when computing the distances.
- `caseSensitive`: Whether to consider text casing or not.
- `threshold`: Threshold of the distance between nodes to consider.
- `doExceptionHandling`: If set to `True`, the annotator tries to process as usual, and if exception-causing data (e.g. a corrupted record or document) is passed to the annotator, an exception warning is emitted with the exception message.
All the parameters can be set using the corresponding set method in camel case. For example, `.setInputCols()`.
For a list of pretrained models, please see the Models Hub.
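As an illustrative sketch, a pretrained resolver's behaviour can be adjusted with the parameters above before adding it to a pipeline (setter names follow the camel-case convention; the values shown are examples only, not recommended defaults):

from johnsnowlabs import medical

# Hypothetical tuning of a pretrained RxNorm resolver.
rxnorm_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("rxnorm_code") \
    .setDistanceFunction("EUCLIDEAN") \
    .setNeighbours(25) \
    .setCaseSensitive(False)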
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: ENTITY
Python API: SentenceEntityResolverModel | Scala API: SentenceEntityResolverModel | Notebook: SentenceEntityResolverModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
clinical_ner = medical.NerModel.pretrained("jsl_ner_wip_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["Test","Procedure"])
c2doc = nlp.Chunk2Doc()\
.setInputCols(["ner_chunk"])\
.setOutputCol("ner_chunk_doc")
sbert_embedder = nlp.BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
# Then the resolver is defined on the extracted entities and sentence embeddings
cpt_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_procedures_augmented","en", "clinical/models")\
.setInputCols(["sbert_embeddings"])\
.setOutputCol("cpt_code")\
.setDistanceFunction("EUCLIDEAN")
pipeline = nlp.Pipeline().setStages([
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
c2doc,
sbert_embedder,
cpt_resolver])
text = """She was admitted to the hospital with chest pain and found to have bilateral pleural effusion, the right greater than the left. CT scan of the chest also revealed a large mediastinal lymph node.
We reviewed the pathology obtained from the pericardectomy in March 2006, which was diagnostic of mesothelioma.
At this time, chest tube placement for drainage of the fluid occurred and thoracoscopy, which were performed, which revealed epithelioid malignant mesothelioma."""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
# Show Results
+--------------------+---------+-----+----------+--------------------+--------------------+
| chunk| entity| code|confidence| all_k_results| all_k_resolutions|
+--------------------+---------+-----+----------+--------------------+--------------------+
|CT scan of the chest| Test|62284| 0.2028|62284:::76497:::7...|Computed tomograp...|
| pericardectomy|Procedure|33031| 0.3329|33031:::33025:::3...|Pericardectomy [P...|
|chest tube placement|Procedure|39503| 0.9343|39503:::32036:::3...|Insertion of ches...|
|drainage of the f...|Procedure|49405| 0.2476|49405:::49407:::4...|Drainage procedur...|
| thoracoscopy|Procedure|32660| 0.1422|32660:::32667:::1...|Thoracoscopy [Tho...|
+--------------------+---------+-----+----------+--------------------+--------------------+
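The table above is a flattened view of the resolver output from the medical example; a sketch of how such a view can be built, assuming the metadata keys `entity`, `confidence`, `all_k_results`, and `all_k_resolutions` (taken from the column names shown) are present in the chunk and resolver annotations:

import pyspark.sql.functions as F

# Pair each NER chunk with its resolved CPT code (the two arrays are aligned 1:1)
# and pull the displayed fields out of the annotation metadata maps.
result.select(F.explode(F.arrays_zip("ner_chunk", "cpt_code")).alias("cols")) \
    .select(F.expr("cols.ner_chunk.result").alias("chunk"),
            F.expr("cols.ner_chunk.metadata['entity']").alias("entity"),
            F.expr("cols.cpt_code.result").alias("code"),
            F.expr("cols.cpt_code.metadata['confidence']").alias("confidence"),
            F.expr("cols.cpt_code.metadata['all_k_results']").alias("all_k_results"),
            F.expr("cols.cpt_code.metadata['all_k_resolutions']").alias("all_k_resolutions")) \
    .show(truncate=20)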
from johnsnowlabs import nlp, finance
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
.setInputCols("ner_chunk") \
.setOutputCol("sentence_embeddings")
resolver = finance.SentenceEntityResolverModel.pretrained("finel_edgar_company_name", "en", "finance/models")\
.setInputCols(["ner_chunk", "sentence_embeddings"]) \
.setOutputCol("normalized")\
.setDistanceFunction("EUCLIDEAN")
pipeline = nlp.Pipeline(
stages = [
documentAssembler,
embeddings,
resolver])
text = """CONTACT GOLD"""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
# Show Results
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunk |result |all_k_results |all_k_resolutions |
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CONTACT GOLD|Contact Gold Corp.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
embeddings = nlp.UniversalSentenceEncoder.pretrained("tfhub_use", "en")\
.setInputCols("ner_chunk")\
.setOutputCol("sentence_embeddings")
resolver = legal.SentenceEntityResolverModel.pretrained("legel_edgar_company_name", "en", "legal/models")\
.setInputCols(["ner_chunk", "sentence_embeddings"])\
.setOutputCol("irs_code")\
.setDistanceFunction("EUCLIDEAN")
pipeline = nlp.Pipeline(
stages = [
documentAssembler,
embeddings,
resolver])
text = """CONTACT GOLD"""
df = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(df).transform(df)
# Show Results
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunk |result |code |all_k_results |all_k_resolutions |
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CONTACT GOLD|Contact Gold Corp.|981369960:::0:::208273426:::204092640:::0:::0:::270531073:::261918920:::0:::271989147:::0:::0|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("jsl_ner_wip_clinical","en","clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence","token","ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("Test","Procedure"))
val c2doc = new Chunk2Doc()
.setInputCols(Array("ner_chunk"))
.setOutputCol("ner_chunk_doc")
val sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunk_doc"))
.setOutputCol("sbert_embeddings")
// Then the resolver is defined on the extracted entities and sentence embeddings
val cpt_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_cpt_procedures_augmented","en","clinical/models")
.setInputCols(Array("sbert_embeddings"))
.setOutputCol("cpt_code")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
c2doc,
sbert_embedder,
cpt_resolver))
val text = "She was admitted to the hospital with chest pain and found to have bilateral pleural effusion,the right greater than the left. CT scan of the chest also revealed a large mediastinal lymph node. We reviewed the pathology obtained from the pericardectomy in March 2006,which was diagnostic of mesothelioma. At this time,chest tube placement for drainage of the fluid occurred and thoracoscopy,which were performed,which revealed epithelioid malignant mesothelioma."
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df).transform(df)
// Show Results
+--------------------+---------+-----+----------+--------------------+--------------------+
| chunk| entity| code|confidence| all_k_results| all_k_resolutions|
+--------------------+---------+-----+----------+--------------------+--------------------+
|CT scan of the chest| Test|62284| 0.2028|62284:::76497:::7...|Computed tomograp...|
| pericardectomy|Procedure|33031| 0.3329|33031:::33025:::3...|Pericardectomy [P...|
|chest tube placement|Procedure|39503| 0.9343|39503:::32036:::3...|Insertion of ches...|
|drainage of the f...|Procedure|49405| 0.2476|49405:::49407:::4...|Drainage procedur...|
| thoracoscopy|Procedure|32660| 0.1422|32660:::32667:::1...|Thoracoscopy [Tho...|
+--------------------+---------+-----+----------+--------------------+--------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use","en")
.setInputCols("ner_chunk")
.setOutputCol("sentence_embeddings")
val resolver = SentenceEntityResolverModel.pretrained("finel_edgar_company_name","en","finance/models")
.setInputCols(Array("ner_chunk","sentence_embeddings"))
.setOutputCol("normalized")
.setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
embeddings,
resolver))
val text = "CONTACT GOLD"
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df).transform(df)
// Show Results
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunk |result |all_k_results |all_k_resolutions |
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CONTACT GOLD|Contact Gold Corp.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|
+------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("ner_chunk")
val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use","en")
.setInputCols("ner_chunk")
.setOutputCol("sentence_embeddings")
val resolver = SentenceEntityResolverModel.pretrained("legel_edgar_company_name","en","legal/models")
.setInputCols(Array("ner_chunk","sentence_embeddings"))
.setOutputCol("irs_code") .setDistanceFunction("EUCLIDEAN")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
embeddings,
resolver))
val text = "CONTACT GOLD"
val df = Seq(text) .toDF("text")
val result = pipeline.fit(df).transform(df)
// Show Results
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|chunk |result |code |all_k_results |all_k_resolutions |
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|CONTACT GOLD|Contact Gold Corp.|981369960:::0:::208273426:::204092640:::0:::0:::270531073:::261918920:::0:::271989147:::0:::0|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|Contact Gold Corp.:::ISHARES GOLD TRUST:::Minatura Gold:::Mexus Gold US:::BESRA GOLD INC.:::ALAMOS GOLD INC:::JOSHUA GOLD RESOURCES INC:::MIDEX GOLD CORP.:::Gold Mark Stephen:::Guskin Gold Corp.:::CMX GOLD & SILVER CORP.:::Permal Gold Ltd.|
+------------+------------------+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Trains a SentenceEntityResolverModel that maps sentence embeddings to entities in a knowledge base.
Parameters:
- `labelCol`: Column name for the value we are trying to resolve. Usually this contains the entity ID in the knowledge base (e.g., the ICD-10 code). See the training-data sketch below for the expected input format.
- `normalizedCol`: Column name for the original, normalized description.
- `aux_label_col`: Auxiliary label which maps resolved entities to additional labels.
- `useAuxLabel`: Whether to use the auxiliary column or not. Default value is `False`.
- `distanceFunction`: Determines how the distance between different entities will be calculated.
- `confidenceFunction`: What function to use to calculate confidence: either `INVERSE` or `SOFTMAX`.
- `caseSensitive`: Whether to ignore case in tokens for embeddings matching (Default: `False`).
- `threshold`: Threshold value for the last distance calculated (Default: 5.0).
- `missAsEmpty`: Whether or not to return an empty annotation on unmatched chunks (Default: `True`).
When finetuning an existing model, there are additional parameters:
- `pretrainedModelPath`: Path to an already trained SentenceEntityResolverModel. This pretrained model will be used as a starting point for training the new one. The path can be a local file path, a distributed file path (HDFS, DBFS), or cloud storage (S3).
- `overrideExistingCodes`: Whether to override the existing codes with new data while continuing the training from a pretrained model. Default value is `False` (keep all the codes).
- `dropCodesList`: A list of codes in a pretrained model that will be omitted when the training process begins with a pretrained model.
You can find pretrained Sentence Embeddings (using BERT or other architectures) in the NLP Models Hub (https://nlp.johnsnowlabs.com/models?task=Embeddings).
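The training examples below operate on a DataFrame named `data` that is not defined in the snippets; a hedged sketch of what such a frame could look like (the column names match the `setNormalizedCol` and `setLabelCol` calls used below, and the rows are illustrative only):

# Illustrative knowledge-base frame: one normalized description per target code.
data = spark.createDataFrame([
    ("Diabetes mellitus type 2", "44054006"),
    ("Essential hypertension", "59621000"),
], ["normalized_text", "label"])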
Input Annotator Types: SENTENCE_EMBEDDINGS
Output Annotator Type: ENTITY
Python API: SentenceEntityResolverApproach | Scala API: SentenceEntityResolverApproach | Notebook: SentenceEntityResolverApproachNotebook |
Show Example
from johnsnowlabs import nlp, medical
# Training a SNOMED resolution model using BERT sentence embeddings
# Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data and their labels.
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("normalized_text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
bertEmbeddings = nlp.BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased") \
.setInputCols(["sentence"]) \
.setOutputCol("bert_embeddings")
snomedTrainingPipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
bertEmbeddings
])
snomedTrainingModel = snomedTrainingPipeline.fit(data)
snomedData = snomedTrainingModel.transform(data).cache()
# Then the Resolver can be trained with
bertExtractor = medical.SentenceEntityResolverApproach() \
.setNeighbours(25) \
.setThreshold(1000) \
.setInputCols(["bert_embeddings"]) \
.setNormalizedCol("normalized_text") \
.setLabelCol("label") \
.setOutputCol("snomed_code") \
.setDistanceFunction("EUCLIDIAN") \
.setCaseSensitive(False)
snomedModel = bertExtractor.fit(snomedData)
from johnsnowlabs import nlp, finance
# Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data and their labels.
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("normalized_text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
bertEmbeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_large_cased") \
.setInputCols(["sentence"]) \
.setOutputCol("bert_embeddings")
preprocessing_pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
bertEmbeddings
])
preprocessing_model = preprocessing_pipeline.fit(data)
processed_data = preprocessing_model.transform(data).cache()
# Then the Resolver can be trained with
bertExtractor = finance.SentenceEntityResolverApproach() \
.setNeighbours(25) \
.setThreshold(1000) \
.setInputCols(["bert_embeddings"]) \
.setNormalizedCol("normalized_text") \
.setLabelCol("label") \
.setOutputCol("snomed_code") \
.setDistanceFunction("EUCLIDIAN") \
.setCaseSensitive(False)
model = bertExtractor.fit(processed_data)
from johnsnowlabs import nlp, legal
# Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data and their labels.
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("normalized_text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
bertEmbeddings = nlp.BertSentenceEmbeddings.pretrained("sent_bert_base_uncased_legal") \
.setInputCols(["sentence"]) \
.setOutputCol("bert_embeddings")
preprocessing_pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
bertEmbeddings
])
data_preprocessing_model = preprocessing_pipeline.fit(data)
processed_data = data_preprocessing_model.transform(data).cache()
# Then the Resolver can be trained with
bertExtractor = legal.SentenceEntityResolverApproach() \
.setNeighbours(25) \
.setThreshold(1000) \
.setInputCols(["bert_embeddings"]) \
.setNormalizedCol("normalized_text") \
.setLabelCol("label") \
.setOutputCol("snomed_code") \
.setDistanceFunction("EUCLIDIAN") \
.setCaseSensitive(False)
model = bertExtractor.fit(processed_data)
import spark.implicits._
// Training a SNOMED resolution model using BERT sentence embeddings
// Define pre-processing pipeline for training data. It needs to consist of columns for the normalized training data and their labels.
val documentAssembler = new DocumentAssembler()
.setInputCol("normalized_text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val bertEmbeddings = BertSentenceEmbeddings.pretrained("sent_biobert_pubmed_base_cased")
.setInputCols("sentence")
.setOutputCol("bert_embeddings")
val snomedTrainingPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
bertEmbeddings
))
val snomedTrainingModel = snomedTrainingPipeline.fit(data)
val snomedData = snomedTrainingModel.transform(data).cache()
// Then the Resolver can be trained with
val bertExtractor = new SentenceEntityResolverApproach()
.setNeighbours(25)
.setThreshold(1000)
.setInputCols("bert_embeddings")
.setNormalizedCol("normalized_text")
.setLabelCol("label")
.setOutputCol("snomed_code")
.setDistanceFunction("EUCLIDIAN")
.setCaseSensitive(false)
val snomedModel = bertExtractor.fit(snomedData)
StructuredJsonConverter
This annotator integrates with existing systems to process the outputs of pretrained pipelines, converting raw annotations into a prettified, structured JSON format that is easy to read and well suited to API consumption and streamlined data analysis workflows. With configurable schema mappings defined through `column_maps`, it accommodates diverse annotation types, including entities, assertions, resolutions, relations, summaries, deidentifications, and classifications, and aligns the output columns with the pipeline's requirements.
Parameters:
- `cleanAnnotations`: Whether to remove annotation columns, by default `False`.
- `returnRelationEntities`: Whether to return the entities in the relations or not, by default `False`.
- `outputAsStr`: Whether to output the result as a string or as a structured JSON, by default `True`.
When set to `True`, the output column will be a string:
|-- column_name: string (nullable = true)
When set to False, the output column will be a struct with the following schema:
|-- column_name: struct (nullable = true)
|-- document_identifier: string (nullable = true)
|-- document_text: array (nullable = true)
| |-- element: string (containsNull = true)
|-- entities: array (nullable = true)
| |-- element: map (containsNull = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- assertions: array (nullable = true)
| |-- element: map (containsNull = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- resolutions: array (nullable = true)
| |-- element: map (containsNull = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- relations: array (nullable = true)
| |-- element: map (containsNull = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- summaries: array (nullable = true)
| |-- element: string (containsNull = true)
|-- deidentifications: array (nullable = true)
| |-- element: map (containsNull = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- classifications: array (nullable = true)
| |-- element: map (containsNull = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
- `converterSchema` and `converterSchemaAsStr`: The schema for converting the output of the pipeline into a structured JSON format. Fields in the schema:
  - `document_identifier`: The identifier of the document. This column must be of type `StringType`.
  - `document_text`: The text of the document, typically created by the `DocumentAssembler` annotator.
  - `entities`: Chunk columns generated by various annotators, such as the `ChunkMergeModel` annotator.
  - `assertions`: Assertion columns produced by annotators like the `AssertionDLModel` annotator.
  - `resolutions`: The schema for resolutions. See `ResolutionSchema` for details.
  - `relations`: Relation columns created by annotators such as the `RelationExtractionModel` annotator.
  - `summaries`: Summary columns generated by annotators like the `MedicalSummarizer` annotator.
  - `deidentifications`: The schema for deidentifications.
  - `classifications`: The schema for classifications.
Input Annotator Types: ANY
Output Annotator Type: JSON
Python API: StructuredJsonConverter | Scala API: StructuredJsonConverter | Notebook: StructuredJsonConverter |
Show Example
from johnsnowlabs import nlp, medical
from sparknlp_jsl.pipeline_tracer import PipelineTracer
oncology_pipeline = nlp.PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")
text = """The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response"""
data = spark.createDataFrame([[text]]).toDF("text")
result_df = oncology_pipeline.transform(data)
pipeline_tracer = PipelineTracer(oncology_pipeline)
column_maps = pipeline_tracer.createParserDictionary()
output_converter = medical.StructuredJsonConverter()\
.setOutputCol("result")\
.setConverterSchema(column_maps)\
.setCleanAnnotations(False)\
.setReturnRelationEntities(True)\
.setOutputAsStr(True)
json_output = output_converter.transform(result_df).select("result")
json_output.show(truncate=200)
# result
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"result":{"document_identifier":"fc43295f-cbbe-4ca9-b842-abb1c2fc017e","document_text":["The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ova...|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
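Because `setOutputAsStr(True)` is used above, each row of the `result` column holds a JSON string; a small sketch of turning it back into a Python dictionary (the nested field names follow the printed output and the schema listed above):

import json

# Parse the prettified JSON string of the first row into a dictionary.
row = json_output.first()
parsed = json.loads(row["result"])
print(parsed["result"]["document_identifier"])
print(parsed["result"]["entities"][:3])  # first few entity dictionaries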
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.util.tracer.PipelineTracer
val oncology_pipeline = PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")
val text = """The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ovarian mass. A Pap smear performed one month later was positive for atypical glandular cells suspicious for adenocarcinoma. The pathologic specimen showed extension of the tumor throughout the fallopian tubes, appendix, omentum, and 5 out of 5 enlarged lymph nodes. The final pathologic diagnosis of the tumor was stage IIIC papillary serous ovarian adenocarcinoma. Two months later, the patient was diagnosed with lung metastases.Neoadjuvant chemotherapy with the regimens of Cyclophosphamide (500 mg/m2) is being given for 6 cycles with poor response"""
val data = Seq(text).toDF("text")
val result_df = oncology_pipeline.transform(data)
val pipeline_tracer = new PipelineTracer(oncology_pipeline)
val column_maps = pipeline_tracer.createParserDictionary()
val output_converter = new StructuredJsonConverter()
.setOutputCol("result")
.setConverterSchemaAsStr(column_maps)
.setCleanAnnotations(false)
.setReturnRelationEntities(true)
.setOutputAsStr(true)
val json_output = output_converter.transform(result_df).select("result")
json_output.show()
// result
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"result":{"document_identifier":"fc43295f-cbbe-4ca9-b842-abb1c2fc017e","document_text":["The Patient underwent a computed tomography (CT) scan of the abdomen and pelvis, which showed a complex ova...|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Summarizer
The Summarizer annotator uses a generative deep learning model to create summaries of medical, finance, and legal texts, helping to quickly summarize complex information from such documents.
Parameters:
- `doSample`: Whether or not to use sampling; use greedy decoding otherwise (Default: `false`).
- `ignoreTokenIds`: A list of token ids which are ignored in the decoder's output (Default: `Array()`).
- `maxNewTokens`: Maximum number of new tokens to be generated (Default: 30).
- `maxTextLength`: Maximum length of context text.
- `noRepeatNgramSize`: If set to int > 0, all ngrams of that size can only occur once (Default: 0).
- `randomSeed`: Optional random seed for the model.
- `refineChunkSize`: How large refined chunks should be.
- `refineMaxAttempts`: How many times chunks should be re-summarized while they are above `refineSummaryTargetLength` before stopping.
- `refineSummary`: Set to `true` to perform refined summarization at increased computation cost (see the sketch after this list).
- `refineSummaryTargetLength`: Target length for the refined summary.
- `topK`: The number of highest probability vocabulary tokens to keep for top-k filtering (Default: 50).
- `useCache`: Whether to cache the internal state of the model to improve performance.
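A minimal sketch of enabling refined summarization, as referenced in the parameter list above (setter names mirror the parameters via the camel-case convention; the chunk size, target length, and attempt count are arbitrary example values):

from johnsnowlabs import medical

# Hypothetical refined-summarization configuration: long inputs are chunked,
# summarized, and re-summarized until the target length or the attempt limit is reached.
summarizer = medical.Summarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("summary") \
    .setRefineSummary(True) \
    .setRefineChunkSize(512) \
    .setRefineSummaryTargetLength(300) \
    .setRefineMaxAttempts(3)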
Available models can be found at the Models Hub.
For more extended examples on document pre-processing see the Spark NLP Workshop
Input Annotator Types: DOCUMENT
Output Annotator Type: CHUNK
Python API: MedicalSummarizer | Scala API: MedicalSummarizer |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
summarizer = medical.Summarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("summary")\
.setMaxTextLength(512)\
.setMaxNewTokens(512)
pipeline = nlp.Pipeline(
stages=[
document_assembler,
summarizer
])
text = """The patient is a pleasant 17-year-old gentleman who was playing basketball today in gym. Two hours prior to presentation, he started to fall and someone stepped on his ankle and kind of twisted his right ankle and he cannot bear weight on it now. It hurts to move or bear weight. No other injuries noted. He does not think he has had injuries to his ankle in the past.
SOCIAL HISTORY: He does not drink or smoke.
MEDICAL DECISION MAKING:
He had an x-ray of his ankle that showed a small ossicle versus avulsion fracture of the talonavicular joint on the lateral view. He has had no pain over the metatarsals themselves. This may be a fracture based upon his exam. He does want to have me to put him in a splint. He was given Motrin here. He will be discharged home to follow up with Dr. X from Orthopedics.
DISPOSITION: Crutches and splint were administered here. I gave him a prescription for Motrin and some Darvocet if he needs to length his sleep and if he has continued pain to follow up with Dr. X. Return if any worsening problems."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("summary.result").show(truncate=False)
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[A 17-year-old man fell and twisted his right ankle, causing pain to move or bear weight. An x-ray showed a small ossicle or avulsion fracture of the talonavicular joint on the lateral view, which may be a fracture based upon his exam. He was given Motrin and discharged home with crutches and a prescription for Motrin and Darvocet. He was advised to follow up with his doctor if pain worsens and return if any worsening problems worsen.]|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
summarizer = finance.Summarizer().pretrained('finsum_flant5_base','en','finance/models')\
.setInputCols(["document"])\
.setOutputCol("summary")\
.setMaxNewTokens(1000)
pipeline = nlp.Pipeline(stages=[document_assembler, summarizer])
data = spark.createDataFrame([["""Lost Time Incident Rate:
The lost time incident rate per 200,000 hours worked in 2021 was 0.14, which decreased by 17.6% compared to 2020 (0.17) and decreased by 70.8% compared to 2019 (0.48). The decrease in the lost time incident rate can be attributed to the company's efforts to improve workplace safety and implement effective risk management strategies.
The total Scope 2 GHG emissions in 2021 were 688,228 tonnes, which remained relatively stable compared to 2020. The company's efforts to transition to renewable energy sources have helped to minimize Scope 2 GHG emissions."""]]).toDF('text')
result = pipeline.fit(data).transform(data)
result.select("summary.result").show(truncate=False)
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
summarizer = legal.Summarizer().pretrained('legsum_flant5_legal_augmented','en','legal/models')\
.setInputCols(["document"])\
.setOutputCol("summary")\
.setMaxNewTokens(1000)
pipeline = nlp.Pipeline(stages=[document_assembler, summarizer])
data = spark.createDataFrame([
["""NOW, THEREFORE, in consideration of the Company’s disclosure of information to the Recipient
and the promises set forth below, the parties agree as follows:
1. Confidential Information. “Confidential Information” as used in this
Agreement means all information relating to the Company disclosed to the Recipient by the Company,
including without limitation any business, technical, marketing, financial or other information,
whether in written, electronic or oral form. Any and all reproductions, copies, notes, summaries,
reports, analyses or other material derived by the Recipient or its Representatives (as defined
below) in whole or in part from the Confidential Information in whatever form maintained shall be
considered part of the Confidential Information itself and shall be treated as such. Confidential
Information does not include information that (a) is or becomes part of the public domain other
than as a result of disclosure by the Recipient or its Representatives; (b) becomes available to
the Recipient on a nonconfidential basis from a source other than the Company, provided that source
is not bound with respect to that information by a confidentiality agreement with the Company or is
otherwise prohibited from transmitting that information by a contractual, legal or other
obligation; (c) can be proven by the Recipient to have been in the Recipient’s possession prior to
disclosure of the same by the Company; or (d) is independently developed by the Recipient without
reference to or reliance on any of the Company’s Confidential Information."""]
]).toDF('text')
result = pipeline.fit(data).transform(data)
result.select("summary.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[This legal agreement states that the company has disclosed all information relating to the company to the recipient, including any business, technical, marketing, financial or other information. It also states that any reproductions, copies, notes, summaries, reports, analyses or other material derived from the confidential information must be treated as part of the confidential information. The confidential information does not include information that is or becomes part of the public domain other than as a result of disclosure by the recipient or its representatives, becomes available to the recipient on a nonconfidential basis from a source other than the company, can be proven by the recipient to have been in the recipient’s possession prior to disclosure, or is independently developed by the recipient without reference to or reliance on any of the company’s confidential information.]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val summarizer = Summarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("summary")
.setMaxTextLength(512)
.setMaxNewTokens(512)
val pipeline = new Pipeline().setStages(Array(documentAssembler, summarizer))
val text = """The patient is a pleasant 17-year-old gentleman who was playing basketball today in gym. Two hours prior to presentation, he started to fall and someone stepped on his ankle and kind of twisted his right ankle and he cannot bear weight on it now. It hurts to move or bear weight. No other injuries noted. He does not think he has had injuries to his ankle in the past.
SOCIAL HISTORY: He does not drink or smoke.
MEDICAL DECISION MAKING:
He had an x-ray of his ankle that showed a small ossicle versus avulsion fracture of the talonavicular joint on the lateral view. He has had no pain over the metatarsals themselves. This may be a fracture based upon his exam. He does want to have me to put him in a splint. He was given Motrin here. He will be discharged home to follow up with Dr. X from Orthopedics.
DISPOSITION: Crutches and splint were administered here. I gave him a prescription for Motrin and some Darvocet if he needs to length his sleep and if he has continued pain to follow up with Dr. X. Return if any worsening problems."""
val data = Seq(text).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val summarizer = Summarizer.pretrained("finsum_flant5_base", "en", "finance/models")
.setInputCols(Array("document"))
.setOutputCol("summary")
.setMaxNewTokens(1000)
val pipeline = new Pipeline().setStages(Array(documentAssembler, summarizer))
val text = """Lost Time Incident Rate:
The lost time incident rate per 200,000 hours worked in 2021 was 0.14, which decreased by 17.6% compared to 2020 (0.17) and decreased by 70.8% compared to 2019 (0.48). The decrease in the lost time incident rate can be attributed to the company's efforts to improve workplace safety and implement effective risk management strategies.
The total Scope 2 GHG emissions in 2021 were 688,228 tonnes, which remained relatively stable compared to 2020. The company's efforts to transition to renewable energy sources have helped to minimize Scope 2 GHG emissions."""
val data = Seq(text).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val summarizer = Summarizer.pretrained("legsum_flant5_legal_augmented", "en", "legal/models")
.setInputCols(Array("document"))
.setOutputCol("summary")
.setMaxNewTokens(1000)
val pipeline = new Pipeline()
.setStages(Array(documentAssembler, summarizer))
val text = """NOW, THEREFORE, in consideration of the Company’s disclosure of information to the Recipient
and the promises set forth below, the parties agree as follows:
1. Confidential Information. “Confidential Information” as used in this
Agreement means all information relating to the Company disclosed to the Recipient by the Company,
including without limitation any business, technical, marketing, financial or other information,
whether in written, electronic or oral form. Any and all reproductions, copies, notes, summaries,
reports, analyses or other material derived by the Recipient or its Representatives (as defined
below) in whole or in part from the Confidential Information in whatever form maintained shall be
considered part of the Confidential Information itself and shall be treated as such. Confidential
Information does not include information that (a) is or becomes part of the public domain other
than as a result of disclosure by the Recipient or its Representatives; (b) becomes available to
the Recipient on a nonconfidential basis from a source other than the Company, provided that source
is not bound with respect to that information by a confidentiality agreement with the Company or is
otherwise prohibited from transmitting that information by a contractual, legal or other
obligation; (c) can be proven by the Recipient to have been in the Recipient’s possession prior to
disclosure of the same by the Company; or (d) is independently developed by the Recipient without
reference to or reliance on any of the Company’s Confidential Information."""
val data = Seq(text).toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[This legal agreement states that the company has disclosed all information relating to the company to the recipient, including any business, technical, marketing, financial or other information. It also states that any reproductions, copies, notes, summaries, reports, analyses or other material derived from the confidential information must be treated as part of the confidential information. The confidential information does not include information that is or becomes part of the public domain other than as a result of disclosure by the recipient or its representatives, becomes available to the recipient on a nonconfidential basis from a source other than the company, can be proven by the recipient to have been in the recipient’s possession prior to disclosure, or is independently developed by the recipient without reference to or reliance on any of the company’s confidential information.]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
TFGraphBuilder
TFGraphBuilder
annotator can be used to create graphs in the model training pipeline. TFGraphBuilder
inspects the data and creates the proper graph if a suitable version of TensorFlow (>= 2.7 ) is available. The graph is stored in the defined folder and loaded by the approach.
You can use this builder with MedicalNerApproach
, FinanceNerApproach
, LegalNerApproach
, RelationExtractionApproach
, AssertionDLApproach
, and GenericClassifierApproach
.
ATTENTION: Changing the parameters of TFGraphBuilder
may affect the performance of the model you want to train.
Input Annotator Types: The setInputCols parameter changes depending on the setModelName parameter.
Output Annotator Type: There is no output annotation; the setGraphFile function creates a file with a .pb extension and saves it to the configured location.
Python API: TFGraphBuilder |
Show Example
graph_folder = "./medical_graphs"
ner_graph_builder = medical.TFGraphBuilder()\
.setModelName("ner_dl")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setLabelColumn("label")\
.setGraphFile("auto")\
.setHiddenUnitsNumber(20)\
.setGraphFolder(graph_folder)\
.setIsLicensed(True) # False -> for NerDLApproach
graph_folder = "./finance_graphs"
ner_graph_builder = finance.TFGraphBuilder()\
.setModelName("ner_dl")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setLabelColumn("label")\
.setGraphFile("auto")\
.setHiddenUnitsNumber(20)\
.setGraphFolder(graph_folder)\
.setIsLicensed(True) # False -> for NerDLApproach
graph_folder = "./legal_graphs"
ner_graph_builder = legal.TFGraphBuilder()\
.setModelName("ner_dl")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setLabelColumn("label")\
.setGraphFile("auto")\
.setHiddenUnitsNumber(20)\
.setGraphFolder(graph_folder)\
.setIsLicensed(True) # False -> for NerDLApproach
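Once the graph has been generated, the training approach only needs to point to the same folder. The snippet below is a minimal sketch, not part of the original page: it assumes a medical NER training setup in which medical.NerApproach picks up the .pb graph created by the TFGraphBuilder above; column names and training parameters are illustrative.
from johnsnowlabs import nlp, medical
graph_folder = "./medical_graphs"  # same folder used by the TFGraphBuilder above
# Sketch: the NER trainer looks for a compatible .pb graph inside the graph folder.
ner_approach = medical.NerApproach()\
.setInputCols(["sentence", "token", "embeddings"])\
.setLabelColumn("label")\
.setOutputCol("ner")\
.setMaxEpochs(10)\
.setBatchSize(8)\
.setGraphFolder(graph_folder)
# Typically both stages go into the same training pipeline, e.g.
# nlp.Pipeline(stages=[..., ner_graph_builder, ner_approach])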
TextGenerator
The Medical, Financial, and Legal Text Generators are specialized tools designed for text abstraction in their respective fields. The MedicalTextGenerator
, based on the BioGPT model, excels in medical text abstraction, allowing users to provide prompts and contexts for tasks like disease explanation, paraphrasing medical context, or creating clinical notes for cancer patients. This model is adept at extracting relevant information due to its training on extensive medical data.
Similarly, the Financial and Legal Text Generators utilize the Flan-T5 model, an advanced version of the T5 model, for tasks in financial and legal text abstraction. Users can input prompts and contexts to receive high-quality summaries, document abstractions, and other text-based outputs. The Flan-T5 model’s training on a diverse range of texts ensures the generation of coherent and accurate content in these domains.
Parameters:
- maxNewTokens: Maximum number of new tokens to generate, by default 30
- maxContextLength: Maximum length of context text
- configProtoBytes: ConfigProto from tensorflow, serialized into byte array.
- doSample: Whether or not to use sampling; use greedy decoding otherwise, by default False
- topK: The number of highest probability vocabulary tokens to consider, by default 1
- noRepeatNgramSize: The number of tokens that can't be repeated in the same order. Useful for preventing loops. The default is 0.
- ignoreTokenIds: A list of token ids which are ignored in the decoder's output, by default []
- randomSeed: Set to a positive integer to get reproducible results, by default None.
- customPrompt: The only available variable is {DOCUMENT} and it is populated with the contents of the input document
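As a quick illustration of how these decoding parameters interact, here is a configuration sketch. It is not taken from the original page: the values are illustrative, and setCustomPrompt is assumed to be the camel-case setter of the customPrompt parameter listed above.
from johnsnowlabs import medical
# Configuration sketch only; parameter values are illustrative.
generator = (
    medical.TextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")
    .setInputCols(["document_prompt"])
    .setOutputCol("answer")
    .setMaxNewTokens(128)       # upper bound on the number of generated tokens
    .setDoSample(True)          # sample instead of greedy decoding
    .setTopK(3)                 # sample only among the 3 most likely tokens
    .setRandomSeed(42)          # make the sampled output reproducible
    .setNoRepeatNgramSize(3)    # block repeated 3-grams to avoid loops
    .setCustomPrompt("Summarize the following note: {DOCUMENT}")  # assumed setter for customPrompt
)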
Available models can be found at the Models Hub.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Input Annotator Types: DOCUMENT
Output Annotator Type: CHUNK
Python API: MedicalTextGenerator | Scala API: MedicalTextGenerator |
Show Example
from johnsnowlabs import nlp, medical
document_assembler = nlp.DocumentAssembler()\
.setInputCol("prompt")\
.setOutputCol("document_prompt")
med_text_generator = medical.TextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")\
.setInputCols("document_prompt")\
.setOutputCol("answer")\
.setMaxNewTokens(256)\
.setDoSample(True)\
.setTopK(3)\
.setRandomSeed(42)\
.setStopAtEos(True)
pipeline = nlp.Pipeline(stages=[document_assembler, med_text_generator])
data = spark.createDataFrame([['Covid 19 is']]).toDF("prompt")
result = pipeline.fit(data).transform(data)
result.select("answer.result").show(truncate=False)
+--------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------+
|[Covid 19 is a pandemic that has affected the world's economy and health.]|
+--------------------------------------------------------------------------+
from johnsnowlabs import nlp, finance
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("prompt")
flant5 = finance.TextGenerator.pretrained("fingen_flant5_base","en","finance/models")\
.setInputCols(["prompt"])\
.setOutputCol("answer")\
.setMaxNewTokens(150)\
.setStopAtEos(True)
pipeline = nlp.Pipeline(stages=[document_assembler, flant5])
data = spark.createDataFrame([["Explain what is Sec 10-k filing"]]).toDF('text')
result = pipeline.fit(data).transform(data)
result.select("answer.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------+
|[Sec 10k filing is a form of tax filing that requires a party to file jointly or several entities for tax purposes.]|
+--------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("prompt")
flant5 = legal.TextGenerator.pretrained("leggen_flant5_finetuned","en","legal/models")\
.setInputCols(["prompt"])\
.setOutputCol("answer")\
.setMaxNewTokens(200)\
.setTopK(3)\
.setRandomSeed(42)\
.setNoRepeatNgramSize(3)\
.setStopAtEos(True)
pipeline = nlp.Pipeline(stages=[document_assembler, flant5])
data = spark.createDataFrame([["This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("answer.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission. The redacted material is confidential |
| and will not be disclosed to any third party without the prior written consent of the parties. |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("prompt")
.setOutputCol("document_prompt")
val medTextGenerator = TextGenerator.pretrained("text_generator_biomedical_biogpt_base", "en", "clinical/models")
.setInputCols(Array("document_prompt"))
.setOutputCol("answer")
.setMaxNewTokens(256)
.setDoSample(true)
.setTopK(3)
.setRandomSeed(42)
.setStopAtEos(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, medTextGenerator))
val data = Seq("Covid 19 is").toDS.toDF("prompt")
val result = pipeline.fit(data).transform(data)
+--------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------+
|[Covid 19 is a pandemic that has affected the world's economy and health.]|
+--------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("prompt")
val flanT5 = TextGenerator.pretrained("fingen_flant5_base", "en", "finance/models")
.setInputCols(Array("prompt"))
.setOutputCol("answer")
.setMaxNewTokens(150)
.setStopAtEos(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, flanT5))
val data = Seq("Explain what is Sec 10-k filing").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
+--------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------+
|[Sec 10k filing is a form of tax filing that requires a party to file jointly or several entities for tax purposes.]|
+--------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("prompt")
val flanT5 = TextGenerator.pretrained("leggen_flant5_finetuned", "en", "legal/models")
.setInputCols(Array("prompt"))
.setOutputCol("answer")
.setMaxNewTokens(200)
.setTopK(3)
.setRandomSeed(42)
.setNoRepeatNgramSize(3)
.setStopAtEos(true)
val pipeline = new Pipeline().setStages(Array(documentAssembler, flanT5))
val data = Seq("This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission.").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| This exhibit has been redacted and is the subject of a confidential treatment request. Redacted material is marked with [* * *] and has been filed separately with the securities and exchange commission. The redacted material is confidential |
| and will not be disclosed to any third party without the prior written consent of the parties. |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
TextMatcherInternal
This annotator matches exact phrases provided in a file against a Document.
Parameters:
- setEntities (str): Sets the external resource for the entities: path (str) is the path to the external resource; read_as (str, optional) sets how to read the resource, by default ReadAs.TEXT; options (dict, optional) sets the options for reading the resource, by default {"format": "text"}.
- setCaseSensitive (Boolean): Sets whether to match regardless of case. (Default: True)
- setMergeOverlapping (Boolean): Sets whether to merge overlapping matched chunks. (Default: False)
- setEntityValue (str): Sets the value for the entity metadata field. If an entity value is not set in the file, this value is used for the entity.
- setBuildFromTokens (Boolean): Sets whether the TextMatcherInternal should build the CHUNK from TOKEN annotations.
- setDelimiter (str): Sets the value for the delimiter between Phrase and Entity.
See Spark NLP Workshop for more examples of usage.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: TextMatcherInternal | Scala API: TextMatcherInternal |
Show Example
from johnsnowlabs import nlp, medical
matcher_drug = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
"""
with open ('matcher_drug.csv', 'w') as f:
f.write(matcher_drug)
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
entityExtractor = medical.TextMatcherInternal()\
.setInputCols(["document", "token"])\
.setEntities("matcher_drug.csv")\
.setOutputCol("matched_text")\
.setCaseSensitive(False)\
.setDelimiter("#")\
.setMergeOverlapping(False)
matcher_pipeline = nlp.Pipeline().setStages([
documentAssembler,
tokenizer,
entityExtractor])
data = spark.createDataFrame([["John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."]]).toDF("text")
matcher_model = matcher_pipeline.fit(data)
result = matcher_model.transform(data)
# result
+-------------+-----+---+-----+
| chunk|begin|end|label|
+-------------+-----+---+-----+
| aspirin| 25| 31| Drug|
|aspirin 100mg| 25| 37| Drug|
| paracetamol| 75| 85| Drug|
| amoxicillin| 102|112| Drug|
| ibuprofen| 134|142| Drug|
| lansoprazole| 170|181| Drug|
+-------------+-----+---+-----+
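The chunk/begin/end/label view above can be produced from the matched_text column with a small Spark select. The following is a sketch; in particular, it assumes the entity label is stored under the entity key of the annotation metadata.
from pyspark.sql import functions as F
# Flatten the matched_text annotations into one row per matched chunk.
result.select(F.explode("matched_text").alias("m"))\
.select(F.expr("m.result").alias("chunk"),
        F.expr("m.begin").alias("begin"),
        F.expr("m.end").alias("end"),
        F.expr("m.metadata['entity']").alias("label"))\
.show(truncate=False)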
import spark.implicits._
//matcher_drug = """
//Aspirin 100mg#Drug
//aspirin#Drug
//paracetamol#Drug
//amoxicillin#Drug
//ibuprofen#Drug
//lansoprazole#Drug
//"""
//
//with open ('matcher_drug.csv', 'w') as f:
// f.write(matcher_drug)
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val entityExtractor = new TextMatcherInternal()
.setInputCols(Array("document", "token"))
.setOutputCol("matched_text")
.setEntities("matcher_drug.csv")
.setCaseSensitive(false)
.setDelimiter("#")
.setMergeOverlapping(false)
val matcherPipeline = new Pipeline()
.setStages(Array(documentAssembler,
tokenizer,
entityExtractor))
val data = Seq("John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD.")
.toDF("text")
val matcherModel = matcherPipeline.fit(data)
val result = matcherModel.transform(data)
// result
+-------------+-----+---+-----+
| chunk|begin|end|label|
+-------------+-----+---+-----+
| aspirin| 25| 31| Drug|
|aspirin 100mg| 25| 37| Drug|
| paracetamol| 75| 85| Drug|
| amoxicillin| 102|112| Drug|
| ibuprofen| 134|142| Drug|
| lansoprazole| 170|181| Drug|
+-------------+-----+---+-----+
VectorDBPostProcessor
VectorDBPostProcessor is used to filter and sort the annotations from the sparknlp_jsl.annotator.resolution.VectorDBModel annotator.
Parameters:
- filterBy: The filterBy parameter is used to select and prioritize filter options.
- sortBy: The sortBy parameter is used to select the sorting option. Options: ascending, descending, lost_in_the_middle, diversity. ascending: sort by ascending order of distance. descending: sort by descending order of distance. lost_in_the_middle: sort by the lost-in-the-middle ranker; for example, 5 annotations with distances [1, 2, 3, 4, 5] are reordered as [1, 3, 5, 4, 2]. diversity: sort by the diversity ranker; the annotations are sorted by distance, the first annotation is selected, and each following annotation is the one with the maximum average distance from the already selected annotations. Default: ascending
- caseSensitive: Whether the criteria of the string operators are case sensitive or not. For example, if set to False, the operator "equals" will match "John" with "john". Default: False
- diversityThreshold: The diversityThreshold parameter sets the threshold for the diversityByThreshold filter, which selects annotations by the distance between the sorted annotations. diversityThreshold must be greater than 0. Default: 0.01
- maxTopKAfterFiltering: Whether to allow zero annotations after filtering. If set to True, the output may contain zero annotations when all annotations are filtered out. If set to False, the output will try to contain at least one annotation. Default: False
- metadataCriteria: The metadataCriteria parameter is used to filter the annotations by metadata fields.
See Spark NLP Workshop for more examples of usage.
Input Annotator Types: VECTOR_SIMILARITY_RANKINGS
Output Annotator Type: VECTOR_SIMILARITY_RANKINGS
Python API: VectorDBPostProcessor | Scala API: VectorDBPostProcessor |
Show Example
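The original page does not include a runnable example for this annotator; the snippet below is only a minimal configuration sketch derived from the parameters listed above. The medical.VectorDBPostProcessor name, the camel-case setters, and the input/output column names are assumptions, not taken from this page.
from johnsnowlabs import medical
# Configuration sketch only; it assumes a previous stage produced
# VECTOR_SIMILARITY_RANKINGS annotations in a column named "vector_db".
post_processor = medical.VectorDBPostProcessor()\
.setInputCols(["vector_db"])\
.setOutputCol("post_processed")\
.setSortBy("lost_in_the_middle")\
.setCaseSensitive(False)\
.setDiversityThreshold(0.01)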
WindowedSentenceModel
This annotator helps you merge the previous and following sentences of a given piece of text, so that each sentence carries its surrounding context. This is especially useful for context-rich analyses that require a deeper understanding of the language being used.
Inferring the class of sentence X alone can sometimes be a much harder task, due to the lack of context, than inferring the class of sentence X-1 + sentence X + sentence X+1. In that case the window is 1, which is why each sentence is augmented with one neighbour from behind and one from ahead. The window size can be configured so that each sentence gets as many previous and following sentences as context as the window size specifies.
Parameters:
- setWindowSize: Sets size of the sliding window.
- setGlueString: Sets string to use to join the neighboring elements together (see the sketch below).
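setGlueString is not shown in the examples below; as a small sketch (not from the original page), it can be set alongside the window size to change the separator placed between the merged sentences:
from johnsnowlabs import medical
# Sketch only; column names follow the examples below.
windowed = medical.WindowedSentenceModel()\
.setInputCols(["sentence"])\
.setOutputCol("windowed")\
.setWindowSize(2)\
.setGlueString(" | ")  # join neighbouring sentences with " | "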
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: WindowedSentenceModel | Scala API: WindowedSentenceModel |
Show Example
from johnsnowlabs import medical, nlp
from pyspark.sql import functions as F
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols("document")\
.setOutputCol("sentence")
windowedSentence1 = medical.WindowedSentenceModel()\
.setWindowSize(1)\
.setInputCols("sentence")\
.setOutputCol("window_1")
windowedSentence2 = medical.WindowedSentenceModel()\
.setWindowSize(2)\
.setInputCols("sentence")\
.setOutputCol("window_2")
pipeline = nlp.Pipeline(stages=[
documentAssembler,
sentenceDetector,
windowedSentence1,
windowedSentence2
])
sample_text = """The patient was admitted on Monday.
She has a right-sided pleural effusion for thoracentesis.
Her Coumadin was placed on hold.
A repeat echocardiogram was checked.
She was started on prophylaxis for DVT.
Her CT scan from March 2006 prior to her pericardectomy.
It already shows bilateral plural effusions."""
data = spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Example results
result.select(F.explode('window_1')).select('col.result').show(truncate=False)
+---------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. |
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.|
|Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions. |
+---------------------------------------------------------------------------------------------------------------------------------------------+
result.select(F.explode('window_2')).select('col.result').show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy.|
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions. |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions. |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions. |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F
doc_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("isolated_sentence")
context_window = legal.WindowedSentenceModel()\
.setInputCols(["isolated_sentence"])\
.setOutputCol("window")\
.setWindowSize(1)
window_splitting_pipeline = nlp.Pipeline(stages=[doc_assembler, sentence_detector, context_window])
window_splitting_model = window_splitting_pipeline.fit(df)  # df: a DataFrame with the agreement text in a "text" column
window_splitting_lp = nlp.LightPipeline(window_splitting_model)
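The result list shown below can then be obtained by annotating the raw agreement text with the light pipeline; a sketch, assuming the document is held in a Python string named text:
# text: the raw NDA document as a plain Python string (assumed to be defined earlier)
light_result = window_splitting_lp.annotate(text)
light_result["window"]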
## Result
['1 \nMUTUAL NONDISCLOSURE AGREEMENT \nThis Mutual Nondisclosure Agreement (the “Agreement”) is made on _________ (“Effective \nDate”) by and between: \n(1) John Snow Labs, a Delaware corporation, registered at 16192 Coastal Highway, \nLewes, Delaware 19958 (“John Snow Labs”), and \n(2) Achiles, S.L, a Spanish corporation, registered at Gran Via, 2º floor, Offices 9\nand 10.(“Company”), \n(each a “party” and together the “parties”). Recitals: \nJohn Snow Labs and Company intend to explore the possibility of a business relationship \nbetween each other, whereby each party (“Discloser”) may disclose sensitive information to the \nother party (“Recipient”).',
'1 \nMUTUAL NONDISCLOSURE AGREEMENT \nThis Mutual Nondisclosure Agreement (the “Agreement”) is made on _________ (“Effective \nDate”) by and between: \n(1) John Snow Labs, a Delaware corporation, registered at 16192 Coastal Highway, \nLewes, Delaware 19958 (“John Snow Labs”), and \n(2) Achiles, S.L, a Spanish corporation, registered at Gran Via, 2º floor, Offices 9\nand 10.(“Company”), \n(each a “party” and together the “parties”). Recitals: \nJohn Snow Labs and Company intend to explore the possibility of a business relationship \nbetween each other, whereby each party (“Discloser”) may disclose sensitive information to the \nother party (“Recipient”). The parties agree as follows:',
'Recitals: \nJohn Snow Labs and Company intend to explore the possibility of a business relationship \nbetween each other, whereby each party (“Discloser”) may disclose sensitive information to the \nother party (“Recipient”). The parties agree as follows: 1. Definition.',]
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val windowedSentence1 = new WindowedSentenceModel()
.setWindowSize(1)
.setInputCols("sentence")
.setOutputCol("window_1")
val windowedSentence2 = new WindowedSentenceModel()
.setWindowSize(2)
.setInputCols("sentence")
.setOutputCol("window_2")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
windowedSentence1,
windowedSentence2
))
val testDataset = Seq("""The patient was admitted on Monday.
She has a right-sided pleural effusion for thoracentesis.
Her Coumadin was placed on hold.
A repeat echocardiogram was checked.
She was started on prophylaxis for DVT.
Her CT scan from March 2006 prior to her pericardectomy.
It already shows bilateral plural effusions.""").toDF("text")
val result = pipeline.fit(testDataset).transform(testDataset)
// Result
// window 1
+---------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+---------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. |
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions.|
|Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions. |
+---------------------------------------------------------------------------------------------------------------------------------------------+
// window 2
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. |
|The patient was admitted on Monday. She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. |
|She has a right-sided pleural effusion for thoracentesis. Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy.|
|Her Coumadin was placed on hold. A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions. |
|A repeat echocardiogram was checked. She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions. |
|She was started on prophylaxis for DVT. Her CT scan from March 2006 prior to her pericardectomy. It already shows bilateral plural effusions. |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
val doc_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("isolated_sentence")
val context_window = new WindowedSentenceModel()
.setInputCols("isolated_sentence")
.setOutputCol("window")
.setWindowSize(1)
val pipeline = new Pipeline().setStages(Array(
doc_assembler,
sentence_detector,
context_window))
val window_splitting_model = pipeline.fit(df) // df: a DataFrame with the document text in a "text" column
val window_splitting_lp = new LightPipeline(window_splitting_model)
ZeroShotNerModel
This is a zero-shot named entity recognition annotator based on RoBertaForQuestionAnswering
. Zero-shot models excel at generalization, meaning that the model can accurately predict entities in very different data sets without the need to fine-tune the model or train from scratch for each different domain.
Even though a model trained to solve a specific problem can achieve better accuracy than a zero-shot model on that specific task, it probably won't be useful on a different task. That is where zero-shot models show their usefulness, being able to achieve good results in many different scenarios.
Parameters:
- entityDefinitions: A dictionary with definitions of the named entities. The keys of the dictionary are the entity types and the values are lists of hypothesis templates.
- predictionThreshold: Minimal confidence score to consider the entity (Default: 0.01)
- ignoreEntities: A list of entities to be discarded from the output.
All the parameters can be set using the corresponding set method in camel case. For example, .setPredictionThreshold().
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: NAMED_ENTITY
Python API: ZeroShotNerModel | Scala API: ZeroShotNerModel | Notebook: ZeroShotNerModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
from pyspark.sql import functions as F
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
zero_shot_ner = medical.ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")\
.setEntityDefinitions(
{
"PROBLEM": ["What is the disease?", "What is his symptom?", "What is her disease?", "What is his disease?",
"What is the problem?" ,"What does a patient suffer", 'What was the reason that the patient is admitted to the clinic?'],
"DRUG": ["Which drug?", "Which is the drug?", "What is the drug?", "Which drug does he use?", "Which drug does she use?", "Which drug do I use?", "Which drug is prescribed for a symptom?"],
"ADMISSION_DATE": ["When did patient admitted to a clinic?"],
"PATIENT_AGE": ["How old is the patient?","What is the gae of the patient?"]
})\
.setInputCols(["sentence", "token"])\
.setOutputCol("zero_shot_ner")\
.setPredictionThreshold(0.1) # default 0.01
ner_converter = medical.NerConverterInternal()\
.setInputCols(["sentence", "token", "zero_shot_ner"])\
.setOutputCol("ner_chunk")\
pipeline = nlp.Pipeline(stages = [
documentAssembler,
sentenceDetector,
tokenizer,
zero_shot_ner,
ner_converter])
text_list = ["The doctor pescribed Majezik for my severe headache.",
"The patient was admitted to the hospital for his colon cancer.",
"27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis."
]
data = spark.createDataFrame(text_list, nlp.StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label"),
F.expr("cols['1']['confidence']").alias("confidence")).show(50, truncate=100)
+------------------------------------------------+--------------+----------+
| chunk| ner_label|confidence|
+------------------------------------------------+--------------+----------+
| Majezik| DRUG|0.64671576|
| severe headache| PROBLEM| 0.5526346|
| colon cancer| PROBLEM| 0.8898498|
| 27 years old| PATIENT_AGE| 0.6943085|
| Sep 1st|ADMISSION_DATE|0.95646095|
|a right-sided pleural effusion for thoracentesis| PROBLEM|0.50026613|
+------------------------------------------------+--------------+----------+
from johnsnowlabs import nlp, finance
from pyspark.sql import functions as F
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
textsplitter = finance.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
zero_shot_ner = finance.ZeroShotNerModel.pretrained("finner_roberta_zeroshot", "en", "finance/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("zero_shot_ner")\
.setEntityDefinitions(
{
"DATE": ['When was the company acquisition?', 'When was the company purchase agreement?'],
"ORG": ["Which company was acquired?"],
"PRODUCT": ["Which product?"],
"PROFIT_INCREASE": ["How much has the gross profit increased?"],
"REVENUES_DECLINED": ["How much has the revenues declined?"],
"OPERATING_LOSS_2020": ["Which was the operating loss in 2020"],
"OPERATING_LOSS_2019": ["Which was the operating loss in 2019"]
})
ner_converter = finance.NerConverterInternal()\
.setInputCols(["sentence", "token", "zero_shot_ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
documentAssembler,
textsplitter,
tokenizer,
zero_shot_ner,
ner_converter
]
)
from pyspark.sql.types import StringType
text_list = ["In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
"In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
"While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020 as compared to 2019.",
"We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of $7,738,193 in 2019."]
data = spark.createDataFrame(text_list, nlp.StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label")).show(50, truncate=100)
+------------------+-------------------+
|chunk |ner_label |
+------------------+-------------------+
|March 2012 |DATE |
|Vertro |ORG |
|ALOT |PRODUCT |
|February 2017 |DATE |
|NetSeer |ORG |
|81.4% |PROFIT_INCREASE |
|27% |REVENUES_DECLINED |
|$8,048,581 million|OPERATING_LOSS_2020|
|$7,738,193 |OPERATING_LOSS_2019|
|2019 |DATE |
+------------------+-------------------+
from johnsnowlabs import nlp, legal
from pyspark.sql import functions as F
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
textSplitter = legal.TextSplitter()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
zero_shot_ner = legal.ZeroShotNerModel.pretrained("legner_roberta_zeroshot", "en", "legal/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("zero_shot_ner")\
.setEntityDefinitions(
{
"DATE": ['When was the company acquisition?', 'When was the company purchase agreement?', "When was the agreement?"],
"ORG": ["Which company?"],
"STATE": ["Which state?"],
"AGREEMENT": ["What kind of agreement?"],
"LICENSE": ["What kind of license?"],
"LICENSE_RECIPIENT": ["To whom the license is granted?"]
})
ner_converter = legal.NerConverterInternal()\
.setInputCols(["sentence", "token", "zero_shot_ner"])\
.setOutputCol("ner_chunk")
pipeline = nlp.Pipeline(stages=[
documentAssembler,
textSplitter,
tokenizer,
zero_shot_ner,
ner_converter
]
)
from pyspark.sql.types import StringType
text_list = [
"In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
"In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
"This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company (the 'Licensee')",
"The Company hereby grants to Seller a perpetual, non- exclusive, royalty-free license",
]
data = spark.createDataFrame(text_list, nlp.StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label")).show(50, truncate=100)
+-------------------------------------+-----------------+
|chunk |ner_label |
+-------------------------------------+-----------------+
|March 2012 |DATE |
|Vertro, Inc |ORG |
|February 2017 |DATE |
|asset purchase agreement |AGREEMENT |
|NetSeer |ORG |
|INTELLECTUAL PROPERTY |AGREEMENT |
|December 31, 2018 |DATE |
|Armstrong Flooring |LICENSE_RECIPIENT|
|Delaware |STATE |
|AFI Licensing LLC, a Delaware company|LICENSE_RECIPIENT|
|Seller |LICENSE_RECIPIENT|
|perpetual |LICENSE |
|non- exclusive |LICENSE |
|royalty-free |LICENSE |
+-------------------------------------+-----------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val zeroShotNer = ZeroShotNerModel.pretrained("zero_shot_ner_roberta", "en", "clinical/models")
.setEntityDefinitions(Map(
"PROBLEM" -> Seq("What is the disease?", "What is his symptom?", "What is her disease?", "What is his disease?",
"What is the problem?" ,"What does a patient suffer", "What was the reason that the patient is admitted to the clinic?"),
"DRUG" -> Seq("Which drug?", "Which is the drug?", "What is the drug?", "Which drug does he use?", "Which drug does she use?", "Which drug do I use?", "Which drug is prescribed for a symptom?"),
"ADMISSION_DATE" -> Seq("When did patient admitted to a clinic?"),
"PATIENT_AGE" -> Seq("How old is the patient?", "What is the gae of the patient?")
))
.setInputCols(Array("sentence", "token"))
.setOutputCol("zero_shot_ner")
.setPredictionThreshold(0.1)
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "zero_shot_ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
zeroShotNer,
nerConverter))
val textList = Seq(
"The doctor pescribed Majezik for my severe headache.",
"The patient was admitted to the hospital for his colon cancer.",
"27 years old patient was admitted to clinic on Sep 1st by Dr. X for a right-sided pleural effusion for thoracentesis."
).toDS.toDF("text")
val result = pipeline.fit(textList).transform(textList)
+------------------------------------------------+--------------+----------+
| chunk| ner_label|confidence|
+------------------------------------------------+--------------+----------+
| Majezik| DRUG|0.64671576|
| severe headache| PROBLEM| 0.5526346|
| colon cancer| PROBLEM| 0.8898498|
| 27 years old| PATIENT_AGE| 0.6943085|
| Sep 1st|ADMISSION_DATE|0.95646095|
|a right-sided pleural effusion for thoracentesis| PROBLEM|0.50026613|
+------------------------------------------------+--------------+----------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val textsplitter = new TextSplitter()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val zero_shot_ner = ZeroShotNerModel.pretrained("finner_roberta_zeroshot", "en", "finance/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("zero_shot_ner")
.setEntityDefinitions(
Map(
"DATE" -> Seq('When was the company acquisition?', 'When was the company purchase agreement?'),
"ORG" -> Seq("Which company was acquired?"),
"PRODUCT" -> Seq("Which product?"),
"PROFIT_INCREASE" -> Seq("How much has the gross profit increased?"),
"REVENUES_DECLINED" -> Seq("How much has the revenues declined?"),
"OPERATING_LOSS_2020" -> Seq("Which was the operating loss in 2020"),
"OPERATING_LOSS_2019" -> Seq("Which was the operating loss in 2019")
)
)
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "zero_shot_ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
textsplitter,
tokenizer,
zero_shot_ner,
ner_converter
))
val text_list = Seq(
"In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
"In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
"While our gross profit margin increased to 81.4% in 2020 from 63.1% in 2019, our revenues declined approximately 27% in 2020 as compared to 2019.",
"We reported an operating loss of approximately $8,048,581 million in 2020 as compared to an operating loss of $7,738,193 in 2019."
).toDS.toDF("text")
val result = pipeline.fit(text_list).transform(text_list)
+------------------+-------------------+
|chunk |ner_label |
+------------------+-------------------+
|March 2012 |DATE |
|Vertro |ORG |
|ALOT |PRODUCT |
|February 2017 |DATE |
|NetSeer |ORG |
|81.4% |PROFIT_INCREASE |
|27% |REVENUES_DECLINED |
|$8,048,581 million|OPERATING_LOSS_2020|
|$7,738,193 |OPERATING_LOSS_2019|
|2019 |DATE |
+------------------+-------------------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val textSplitter = new TextSplitter()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val zeroShotNer = ZeroShotNerModel.pretrained("legner_roberta_zeroshot", "en", "legal/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("zero_shot_ner")
.setEntityDefinitions(Map(
"DATE" -> Seq("When was the company acquisition?", "When was the company purchase agreement?", "When was the agreement?"),
"ORG" -> Seq("Which company?"),
"STATE" -> Seq("Which state?"),
"AGREEMENT" -> Seq("What kind of agreement?"),
"LICENSE" -> Seq("What kind of license?"),
"LICENSE_RECIPIENT" -> Seq("To whom the license is granted?")
))
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "zero_shot_ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
textSplitter,
tokenizer,
zeroShotNer,
nerConverter
))
val textList = Seq(
"In March 2012, as part of a longer-term strategy, the Company acquired Vertro, Inc., which owned and operated the ALOT product portfolio.",
"In February 2017, the Company entered into an asset purchase agreement with NetSeer, Inc.",
"This INTELLECTUAL PROPERTY AGREEMENT, dated as of December 31, 2018 (the 'Effective Date') is entered into by and between Armstrong Flooring, Inc., a Delaware corporation ('Seller') and AFI Licensing LLC, a Delaware company (the 'Licensee')",
"The Company hereby grants to Seller a perpetual, non-exclusive, royalty-free license"
).toDS.toDF("text")
val result = pipeline.fit(textList).transform(textList)
+-------------------------------------+-----------------+
|chunk |ner_label |
+-------------------------------------+-----------------+
|March 2012 |DATE |
|Vertro, Inc |ORG |
|February 2017 |DATE |
|asset purchase agreement |AGREEMENT |
|NetSeer |ORG |
|INTELLECTUAL PROPERTY |AGREEMENT |
|December 31, 2018 |DATE |
|Armstrong Flooring |LICENSE_RECIPIENT|
|Delaware |STATE |
|AFI Licensing LLC, a Delaware company|LICENSE_RECIPIENT|
|Seller |LICENSE_RECIPIENT|
|perpetual |LICENSE |
|non-exclusive |LICENSE |
|royalty-free |LICENSE |
+-------------------------------------+-----------------+
ZeroShotRelationExtractionModel
ZeroShotRelationExtractionModel implements zero-shot binary relation extraction by utilizing BERT transformer models trained on the NLI (Natural Language Inference) task.
The model input consists of documents/sentences and paired NER chunks, usually obtained with RENerChunksFilter. The relations to be extracted are defined by a dictionary structure that specifies a set of hypothesis statements about the relationships between named entities.
These statements are automatically appended to each document in the dataset, and the NLI model is used to determine whether a particular relationship holds between the entities.
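Conceptually, each hypothesis template is instantiated with the texts of a candidate chunk pair and scored by the NLI model; a relation is emitted when the hypothesis is entailed with sufficient confidence. The following is a minimal, purely illustrative Python sketch of that substitution (the annotator performs this internally; the template and chunk texts are borrowed from the Healthcare example below):
# Illustration only: filling a hypothesis template with candidate chunk texts.
# ZeroShotRelationExtractionModel performs this substitution internally before NLI scoring.
template = "{DRUG} causes {PROBLEM}."
chunk_pair = {"DRUG": "Paracetamol", "PROBLEM": "headache"}
hypothesis = template
for label, text in chunk_pair.items():
    hypothesis = hypothesis.replace("{" + label + "}", text)
print(hypothesis)  # Paracetamol causes headache.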
Parameters:
- relationalCategories: A dictionary with definitions of relational categories. The keys of the dictionary are the relation labels and the values are lists of hypothesis templates.
- predictionThreshold: Minimal confidence score to encode a relation (Default: 0.5).
- multiLabel: Whether or not a pair of entities can be categorized by multiple relations (Default: False).
All the parameters can be set using the corresponding set method in camel case. For example, .setMultiLabel().
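For instance, a minimal configuration sketch using these camel-case setters (the model name and column names follow the Healthcare example below; setPredictionThreshold is assumed to be the setter corresponding to the predictionThreshold parameter):
from johnsnowlabs import medical
# Minimal configuration sketch for zero-shot relation extraction; see the full pipelines below.
re_model = medical.ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models")\
.setInputCols(["re_ner_chunks", "sentences"])\
.setOutputCol("relations")\
.setRelationalCategories({"ADE": ["{DRUG} causes {PROBLEM}."]})\
.setPredictionThreshold(0.5)\
.setMultiLabel(False)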
For available pretrained models please see the Models Hub.
Input Annotator Types: CHUNK, DOCUMENT
Output Annotator Type: CATEGORY
Python API: ZeroShotRelationExtractionModel | Scala API: ZeroShotRelationExtractionModel | Notebook: ZeroShotRelationExtractionModelNotebook |
Show Example
from johnsnowlabs import nlp, medical
documenter = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = nlp.WordEmbeddingsModel().pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
ner_clinical = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens", "embeddings"])\
.setOutputCol("ner_clinical")
ner_clinical_converter = medical.NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_clinical"])\
.setOutputCol("ner_clinical_chunks")\
.setWhiteList(["PROBLEM", "TEST"]) # PROBLEM-TEST-TREATMENT
ner_posology = medical.NerModel.pretrained("ner_posology", "en", "clinical/models")\
.setInputCols(["sentences", "tokens", "embeddings"])\
.setOutputCol("ner_posology")
ner_posology_converter = medical.NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_posology"])\
.setOutputCol("ner_posology_chunks")\
.setWhiteList(["DRUG"]) # DRUG-FREQUENCY-DOSAGE-DURATION-FORM-ROUTE-STRENGTH
chunk_merger = medical.ChunkMergeApproach()\
.setInputCols("ner_clinical_chunks", "ner_posology_chunks")\
.setOutputCol("merged_ner_chunks")
## ZERO-SHOT RE Starting...
pos_tagger = nlp.PerceptronModel().pretrained("pos_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
dependency_parser = nlp.DependencyParserModel().pretrained("dependency_conllu", "en")\
.setInputCols(["document", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
re_ner_chunk_filter = medical.RENerChunksFilter()\
.setRelationPairs(["problem-test", "problem-drug"])\
.setMaxSyntacticDistance(4)\
.setDocLevelRelations(False)\
.setInputCols(["merged_ner_chunks", "dependencies"])\
.setOutputCol("re_ner_chunks")
re_model = medical.ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models")\
.setInputCols(["re_ner_chunks", "sentences"])\
.setOutputCol("relations")\
.setMultiLabel(True)\
.setRelationalCategories(
{
"ADE": ["{DRUG} causes {PROBLEM}."],
"IMPROVE": ["{DRUG} improves {PROBLEM}.", "{DRUG} cures {PROBLEM}."],
"REVEAL": ["{TEST} reveals {PROBLEM}."],
}
)
pipeline = nlp.Pipeline(
stages = [
documenter,
sentencer,
tokenizer,
words_embedder,
ner_clinical,
ner_clinical_converter,
ner_posology,
ner_posology_converter,
chunk_merger,
pos_tagger,
dependency_parser,
re_ner_chunk_filter,
re_model
]
)
text = "Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer."
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
from pyspark.sql import functions as F
result.select(
F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
F.expr("cols['0']['sentence']").alias("sentence"),
F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
F.expr("cols['0']['entity1_end']").alias("entity1_end"),
F.expr("cols['0']['chunk1']").alias("chunk1"),
F.expr("cols['0']['entity1']").alias("entity1"),
F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
F.expr("cols['0']['entity2_end']").alias("entity2_end"),
F.expr("cols['0']['chunk2']").alias("chunk2"),
F.expr("cols['0']['entity2']").alias("entity2"),
F.expr("cols['0']['hypothesis']").alias("hypothesis"),
F.expr("cols['0']['nli_prediction']").alias("nli_prediction"),
F.expr("cols['1']").alias("relation"),
F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+------------------------------+--------------+--------+----------+
|sentence|entity1_begin|entity1_end|     chunk1|entity1|entity2_begin|entity2_end|  chunk2|entity2|                    hypothesis|nli_prediction|relation|confidence|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+------------------------------+--------------+--------+----------+
|       0|            0|         10|Paracetamol|   DRUG|           38|         45|sickness|PROBLEM|Paracetamol improves sickness.|        entail| IMPROVE|0.98819494|
|       0|            0|         10|Paracetamol|   DRUG|           26|         33|headache|PROBLEM|Paracetamol improves headache.|        entail| IMPROVE| 0.9929625|
|       1|           48|         58|An MRI test|   TEST|           80|         85|  cancer|PROBLEM|   An MRI test reveals cancer.|        entail|  REVEAL| 0.9760039|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+------------------------------+--------------+--------+----------+
from johnsnowlabs import nlp, finance
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_financial_small", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = finance.NerConverterInternal()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
re_model = finance.ZeroShotRelationExtractionModel.pretrained("finre_zero_shot", "en", "finance/models")\
.setInputCols(["ner_chunk", "sentence"])\
.setOutputCol("relations")\
.setMultiLabel(False)\
.setRelationalCategories(
{
"profit_decline_by": [
"{PROFIT_DECLINE} decreased by {AMOUNT} from",
"{PROFIT_DECLINE} decreased by {AMOUNT} to",
],
"profit_decline_by_per": [
"{PROFIT_DECLINE} decreased by a {PERCENTAGE} from",
"{PROFIT_DECLINE} decreased by a {PERCENTAGE} to",
],
"profit_decline_from": [
"{PROFIT_DECLINE} decreased from {AMOUNT}",
"{PROFIT_DECLINE} decreased from {AMOUNT} for the year",
],
"profit_decline_from_per": [
"{PROFIT_DECLINE} decreased from {PERCENTAGE} to",
"{PROFIT_DECLINE} decreased from {PERCENTAGE} to a total of",
],
"profit_decline_to": ["{PROFIT_DECLINE} to {AMOUNT}"],
"profit_increase_from": ["{PROFIT_INCREASE} from {AMOUNT}"],
"profit_increase_to": ["{PROFIT_INCREASE} to {AMOUNT}"],
"expense_decrease_by": ["{EXPENSE_DECREASE} decreased by {AMOUNT}"],
"expense_decrease_by_per": ["{EXPENSE_DECREASE} decreased by a {PERCENTAGE}"],
"expense_decrease_from": ["{EXPENSE_DECREASE} decreased from {AMOUNT}"],
"expense_decrease_to": [
"{EXPENSE_DECREASE} for a total of {AMOUNT} for the fiscal year"
],
"has_date": [
"{AMOUNT} for the fiscal year ended {FISCAL_YEAR}",
"{PERCENTAGE} for the fiscal year ended {FISCAL_YEAR}",
],
}
)
pipeline = nlp.Pipeline(
stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter,
re_model,
]
)
text = """License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019. Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
from pyspark.sql import functions as F
result.select(
F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
F.expr("cols['0']['sentence']").alias("sentence"),
F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
F.expr("cols['0']['entity1_end']").alias("entity1_end"),
F.expr("cols['0']['chunk1']").alias("chunk1"),
F.expr("cols['0']['entity1']").alias("entity1"),
F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
F.expr("cols['0']['entity2_end']").alias("entity2_end"),
F.expr("cols['0']['chunk2']").alias("chunk2"),
F.expr("cols['0']['entity2']").alias("entity2"),
F.expr("cols['0']['hypothesis']").alias("hypothesis"),
F.expr("cols['0']['nli_prediction']").alias("nli_prediction"),
F.expr("cols['1']").alias("relation"),
F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
|sentence|entity1_begin|entity1_end| chunk1| entity1|entity2_begin|entity2_end| chunk2| entity2| hypothesis|nli_prediction| relation|confidence|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
| 1| 227| 238| 25.6 million| AMOUNT| 316| 332|December 31, 2019|FISCAL_YEAR|25.6 million for the fiscal year ended December 31, 2019| entail| has_date| 0.8744757|
| 0| 31| 32| 40| PERCENTAGE| 153| 169|December 31, 2019|FISCAL_YEAR| 40 for the fiscal year ended December 31, 2019| entail| has_date| 0.7889032|
| 5| 799| 826|Sales and marketing expenses|EXPENSE_DECREASE| 923| 933| 7.5 million| AMOUNT| Sales and marketing expenses decreased from 7.5 million| entail|expense_decrease_from| 0.9770538|
| 0| 59| 69| 0.7 million| AMOUNT| 90| 106|December 31, 2020|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2020| entail| has_date|0.67187774|
| 1| 172| 187| Services revenue| PROFIT_INCREASE| 227| 238| 25.6 million| AMOUNT| Services revenue to 25.6 million| entail| profit_increase_to| 0.9674029|
| 0| 31| 32| 40| PERCENTAGE| 90| 106|December 31, 2020|FISCAL_YEAR| 40 for the fiscal year ended December 31, 2020| entail| has_date|0.77800345|
| 5| 838| 839| 20| PERCENTAGE| 898| 914|December 31, 2020|FISCAL_YEAR| 20 for the fiscal year ended December 31, 2020| entail| has_date|0.85455483|
| 3| 561| 572| travel costs|EXPENSE_DECREASE| 579| 589| 0.4 million| AMOUNT| travel costs decreased by 0.4 million| entail| expense_decrease_by| 0.9946776|
| 0| 42| 52| 0.5 million| AMOUNT| 153| 169|December 31, 2019|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2019| entail| has_date| 0.7756689|
| 1| 172| 187| Services revenue| PROFIT_INCREASE| 209| 219| 1.1 million| AMOUNT| Services revenue from 1.1 million| entail| profit_increase_from|0.96610945|
| 2| 408| 418| 0.1 million| AMOUNT| 521| 537|December 31, 2019|FISCAL_YEAR| 0.1 million for the fiscal year ended December 31, 2019| entail| has_date| 0.9083247|
| 5| 849| 859| 1.5 million| AMOUNT| 898| 914|December 31, 2020|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2020| entail| has_date| 0.7528142|
| 5| 849| 859| 1.5 million| AMOUNT| 954| 970|December 31, 2019|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2019| entail| has_date|0.80734617|
| 0| 42| 52| 0.5 million| AMOUNT| 90| 106|December 31, 2020|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2020| entail| has_date| 0.7157578|
| 1| 172| 187| Services revenue| PROFIT_INCREASE| 284| 295| 24.5 million| AMOUNT| Services revenue to 24.5 million| entail| profit_increase_to| 0.8597209|
| 0| 59| 69| 0.7 million| AMOUNT| 153| 169|December 31, 2019|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2019| entail| has_date|0.74845695|
| 1| 199| 199| 4| PERCENTAGE| 259| 275|December 31, 2020|FISCAL_YEAR| 4 for the fiscal year ended December 31, 2020| entail| has_date|0.84127575|
| 2| 424| 424| 2| PERCENTAGE| 465| 481|December 31, 2020|FISCAL_YEAR| 2 for the fiscal year ended December 31, 2020| entail| has_date| 0.8046481|
| 2| 424| 424| 2| PERCENTAGE| 521| 537|December 31, 2019|FISCAL_YEAR| 2 for the fiscal year ended December 31, 2019| entail| has_date| 0.8485104|
| 0| 0| 19| License fees revenue| PROFIT_DECLINE| 31| 32| 40| PERCENTAGE| License fees revenue decreased by a 40 to| entail|profit_decline_by_per| 0.9948003|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
only showing top 20 rows
from johnsnowlabs import nlp, legal
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols("document")\
.setOutputCol("token")
tokenClassifier = legal.BertForTokenClassification.pretrained('legner_obligations','en', 'legal/models')\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setMaxSentenceLength(512)\
.setCaseSensitive(True)
ner_converter = legal.NerConverterInternal()\
.setInputCols(["document", "token", "ner"])\
.setOutputCol("ner_chunk")
re_model = legal.ZeroShotRelationExtractionModel.pretrained("legre_zero_shot", "en", "legal/models")\
.setInputCols(["ner_chunk", "document"]) \
.setOutputCol("relations")
re_model.setRelationalCategories({
"GRANTS_TO": ["{OBLIGATION_SUBJECT} grants {OBLIGATION_INDIRECT_OBJECT}"],
"GRANTS": ["{OBLIGATION_SUBJECT} grants {OBLIGATION_ACTION}"]
})
pipeline = nlp.Pipeline(stages = [
documentAssembler,
tokenizer,
tokenClassifier,
ner_converter,
re_model
])
text = """Arizona Copyright Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Copyrights for use in the Company Field throughout the world."""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
from pyspark.sql import functions as F
result.select(
F.explode(F.arrays_zip(result.relations.metadata, result.relations.result)).alias("cols")).select(
F.expr("cols['0']['sentence']").alias("sentence"),
F.expr("cols['0']['entity1_begin']").alias("entity1_begin"),
F.expr("cols['0']['entity1_end']").alias("entity1_end"),
F.expr("cols['0']['chunk1']").alias("chunk1"),
F.expr("cols['0']['entity1']").alias("entity1"),
F.expr("cols['0']['entity2_begin']").alias("entity2_begin"),
F.expr("cols['0']['entity2_end']").alias("entity2_end"),
F.expr("cols['0']['chunk2']").alias("chunk2"),
F.expr("cols['0']['entity2']").alias("entity2"),
F.expr("cols['0']['hypothesis']").alias("hypothesis"),
F.expr("cols['0']['nli_prediction']").alias("nli_prediction"),
F.expr("cols['1']").alias("relation"),
F.expr("cols['0']['confidence']").alias("confidence"),
).show(truncate=70)
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
|sentence|entity1_begin|entity1_end| chunk1| entity1|entity2_begin|entity2_end| chunk2| entity2| hypothesis|nli_prediction| relation|confidence|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
| 0| 80| 86|Arizona|OBLIGATION_SUBJECT| 109| 115| Company|OBLIGATION_INDIRECT_OBJECT| Arizona grants Company| entail|GRANTS_TO| 0.9535338|
| 0| 80| 86|Arizona|OBLIGATION_SUBJECT| 88| 100|hereby grants| OBLIGATION_ACTION|Arizona grants hereby grants| entail| GRANTS| 0.9873099|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
import spark.implicits._
val documenter = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentencer = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentences"))
.setOutputCol("tokens")
val wordsEmbedder = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("embeddings")
val nerClinical = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_clinical")
val nerClinicalConverter = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_clinical"))
.setOutputCol("ner_clinical_chunks")
.setWhiteList(Array("PROBLEM", "TEST"))
val nerPosology = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "embeddings"))
.setOutputCol("ner_posology")
val nerPosologyConverter = new NerConverterInternal()
.setInputCols(Array("sentences", "tokens", "ner_posology"))
.setOutputCol("ner_posology_chunks")
.setWhiteList(Array("DRUG"))
val chunkMerger = new ChunkMergeApproach()
.setInputCols(Array("ner_clinical_chunks", "ner_posology_chunks"))
.setOutputCol("merged_ner_chunks")
val posTagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("pos_tags")
val dependencyParser = DependencyParserModel.pretrained("dependency_conllu", "en")
.setInputCols(Array("document", "pos_tags", "tokens"))
.setOutputCol("dependencies")
val reNerChunkFilter = new RENerChunksFilter()
.setRelationPairs(Array("problem-test", "problem-drug"))
.setMaxSyntacticDistance(4)
.setDocLevelRelations(false)
.setInputCols(Array("merged_ner_chunks", "dependencies"))
.setOutputCol("re_ner_chunks")
val reModel = ZeroShotRelationExtractionModel.pretrained("re_zeroshot_biobert", "en", "clinical/models")
.setInputCols(Array("re_ner_chunks", "sentences"))
.setOutputCol("relations")
.setMultiLabel(true)
.setRelationalCategories(Map(
"ADE" -> Array("{DRUG} causes {PROBLEM}."),
"IMPROVE" -> Array("{DRUG} improves {PROBLEM}.", "{DRUG} cures {PROBLEM}."),
"REVEAL" -> Array("{TEST} reveals {PROBLEM}.")
))
val pipeline = new Pipeline().setStages(Array(
documenter,
sentencer,
tokenizer,
wordsEmbedder,
nerClinical,
nerClinicalConverter,
nerPosology,
nerPosologyConverter,
chunkMerger,
posTagger,
dependencyParser,
reNerChunkFilter,
reModel
))
val text = "Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer."
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+--------------------+--------------+--------+----------+
|sentence|entity1_begin|entity1_end| chunk1|entity1|entity2_begin|entity2_end| chunk2|entity2| hypothesis|nli_prediction|relation|confidence|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+--------------------+--------------+--------+----------+
| 0| 0| 10|Paracetamol| DRUG| 38| 45|sickness|PROBLEM|Paracetamol impro...| entail| IMPROVE|0.98819494|
| 0| 0| 10|Paracetamol| DRUG| 26| 33|headache|PROBLEM|Paracetamol impro...| entail| IMPROVE| 0.9929625|
| 1| 48| 58|An MRI test| TEST| 80| 85| cancer|PROBLEM|An MRI test revea...| entail| REVEAL| 0.9760039|
+--------+-------------+-----------+-----------+-------+-------------+-----------+--------+-------+--------------------+--------------+--------+----------+
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = BertEmbeddings.pretrained("bert_embeddings_sec_bert_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val nerModel = FinanceNerModel.pretrained("finner_financial_small", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val reModel = ZeroShotRelationExtractionModel.pretrained("finre_zero_shot", "en", "finance/models")
.setInputCols(Array("ner_chunk", "sentence"))
.setOutputCol("relations")
.setMultiLabel(false)
.setRelationalCategories(Map(
"profit_decline_by" -> Array(
"{PROFIT_DECLINE} decreased by {AMOUNT} from",
"{PROFIT_DECLINE} decreased by {AMOUNT} to"
),
"profit_decline_by_per" -> Array(
"{PROFIT_DECLINE} decreased by a {PERCENTAGE} from",
"{PROFIT_DECLINE} decreased by a {PERCENTAGE} to"
),
"profit_decline_from" -> Array(
"{PROFIT_DECLINE} decreased from {AMOUNT}",
"{PROFIT_DECLINE} decreased from {AMOUNT} for the year"
),
"profit_decline_from_per" -> Array(
"{PROFIT_DECLINE} decreased from {PERCENTAGE} to",
"{PROFIT_DECLINE} decreased from {PERCENTAGE} to a total of"
),
"profit_decline_to" -> Array("{PROFIT_DECLINE} to {AMOUNT}"),
"profit_increase_from" -> Array("{PROFIT_INCREASE} from {AMOUNT}"),
"profit_increase_to" -> Array("{PROFIT_INCREASE} to {AMOUNT}"),
"expense_decrease_by" -> Array("{EXPENSE_DECREASE} decreased by {AMOUNT}"),
"expense_decrease_by_per" -> Array("{EXPENSE_DECREASE} decreased by a {PERCENTAGE}"),
"expense_decrease_from" -> Array("{EXPENSE_DECREASE} decreased from {AMOUNT}"),
"expense_decrease_to" -> Array("{EXPENSE_DECREASE} for a total of {AMOUNT} for the fiscal year"),
"has_date" -> Array(
"{AMOUNT} for the fiscal year ended {FISCAL_YEAR}",
"{PERCENTAGE} for the fiscal year ended {FISCAL_YEAR}"
)
))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
reModel
))
val text = """License fees revenue decreased 40 %, or $ 0.5 million to $ 0.7 million for the year ended December 31, 2020 compared to $ 1.2 million for the year ended December 31, 2019. Services revenue increased 4 %, or $ 1.1 million, to $ 25.6 million for the year ended December 31, 2020 from $ 24.5 million for the year ended December 31, 2019. Costs of revenue, excluding depreciation and amortization increased by $ 0.1 million, or 2 %, to $ 8.8 million for the year ended December 31, 2020 from $ 8.7 million for the year ended December 31, 2019. Also, a decrease in travel costs of $ 0.4 million due to travel restrictions caused by the global pandemic. As a percentage of revenue, cost of revenue, excluding depreciation and amortization was 34 % for each of the years ended December 31, 2020 and 2019. Sales and marketing expenses decreased 20 %, or $ 1.5 million, to $ 6.0 million for the year ended December 31, 2020 from $ 7.5 million for the year ended December 31, 2019."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
|sentence|entity1_begin|entity1_end| chunk1| entity1|entity2_begin|entity2_end| chunk2| entity2| hypothesis|nli_prediction| relation|confidence|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
| 1| 227| 238| 25.6 million| AMOUNT| 316| 332|December 31, 2019|FISCAL_YEAR|25.6 million for the fiscal year ended December 31, 2019| entail| has_date| 0.8744757|
| 0| 31| 32| 40| PERCENTAGE| 153| 169|December 31, 2019|FISCAL_YEAR| 40 for the fiscal year ended December 31, 2019| entail| has_date| 0.7889032|
| 5| 799| 826|Sales and marketing expenses|EXPENSE_DECREASE| 923| 933| 7.5 million| AMOUNT| Sales and marketing expenses decreased from 7.5 million| entail|expense_decrease_from| 0.9770538|
| 0| 59| 69| 0.7 million| AMOUNT| 90| 106|December 31, 2020|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2020| entail| has_date|0.67187774|
| 1| 172| 187| Services revenue| PROFIT_INCREASE| 227| 238| 25.6 million| AMOUNT| Services revenue to 25.6 million| entail| profit_increase_to| 0.9674029|
| 0| 31| 32| 40| PERCENTAGE| 90| 106|December 31, 2020|FISCAL_YEAR| 40 for the fiscal year ended December 31, 2020| entail| has_date|0.77800345|
| 5| 838| 839| 20| PERCENTAGE| 898| 914|December 31, 2020|FISCAL_YEAR| 20 for the fiscal year ended December 31, 2020| entail| has_date|0.85455483|
| 3| 561| 572| travel costs|EXPENSE_DECREASE| 579| 589| 0.4 million| AMOUNT| travel costs decreased by 0.4 million| entail| expense_decrease_by| 0.9946776|
| 0| 42| 52| 0.5 million| AMOUNT| 153| 169|December 31, 2019|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2019| entail| has_date| 0.7756689|
| 1| 172| 187| Services revenue| PROFIT_INCREASE| 209| 219| 1.1 million| AMOUNT| Services revenue from 1.1 million| entail| profit_increase_from|0.96610945|
| 2| 408| 418| 0.1 million| AMOUNT| 521| 537|December 31, 2019|FISCAL_YEAR| 0.1 million for the fiscal year ended December 31, 2019| entail| has_date| 0.9083247|
| 5| 849| 859| 1.5 million| AMOUNT| 898| 914|December 31, 2020|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2020| entail| has_date| 0.7528142|
| 5| 849| 859| 1.5 million| AMOUNT| 954| 970|December 31, 2019|FISCAL_YEAR| 1.5 million for the fiscal year ended December 31, 2019| entail| has_date|0.80734617|
| 0| 42| 52| 0.5 million| AMOUNT| 90| 106|December 31, 2020|FISCAL_YEAR| 0.5 million for the fiscal year ended December 31, 2020| entail| has_date| 0.7157578|
| 1| 172| 187| Services revenue| PROFIT_INCREASE| 284| 295| 24.5 million| AMOUNT| Services revenue to 24.5 million| entail| profit_increase_to| 0.8597209|
| 0| 59| 69| 0.7 million| AMOUNT| 153| 169|December 31, 2019|FISCAL_YEAR| 0.7 million for the fiscal year ended December 31, 2019| entail| has_date|0.74845695|
| 1| 199| 199| 4| PERCENTAGE| 259| 275|December 31, 2020|FISCAL_YEAR| 4 for the fiscal year ended December 31, 2020| entail| has_date|0.84127575|
| 2| 424| 424| 2| PERCENTAGE| 465| 481|December 31, 2020|FISCAL_YEAR| 2 for the fiscal year ended December 31, 2020| entail| has_date| 0.8046481|
| 2| 424| 424| 2| PERCENTAGE| 521| 537|December 31, 2019|FISCAL_YEAR| 2 for the fiscal year ended December 31, 2019| entail| has_date| 0.8485104|
| 0| 0| 19| License fees revenue| PROFIT_DECLINE| 31| 32| 40| PERCENTAGE| License fees revenue decreased by a 40 to| entail|profit_decline_by_per| 0.9948003|
+--------+-------------+-----------+----------------------------+----------------+-------------+-----------+-----------------+-----------+--------------------------------------------------------+--------------+---------------------+----------+
only showing top 20 rows
import spark.implicits._
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val tokenClassifier = LegalBertForTokenClassification.pretrained("legner_obligations", "en", "legal/models")
.setInputCols(Array("token", "document"))
.setOutputCol("ner")
.setMaxSentenceLength(512)
.setCaseSensitive(true)
val nerConverter = new NerConverterInternal()
.setInputCols(Array("document", "token", "ner"))
.setOutputCol("ner_chunk")
val reModel = ZeroShotRelationExtractionModel.pretrained("legre_zero_shot", "en", "legal/models")
.setInputCols(Array("ner_chunk", "document"))
.setOutputCol("relations")
reModel.setRelationalCategories(Map(
"GRANTS_TO" -> Array("{OBLIGATION_SUBJECT} grants {OBLIGATION_INDIRECT_OBJECT}"),
"GRANTS" -> Array("{OBLIGATION_SUBJECT} grants {OBLIGATION_ACTION}")
))
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
tokenClassifier,
nerConverter,
reModel
))
val text = """Arizona Copyright Grant. Subject to the terms and conditions of this Agreement, Arizona hereby grants to the Company a perpetual, non-exclusive, royalty-free license in, to and under the Arizona Licensed Copyrights for use in the Company Field throughout the world."""
val data = Seq(text).toDF("text")
val result = pipeline.fit(data).transform(data)
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
|sentence|entity1_begin|entity1_end| chunk1| entity1|entity2_begin|entity2_end| chunk2| entity2| hypothesis|nli_prediction| relation|confidence|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+
| 0| 80| 86|Arizona|OBLIGATION_SUBJECT| 109| 115| Company|OBLIGATION_INDIRECT_OBJECT| Arizona grants Company| entail|GRANTS_TO| 0.9535338|
| 0| 80| 86|Arizona|OBLIGATION_SUBJECT| 88| 100|hereby grants| OBLIGATION_ACTION|Arizona grants hereby grants| entail| GRANTS| 0.9873099|
+--------+-------------+-----------+-------+------------------+-------------+-----------+-------------+--------------------------+----------------------------+--------------+---------+----------+