A Spark NLP Enterprise license includes access to unique annotators. At the Spark NLP Workshop you can see different types of annotators in action.
By clicking on any annotator, you will see different sections:
- The Approach, or class to train models.
- The Model, to infer using pretrained models.
Also, for most of the annotators, you will find examples for the different enterprise libraries:
- Healthcare NLP
- Finance NLP
- Legal NLP
Check out the Spark NLP Annotators page for more information on how to read this page.
Available Annotators
Annotators | Description |
---|---|
AssertionDL | A deep learning-based approach used to extract Assertion Status from extracted entities and text. |
AssertionFilterer | Filters entities coming from ASSERTION type annotations and returns the CHUNKS. |
AssertionLogReg | Logistic Regression is used to extract Assertion Status from extracted entities and text. |
Chunk2Token | A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN). |
ChunkEntityResolver | Returns a normalized entity for a particular trained ontology / curated dataset (e.g., clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc.). |
ChunkFilterer | Filters entities coming from CHUNK annotations. |
ChunkKeyPhraseExtraction | Uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text. |
ChunkMerge | Merges entities coming from different CHUNK annotations. |
ContextualParser | Extracts entity from a document based on user defined rules. |
DeIdentification | Deidentifies Input Annotations of types DOCUMENT, TOKEN and CHUNK, by either masking or obfuscating the given CHUNKS. |
DocumentLogRegClassifier | Classifies documents with a logistic regression algorithm. |
DrugNormalizer | Annotator that normalizes raw text from documents, e.g., scraped web pages or XML documents. |
FeaturesAssembler | Collects features from different columns. |
GenericClassifier | Creates a generic single-label classifier which uses pre-generated TensorFlow graphs. |
IOBTagger | Merges token tags and NER labels from chunks in the specified format. |
NerChunker | Extracts phrases that fit a known pattern using the NER tags. |
NerConverterInternal | Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label. |
NerDisambiguator | Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). |
MedicalNer | This named entity recognition annotator is a generic NER model based on neural networks. |
QuestionAnswering | GPT-based model for answering questions given a context. |
RENerChunksFilter | Filters and outputs combinations of relations between extracted entities, for further processing. |
ReIdentification | Reidentifies obfuscated entities by DeIdentification. |
RelationExtraction | Extracts and classifies instances of relations between named entities. |
RelationExtractionDL | Extracts and classifies instances of relations between named entities. |
SentenceEntityResolver | Returns the normalized entity for a particular trained ontology / curated dataset (e.g., clinical ICD-10, RxNorm, SNOMED; financial SEC’s EDGAR database, etc.) based on sentence embeddings. |
Summarizer | Helps to quickly summarize complex medical information. |
TextGenerator | Uses the basic BioGPT model to perform various tasks related to medical text abstraction. |
TFGraphBuilder | Creates TensorFlow graphs. |
AnnotationMerger
Merges annotations from different pipeline steps that have the same annotation type into a unified annotation. Possible annotations that can be merged include:
- document (e.g., output of the DocumentAssembler annotator)
- token (e.g., output of the Tokenizer annotator)
- word_embeddings (e.g., output of the WordEmbeddingsModel annotator)
- sentence_embeddings (e.g., output of the BertSentenceEmbeddings annotator)
- category (e.g., output of the RelationExtractionModel annotator)
- date (e.g., output of the DateMatcher annotator)
- sentiment (e.g., output of the SentimentDLModel annotator)
- pos (e.g., output of the PerceptronModel annotator)
- chunk (e.g., output of the NerConverter annotator)
- named_entity (e.g., output of the NerDLModel annotator)
- regex (e.g., output of the RegexTokenizer annotator)
- dependency (e.g., output of the DependencyParserModel annotator)
- language (e.g., output of the LanguageDetectorDL annotator)
- keyword (e.g., output of the YakeModel annotator)
Input Annotator Types: ANY
Output Annotator Type: ANY
Python API: AnnotationMerger | Scala API: AnnotationMerger |
Show Example
# Create the pipeline with two RE models
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentencer = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = Tokenizer()\
.setInputCols(["sentences"])\
.setOutputCol("tokens")
words_embedder = WordEmbeddingsModel()\
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
pos_tagger = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models") \
.setInputCols(["sentences", "tokens"])\
.setOutputCol("pos_tags")
pos_ner_tagger = MedicalNerModel()\
.pretrained("ner_posology", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ner_pos")
pos_ner_chunker = NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ner_pos"])\
.setOutputCol("pos_ner_chunks")
dependency_parser = DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
pos_reModel = RelationExtractionModel()\
.pretrained("posology_re")\
.setInputCols(["embeddings", "pos_tags", "pos_ner_chunks", "dependencies"])\
.setOutputCol("pos_relations")\
.setMaxSyntacticDistance(4)
ade_ner_tagger = MedicalNerModel.pretrained("ner_ade_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens", "embeddings")\
.setOutputCol("ade_ner_tags")
ade_ner_chunker = NerConverterInternal()\
.setInputCols(["sentences", "tokens", "ade_ner_tags"])\
.setOutputCol("ade_ner_chunks")
ade_reModel = RelationExtractionModel()\
.pretrained("re_ade_clinical", "en", 'clinical/models')\
.setInputCols(["embeddings", "pos_tags", "ade_ner_chunks", "dependencies"])\
.setOutputCol("ade_relations")\
.setMaxSyntacticDistance(10)\
.setRelationPairs(["drug-ade", "ade-drug"])
annotation_merger = AnnotationMerger()\
.setInputCols("ade_relations", "pos_relations")\
.setInputType("category")\
.setOutputCol("all_relations")
merger_pipeline = Pipeline(stages=[
documenter,
sentencer,
tokenizer,
words_embedder,
pos_tagger,
pos_ner_tagger,
pos_ner_chunker,
dependency_parser,
pos_reModel,
ade_ner_tagger,
ade_ner_chunker,
ade_reModel,
annotation_merger
])
empty_df= spark.createDataFrame([[""]]).toDF("text")
merger_model= merger_pipeline.fit(empty_df)
# Show example result
text = """
The patient was prescribed 1 unit of naproxen for 5 days after meals for chronic low back pain. The patient was also given 1 unit of oxaprozin daily for rheumatoid arthritis presented with tense bullae and cutaneous fragility on the face and the back of the hands..
"""
data = spark.createDataFrame([[text]]).toDF("text")
result = merger_model.transform(data)
result.show()
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
text| document| sentences| tokens| embeddings| pos_tags| ner_pos| pos_ner_chunks| dependencies| pos_relations| ade_ner_tags| ade_ner_chunks| ade_relations| all_relations|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
The patient was ...|[{document, 0, 26...|[{document, 1, 95...|[{token, 1, 3, Th...|[{word_embeddings...|[{pos, 1, 3, DD, ...|[{named_entity, 1...|[{chunk, 28, 33, ...|[{dependency, 1, ...|[{category, 28, 4...|[{named_entity, 1...|[{chunk, 38, 45, ...|[{category, 134, ...|[{category, 134, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
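As a quick sanity check, the merged column should contain the union of the two relation sets. A minimal sketch using the result DataFrame from above (the actual counts depend on the models' output):
result.selectExpr(
"size(pos_relations) AS n_pos",
"size(ade_relations) AS n_ade",
"size(all_relations) AS n_all"
).show()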
AssertionChunkConverter
This annotator creates a CHUNK column with metadata useful for training an Assertion Status Detection model (see AssertionDL).
In some cases, creating the chunk column from token indices can cause issues that lead to a loss of data for training assertion status models. The AssertionChunkConverter annotator takes both the begin and end indices of the tokens as input to add more robust metadata to the chunk column, improving the reliability of the indices and avoiding data loss.
NOTE: Chunk begin and end indices in the assertion status model training dataframe can be populated using the new version of the ALAB module.
Input Annotator Types: TOKEN
Output Annotator Type: CHUNK
Python API: AssertionChunkConverter | Scala API: AssertionChunkConverter |
Show Example
data = spark.createDataFrame(
[
[
"An angiography showed bleeding in two vessels off of the Minnie supplying the sigmoid that were succesfully embolized.",
"Minnie",
57,
64,
],
[
"After discussing this with his PCP, Leon was clear that the patient had had recurrent DVTs and ultimately a PE and his PCP felt strongly that he required long-term anticoagulation ",
"PCP",
31,
34,
],
]
).toDF("text", "target", "char_begin", "char_end")
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = (
SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
)
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("tokens")
converter = (
AssertionChunkConverter()
.setInputCols("tokens")
.setChunkTextCol("target")
.setChunkBeginCol("char_begin")
.setChunkEndCol("char_end")
.setOutputTokenBeginCol("token_begin")
.setOutputTokenEndCol("token_end")
.setOutputCol("chunk")
)
pipeline = Pipeline().setStages(
[document_assembler, sentenceDetector, tokenizer, converter]
)
results = pipeline.fit(data).transform(data)
results.selectExpr(
"target",
"char_begin",
"char_end",
"token_begin",
"token_end",
"tokens[token_begin].result",
"tokens[token_end].result",
"target",
"chunk",
).show(truncate=False)
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|target|char_begin|char_end|token_begin|token_end|tokens[token_begin].result|tokens[token_end].result|target|chunk |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
|Minnie|57 |64 |10 |10 |Minnie |Minnie |Minnie|[{chunk, 57, 62, Minnie, {sentence -> 0}, []}]|
|PCP |31 |34 |5 |5 |PCP |PCP |PCP |[{chunk, 31, 33, PCP, {sentence -> 0}, []}] |
+------+----------+--------+-----------+---------+--------------------------+------------------------+------+----------------------------------------------+
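The token_begin and token_end columns produced above can then be fed to an assertion trainer. A minimal sketch, assuming the training DataFrame also has a gold-standard "label" column and that an embeddings stage is added to the pipeline:
assertion_trainer = AssertionDLApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "embeddings"]) \
.setOutputCol("assertion") \
.setStartCol("token_begin") \
.setEndCol("token_end")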
AssertionDL
AssertionDL is a deep learning-based approach used to extract Assertion Status from extracted entities and text. AssertionDLModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained from, e.g., a DocumentAssembler, NerConverter and WordEmbeddingsModel.
The result is an assertion status annotation for each recognized entity.
Possible values include “present”, “absent”, “hypothetical”, “conditional”, “associated_with_other_person”, etc.
For pretrained models please see the Models Hub for available models.
Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS
Output Annotator Type: ASSERTION
Python API: AssertionDLModel | Scala API: AssertionDLModel |
Show Example
from johnsnowlabs import *
# Define pipeline stages to extract NER chunks first
data = spark.createDataFrame([
["Patient with severe fever and sore throat"],
["Patient shows no stomach pain"],
["She was maintained on an epidural and PCA for pain control."]]).toDF("text")
documentAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
nerModel = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
nerConverter = nlp.NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")
# Then a pretrained AssertionDLModel is used to extract the assertion status
clinicalAssertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
assertionPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
clinicalAssertion
])
assertionModel = assertionPipeline.fit(data)
# Show results
result = assertionModel.transform(data)
result.selectExpr("ner_chunk.result", "assertion.result").show(3, truncate=False)
+--------------------------------+--------------------------------+
|result |result |
+--------------------------------+--------------------------------+
|[severe fever, sore throat] |[present, present] |
|[stomach pain] |[absent] |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
from johnsnowlabs import *
data = spark.createDataFrame([["Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc."]]).toDF("text")
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")\
ner_converter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter,
assertion
])
assertionModel = pipeline.fit(data)
# Show results
result = assertionModel.transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata, result.assertion.result)).alias("cols"))\
.select(F.expr("cols['1']['sentence']").alias("sent_id"),
F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label"),
F.expr("cols['2']").alias("assertion")).show(truncate=False)
+-------+------------+---------+----------+
|sent_id|chunk |ner_label|assertion |
+-------+------------+---------+----------+
|0 |McAfee LLC |ORG |COMPETITOR|
|0 |Broadcom Inc|ORG |COMPETITOR|
+-------+------------+---------+----------+
from johnsnowlabs import *
data = spark.createDataFrame([["This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc."]]).toDF("text")
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings_ner")\
ner_model = legal.NerModel.pretrained('legner_contract_doc_parties', 'en', 'legal/models')\
.setInputCols(["sentence", "token", "embeddings_ner"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["DOC", "EFFDATE", "PARTY"])
embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings_ass")
assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")\
.setInputCols(["sentence", "ner_chunk", "embeddings_ass"]) \
.setOutputCol("assertion")
nlpPipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
embeddings_ner,
ner_model,
ner_converter,
embeddings_ass,
assertion
])
assertionModel = nlpPipeline.fit(data)
# Show results
result = assertionModel.transform(data)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result,
result.ner_chunk.begin,
result.ner_chunk.end,
result.ner_chunk.metadata,
result.assertion.result)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']").alias("begin"),
F.expr("cols['2']").alias("end"),
F.expr("cols['3']['entity']").alias("ner_label"),
F.expr("cols['4']").alias("assertion")).show(truncate=False)
+-------------------------------+-----+---+---------+---------+
|chunk |begin|end|ner_label|assertion|
+-------------------------------+-----+---+---------+---------+
|Intellectual Property Agreement|11 |41 |DOC |PRESENT |
|Amazon Inc |51 |60 |PARTY |PRESENT |
|Atlantic Inc |67 |78 |PARTY |PRESENT |
+-------------------------------+-----+---+---------+---------+
// Define pipeline stages to extract NER chunks first
val data = Seq(
"Patient with severe fever and sore throat",
"Patient shows no stomach pain",
"She was maintained on an epidural and PCA for pain control.").toDF("text")
val documentAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence", "token")).setOutputCol("embeddings")
val nerModel = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings")).setOutputCol("ner")
val nerConverter = new nlp.NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
// Then a pretrained AssertionDLModel is used to extract the assertion status
val clinicalAssertion = medical.AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
clinicalAssertion
))
val assertionModel = assertionPipeline.fit(data)
// Show results
val result = assertionModel.transform(data)
result.selectExpr("ner_chunk.result", "assertion.result").show(3, truncate=false)
+--------------------------------+--------------------------------+
|result |result |
+--------------------------------+--------------------------------+
|[severe fever, sore throat] |[present, present] |
|[stomach pain] |[absent] |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
val data = Seq("Our competitors include the following by general category: legacy antivirus product providers, such as McAfee LLC and Broadcom Inc.").toDF("text")
val document_assembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new nlp.SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new finance.NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val assertion = finance.AssertionDLModel.pretrained("finassertion_competitors", "en", "finance/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter,
assertion
))
val assertionModel = pipeline.fit(data)
val data = Seq("This is an Intellectual Property Agreement between Amazon Inc. and Atlantic Inc.").toDF("text")
val document_assembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings_ner = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings_ner")
val ner_model = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings_ner"))
.setOutputCol("ner")
val ner_converter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList(Array("DOC", "EFFDATE", "PARTY"))
val embeddings_ass = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings_ass")
val assertion = legal.AssertionDLModel.pretrained("legassertion_time", "en", "legal/models")
.setInputCols(Array("sentence", "ner_chunk", "embeddings_ass"))
.setOutputCol("assertion")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
embeddings_ner,
ner_model,
ner_converter,
embeddings_ass,
assertion
))
val assertionModel = pipeline.fit(data)
Trains AssertionDL, a deep learning-based approach used to extract Assertion Status from extracted entities and text. Contains all the methods for training an AssertionDLModel. For pretrained models please use AssertionDLModel and see the Models Hub for available models.
Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS
Output Annotator Type: ASSERTION
Python API: AssertionDLApproach | Scala API: AssertionDLApproach |
Show Example
from johnsnowlabs import *
# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
chunk = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setOutputCol("chunk")
token = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
# Define AssertionDLApproach with parameters and start training
assertionStatus = medical.AssertionDLApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "embeddings"]) \
.setOutputCol("assertion") \
.setBatchSize(128) \
.setDropout(0.012) \
.setLearningRate(0.015) \
.setEpochs(1) \
.setStartCol("start") \
.setEndCol("end") \
.setMaxSentLen(250)
trainingPipeline = Pipeline().setStages([
document,
chunk,
token,
embeddings,
assertionStatus
])
assertionModel = trainingPipeline.fit(data)
assertionResults = assertionModel.transform(data).cache()
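The data DataFrame referenced above is assumed to already contain the text, the assertion label, and the token-level start/end indices of the target chunk. A minimal, purely hypothetical construction could look like this:
data = spark.createDataFrame(
[
("Patient denies any chest pain.", "absent", 3, 4),
("Patient reports severe headache.", "present", 2, 3),
],
["text", "label", "start", "end"],
)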
from johnsnowlabs import *
# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
chunk = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setOutputCol("chunk")
token = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
# Define AssertionDLApproach with parameters and start training
assertionStatus = finance.AssertionDLApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "embeddings"]) \
.setOutputCol("assertion") \
.setBatchSize(128) \
.setDropout(0.012) \
.setLearningRate(0.015) \
.setEpochs(1) \
.setStartCol("start") \
.setEndCol("end") \
.setMaxSentLen(250)
trainingPipeline = Pipeline().setStages([
document,
chunk,
token,
embeddings,
assertionStatus
])
assertionModel = trainingPipeline.fit(data)
assertionResults = assertionModel.transform(data).cache()
from johnsnowlabs import *
# First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
document = nlp.DocumentAssembler()\
.setInputCol("sentence")\
.setOutputCol("document")
chunk = nlp.Doc2Chunk()\
.setInputCols("document")\
.setOutputCol("doc_chunk")
token = nlp.Tokenizer()\
.setInputCols(['document'])\
.setOutputCol('token')
roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setMaxSentenceLength(512)
# Define AssertionDLApproach with parameters and start training
assertionStatus = legal.AssertionDLApproach()\
.setLabelCol("assertion_label")\
.setInputCols("document", "doc_chunk", "embeddings")\
.setOutputCol("assertion")\
.setBatchSize(128)\
.setLearningRate(0.001)\
.setEpochs(2)\
.setStartCol("tkn_start")\
.setEndCol("tkn_end")\
.setMaxSentLen(1200)\
.setEnableOutputLogs(True)\
.setOutputLogsPath('training_logs/')\
.setGraphFolder(graph_folder)\
.setGraphFile(f"{graph_folder}/assertion_graph.pb")\
.setTestDataset(path="test_data.parquet", read_as='SPARK', options={'format': 'parquet'})\
.setScopeWindow(scope_window)
#.setValidationSplit(0.2)\
#.setDropout(0.1)\
trainingPipeline = Pipeline().setStages([
document,
chunk,
token,
roberta_embeddings,
assertionStatus
])
assertionModel = trainingPipeline.fit(data)
assertionResults = assertionModel.transform(data).cache()
// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
val document = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val chunk = new nlp.Doc2Chunk()
.setInputCols("document")
.setOutputCol("chunk")
val token = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols("document", "token")
.setOutputCol("embeddings")
// Define AssertionDLApproach with parameters and start training
val assertionStatus = new medical.AssertionDLApproach()
.setLabelCol("label")
.setInputCols("document", "chunk", "embeddings")
.setOutputCol("assertion")
.setBatchSize(128)
.setDropout(0.012f)
.setLearningRate(0.015f)
.setEpochs(1)
.setStartCol("start")
.setEndCol("end")
.setMaxSentLen(250)
val trainingPipeline = new Pipeline().setStages(Array(
document,
chunk,
token,
embeddings,
assertionStatus
))
val assertionModel = trainingPipeline.fit(data)
val assertionResults = assertionModel.transform(data).cache()
// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
val document = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val chunk = new nlp.Doc2Chunk()
.setInputCols("document")
.setOutputCol("chunk")
val token = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols("document", "token")
.setOutputCol("embeddings")
// Define AssertionDLApproach with parameters and start training
val assertionStatus = new finance.AssertionDLApproach()
.setLabelCol("label")
.setInputCols("document", "chunk", "embeddings")
.setOutputCol("assertion")
.setBatchSize(128)
.setDropout(0.012f)
.setLearningRate(0.015f)
.setEpochs(1)
.setStartCol("start")
.setEndCol("end")
.setMaxSentLen(250)
val trainingPipeline = new Pipeline().setStages(Array(
document,
chunk,
token,
embeddings,
assertionStatus
))
val assertionModel = trainingPipeline.fit(data)
val assertionResults = assertionModel.transform(data).cache()
// First, pipeline stages for pre-processing the dataset (containing columns for text and label) are defined.
val document = new nlp.DocumentAssembler()
.setInputCol("sentence")
.setOutputCol("document")
val chunk = new nlp.Doc2Chunk()
.setInputCols("document")
.setOutputCol("doc_chunk")
.setChunkCol("chunk")
.setStartCol("tkn_start")
.setStartColByTokenIndex(true)
.setFailOnMissing(false)
.setLowerCase(false)
val token = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
.setMaxSentenceLength(512)
// Define AssertionDLApproach with parameters and start training
val assertionStatus = new legal.AssertionDLApproach()
.setLabelCol("assertion_label")
.setInputCols("document", "doc_chunk", "embeddings")
.setOutputCol("assertion")
.setBatchSize(128)
.setLearningRate(0.001f)
.setEpochs(2)
.setStartCol("tkn_start")
.setEndCol("tkn_end")
.setMaxSentLen(1200)
.setEnableOutputLogs(true)
.setOutputLogsPath("training_logs/")
.setGraphFolder(graph_folder)
.setGraphFile(s"${graph_folder}/assertion_graph.pb")
.setTestDataset("test_data.parquet")
.setScopeWindow(scope_window)
// .setValidationSplit(0.2f)
// .setDropout(0.1f)
val trainingPipeline = new Pipeline().setStages(Array(
document,
chunk,
token,
roberta_embeddings,
assertionStatus
))
val assertionModel = trainingPipeline.fit(data)
val assertionResults = assertionModel.transform(data).cache()
AssertionFilterer
Filters entities coming from ASSERTION type annotations and returns the CHUNKS.
Filters can be set via a white list on the extracted chunk, the assertion, or a regular expression.
The white list for the assertion is enabled by default. To use a chunk white list, criteria has to be set to "isin". For regex, criteria has to be set to "regex".
Input Annotator Types: DOCUMENT, CHUNK, ASSERTION
Output Annotator Type: CHUNK
Python API: AssertionFilterer | Scala API: AssertionFilterer |
Show Example
from johnsnowlabs import *
# To see how the assertions are extracted, see the example for AssertionDLModel.
# Define an extra step where the assertions are filtered
assertionFilterer = medical.AssertionFilterer() \
.setInputCols(["sentence","ner_chunk","assertion"]) \
.setOutputCol("filtered") \
.setCriteria("assertion") \
.setWhiteList(["present"])
assertionPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
clinicalAssertion,
assertionFilterer
])
assertionModel = assertionPipeline.fit(data)
result = assertionModel.transform(data)
# Show results:
result.selectExpr("ner_chunk.result", "assertion.result").show(3, truncate=False)
+--------------------------------+--------------------------------+
|result |result |
+--------------------------------+--------------------------------+
|[severe fever, sore throat] |[present, present] |
|[stomach pain] |[absent] |
|[an epidural, PCA, pain control]|[present, present, hypothetical]|
+--------------------------------+--------------------------------+
result.select("filtered.result").show(3, truncate=False)
+---------------------------+
|result |
+---------------------------+
|[severe fever, sore throat]|
|[] |
|[an epidural, PCA] |
+---------------------------+
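The same filterer can also white-list chunk texts instead of assertion labels. A minimal sketch, reusing the pipeline variables defined above and switching criteria to "isin" (the white-listed chunk texts are illustrative):
chunkTextFilterer = medical.AssertionFilterer() \
.setInputCols(["sentence", "ner_chunk", "assertion"]) \
.setOutputCol("filtered_by_chunk") \
.setCriteria("isin") \
.setWhiteList(["severe fever", "stomach pain"])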
from johnsnowlabs import *
# To see how the assertions are extracted, see the example for AssertionDLModel.
# Define an extra step where the assertions are filtered
assertionFilterer = finance.AssertionFilterer() \
.setInputCols(["sentence","ner_chunk","assertion"]) \
.setOutputCol("filtered") \
.setCriteria("assertion") \
.setWhiteList(["present"])
assertionPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
clinicalAssertion,
assertionFilterer
])
assertionModel = assertionPipeline.fit(data)
result = assertionModel.transform(data)
from johnsnowlabs import *
# To see how the assertions are extracted, see the example for AssertionDLModel.
# Define an extra step where the assertions are filtered
assertionFilterer = legal.AssertionFilterer() \
.setInputCols(["sentence","ner_chunk","assertion"]) \
.setOutputCol("filtered") \
.setCriteria("assertion") \
.setWhiteList(["present"])
assertionPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
clinicalAssertion,
assertionFilterer
])
assertionModel = assertionPipeline.fit(data)
result = assertionModel.transform(data)
// To see how the assertions are extracted, see the example for
// [[com.johnsnowlabs.nlp.annotators.assertion.dl.AssertionDLModel AssertionDLModel]].
// Define an extra step where the assertions are filtered
val assertionFilterer = new medical.AssertionFilterer()
.setInputCols("sentence","ner_chunk","assertion")
.setOutputCol("filtered")
.setCriteria("assertion")
.setWhiteList("present")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
clinicalAssertion,
assertionFilterer
))
val assertionModel = assertionPipeline.fit(data)
val result = assertionModel.transform(data)
// Show results:
//
// result.selectExpr("ner_chunk.result", "assertion.result").show(3, truncate=false)
// +--------------------------------+--------------------------------+
// |result |result |
// +--------------------------------+--------------------------------+
// |[severe fever, sore throat] |[present, present] |
// |[stomach pain] |[absent] |
// |[an epidural, PCA, pain control]|[present, present, hypothetical]|
// +--------------------------------+--------------------------------+
// result.select("filtered.result").show(3, truncate=false)
// +---------------------------+
// |result |
// +---------------------------+
// |[severe fever, sore throat]|
// |[] |
// |[an epidural, PCA] |
// +---------------------------+
//
// To see how the assertions are extracted, see the example for
// [[com.johnsnowlabs.nlp.annotators.assertion.dl.AssertionDLModel AssertionDLModel]].
// Define an extra step where the assertions are filtered
val assertionFilterer = new legal.AssertionFilterer()
.setInputCols("sentence","ner_chunk","assertion")
.setOutputCol("filtered")
.setCriteria("assertion")
.setWhiteList("present")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
clinicalAssertion,
assertionFilterer
))
val assertionModel = assertionPipeline.fit(data)
val result = assertionModel.transform(data)
// To see how the assertions are extracted, see the example for
// [[com.johnsnowlabs.nlp.annotators.assertion.dl.AssertionDLModel AssertionDLModel]].
// Define an extra step where the assertions are filtered
val assertionFilterer = new legal.AssertionFilterer()
.setInputCols("sentence","ner_chunk","assertion")
.setOutputCol("filtered")
.setCriteria("assertion")
.setWhiteList("present")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
nerModel,
nerConverter,
clinicalAssertion,
assertionFilterer
))
val assertionModel = assertionPipeline.fit(data)
val result = assertionModel.transform(data)
AssertionLogReg
This is the main class in the AssertionLogReg family. Logistic regression is used to extract Assertion Status from extracted entities and text. AssertionLogRegModel requires DOCUMENT, CHUNK and WORD_EMBEDDINGS type annotator inputs, which can be obtained from, e.g., a DocumentAssembler, NerConverter and WordEmbeddingsModel.
The result is an assertion status annotation for each recognized entity.
Possible values are "Negated", "Affirmed" and "Historical".
Unlike the DL model, this class does not extend AnnotatorModel; it extends RawAnnotator instead, so the main point of interest is the transform() method.
At the moment there are no pretrained models available for this class. Please refer to AssertionLogRegApproach to train your own model.
Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS
Output Annotator Type: ASSERTION
Python API: AssertionLogRegModel | Scala API: AssertionLogRegModel |
Trains a classification model using the logistic regression algorithm, used to extract Assertion Status from extracted entities and text. Contains all the methods for training an AssertionLogRegModel, together with trainWithChunk and trainWithStartEnd.
Input Annotator Types: DOCUMENT, CHUNK, WORD_EMBEDDINGS
Output Annotator Type: ASSERTION
Python API: AssertionLogRegApproach | Scala API: AssertionLogRegApproach |
Show Example
from johnsnowlabs import *
# Training with Glove Embeddings
# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("word_embeddings") \
.setCaseSensitive(False)
chunkAssembler = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setChunkCol("target") \
.setOutputCol("chunk")
# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = medical.AssertionLogRegApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "word_embeddings"]) \
.setOutputCol("assertion") \
.setReg(0.01) \
.setBefore(11) \
.setAfter(13) \
.setStartCol("start") \
.setEndCol("end")
assertionPipeline = Pipeline(stages=[
documentAssembler,
chunkAssembler,
tokenizer,
glove,
assertion
])
assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import *
# Training with Glove Embeddings
# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("word_embeddings") \
.setCaseSensitive(False)
chunkAssembler = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setChunkCol("target") \
.setOutputCol("chunk")
# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = finance.AssertionLogRegApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "word_embeddings"]) \
.setOutputCol("assertion") \
.setReg(0.01) \
.setBefore(11) \
.setAfter(13) \
.setStartCol("start") \
.setEndCol("end")
assertionPipeline = Pipeline(stages=[
documentAssembler,
chunkAssembler,
tokenizer,
glove,
assertion
])
assertionModel = assertionPipeline.fit(dataset)
from johnsnowlabs import *
# Training with Glove Embeddings
# First define pipeline stages to extract embeddings and text chunks
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("word_embeddings") \
.setCaseSensitive(False)
chunkAssembler = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setChunkCol("target") \
.setOutputCol("chunk")
# Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
assertion = legal.AssertionLogRegApproach() \
.setLabelCol("label") \
.setInputCols(["document", "chunk", "word_embeddings"]) \
.setOutputCol("assertion") \
.setReg(0.01) \
.setBefore(11) \
.setAfter(13) \
.setStartCol("start") \
.setEndCol("end")
assertionPipeline = Pipeline(stages=[
documentAssembler,
chunkAssembler,
tokenizer,
glove,
assertion
])
assertionModel = assertionPipeline.fit(dataset)
// Training with Glove Embeddings
// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
.setCaseSensitive(false)
val chunkAssembler = new nlp.Doc2Chunk()
.setInputCols("document")
.setChunkCol("target")
.setOutputCol("chunk")
// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new medical.AssertionLogRegApproach()
.setLabelCol("label")
.setInputCols(Array("document", "chunk", "word_embeddings"))
.setOutputCol("assertion")
.setReg(0.01)
.setBefore(11)
.setAfter(13)
.setStartCol("start")
.setEndCol("end")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
chunkAssembler,
tokenizer,
glove,
assertion
))
val assertionModel = assertionPipeline.fit(dataset)
// Training with Glove Embeddings
// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
.setCaseSensitive(false)
val chunkAssembler = new nlp.Doc2Chunk()
.setInputCols("document")
.setChunkCol("target")
.setOutputCol("chunk")
// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new finance.AssertionLogRegApproach()
.setLabelCol("label")
.setInputCols(Array("document", "chunk", "word_embeddings"))
.setOutputCol("assertion")
.setReg(0.01)
.setBefore(11)
.setAfter(13)
.setStartCol("start")
.setEndCol("end")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
chunkAssembler,
tokenizer,
glove,
assertion
))
val assertionModel = assertionPipeline.fit(dataset)
// Training with Glove Embeddings
// First define pipeline stages to extract embeddings and text chunks
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val glove = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("word_embeddings")
.setCaseSensitive(false)
val chunkAssembler = new nlp.Doc2Chunk()
.setInputCols("document")
.setChunkCol("target")
.setOutputCol("chunk")
// Then the AssertionLogRegApproach model is defined. Label column is needed in the dataset for training.
val assertion = new legal.AssertionLogRegApproach()
.setLabelCol("label")
.setInputCols(Array("document", "chunk", "word_embeddings"))
.setOutputCol("assertion")
.setReg(0.01)
.setBefore(11)
.setAfter(13)
.setStartCol("start")
.setEndCol("end")
val assertionPipeline = new Pipeline().setStages(Array(
documentAssembler,
chunkAssembler,
tokenizer,
glove,
assertion
))
val assertionModel = assertionPipeline.fit(dataset)
BertSentenceChunkEmbeddings
This annotator allows aggregating sentence embeddings with NER chunk embeddings to get more specific and accurate resolution codes. It works by averaging the sentence and chunk embeddings to add contextual information to the embedding value. The inputs to this annotator are the context (sentence) and the NER chunks, while the output is an embedding for each chunk that can be fed to the resolver model.
The setChunkWeight parameter can be used to control the influence of the surrounding context.
For more information and examples of the BertSentenceChunkEmbeddings annotator, you can check the Spark NLP Workshop, in particular the notebook 24.1.Improved_Entity_Resolution_with_SentenceChunkEmbeddings.ipynb.
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: SENTENCE_EMBEDDINGS
Python API: BertSentenceChunkEmbeddings | Scala API: BertSentenceChunkEmbeddings |
Show Example
# Define the pipeline
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = nlp.Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["document", "token"])\
.setOutputCol("word_embeddings")
clinical_ner = medical.NerModel.pretrained("ner_abbreviation_clinical", "en", "clinical/models") \
.setInputCols(["document", "token", "word_embeddings"]) \
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal() \
.setInputCols(["document", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(['ABBR'])
sentence_chunk_embeddings = medical.BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["document", "ner_chunk"])\
.setOutputCol("sentence_embeddings")\
.setChunkWeight(0.5)\
.setCaseSensitive(True)
abbr_resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_clinical_abbreviation_acronym", "en", "clinical/models") \
.setInputCols(["ner_chunk", "sentence_embeddings"]) \
.setOutputCol("abbr_meaning")\
.setDistanceFunction("EUCLIDEAN")
resolver_pipeline = Pipeline(
stages = [
document_assembler,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
sentence_chunk_embeddings,
abbr_resolver
])
# Example results
sample_text = [
"""The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSES: 1. A 58-year-old female with a history of stage 2 squamous cell carcinoma of the cervix status post total pelvic exenteration in 1991.""",
"""Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA: Laboratory tests include a CBC which is normal.
Blood Type: AB positive. Rubella: Immune. VDRL: Nonreactive. Hepatitis C surface antigen: Negative. HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]
from pyspark.sql.types import StringType, IntegerType
df = spark.createDataFrame(sample_text, StringType()).toDF('text')
df.show(truncate = 100)
+----------------------------------------------------------------------------------------------------+
| text|
+----------------------------------------------------------------------------------------------------+
|The patient admitted from the IR for aggressive irrigation of the Miami pouch. DISCHARGE DIAGNOSE...|
|Gravid with estimated fetal weight of 6-6/12 pounds. LOWER EXTREMITIES: No edema. LABORATORY DATA...|
+----------------------------------------------------------------------------------------------------+
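To actually run the pipeline defined above, it can be fit on an empty DataFrame and then applied to the sample texts. A minimal sketch following the result-inspection pattern used elsewhere on this page (the resolved meanings depend on the pretrained models):
from pyspark.sql import functions as F
empty_df = spark.createDataFrame([[""]]).toDF("text")
resolver_model = resolver_pipeline.fit(empty_df)
result = resolver_model.transform(df)
result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.abbr_meaning.result)).alias("cols"))\
.select(F.expr("cols['0']").alias("chunk"), F.expr("cols['1']").alias("abbr_meaning")).show(truncate=False)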
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("tokens")
val wordEmbeddings = BertEmbeddings
.pretrained("biobert_pubmed_base_cased")
.setInputCols(Array("sentence", "tokens"))
.setOutputCol("word_embeddings")
val nerModel = MedicalNerModel
.pretrained("ner_clinical_biobert", "en", "clinical/models")
.setInputCols(Array("sentence", "tokens", "word_embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverter()
.setInputCols("sentence", "tokens", "ner")
.setOutputCol("ner_chunk")
val sentenceChunkEmbeddings = BertSentenceChunkEmbeddings
.pretrained("sbluebert_base_uncased_mli", "en", "clinical/models")
.setInputCols(Array("sentence", "ner_chunk"))
.setOutputCol("sentence_chunk_embeddings")
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
wordEmbeddings,
nerModel,
nerConverter,
sentenceChunkEmbeddings))
val sampleText = "Her Diabetes has become type 2 in the last year with her Diabetes." +
" He complains of swelling in his right forearm."
val emptyDataset = Seq("").toDS.toDF("text")
val testDataset = Seq(sampleText).toDS.toDF("text")
val result = pipeline.fit(emptyDataset).transform(testDataset)
result
.selectExpr("explode(sentence_chunk_embeddings) AS s")
.selectExpr("s.result", "slice(s.embeddings, 1, 5) AS averageEmbedding")
.show(truncate=false)
+-----------------------------+-----------------------------------------------------------------+
| result| averageEmbedding|
+-----------------------------+-----------------------------------------------------------------+
|Her Diabetes |[-0.31995273, -0.04710883, -0.28973156, -0.1294758, 0.12481072] |
|type 2 |[-0.027161136, -0.24613449, -0.0949309, 0.1825444, -0.2252143] |
|her Diabetes |[-0.31995273, -0.04710883, -0.28973156, -0.1294758, 0.12481072] |
|swelling in his right forearm|[-0.45139068, 0.12400375, -0.0075617577, -0.90806055, 0.12871636]|
+-----------------------------+-----------------------------------------------------------------+
Chunk2Token
A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN).
When the input is empty, an empty array is returned.
This annotator is especially convenient when using NGramGenerator annotations as inputs to WordEmbeddingsModel.
Input Annotator Types: CHUNK
Output Annotator Type: TOKEN
Scala API: Chunk2Token |
Show Example
from johnsnowlabs import *
# Define a pipeline for generating n-grams
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
ngrammer = nlp.NGramGenerator() \
.setN(2) \
.setEnableCumulative(False) \
.setInputCols(["token"]) \
.setOutputCol("ngrams") \
.setDelimiter("_")
# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = medical.Chunk2Token().setInputCols(["ngrams"]).setOutputCol("ngram_tokens")
trainingPipeline = Pipeline(stages=[document, sentenceDetector, token, ngrammer, chunk2Token]).fit(data)
result = trainingPipeline.transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, False)
+----------------------------------------------------------------+
|col |
+----------------------------------------------------------------+
|{token, 3, 15, A_63-year-old, {sentence -> 0, chunk -> 0}, []} |
|{token, 5, 19, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 17, 28, man_presents, {sentence -> 0, chunk -> 2}, []} |
|{token, 21, 31, presents_to, {sentence -> 0, chunk -> 3}, []} |
|{token, 30, 35, to_the, {sentence -> 0, chunk -> 4}, []} |
+----------------------------------------------------------------+
from johnsnowlabs import *
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
ngrammer = nlp.NGramGenerator() \
.setN(2) \
.setEnableCumulative(False) \
.setInputCols(["token"]) \
.setOutputCol("ngrams") \
.setDelimiter("_")
# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = finance.Chunk2Token().setInputCols(["ngrams"]).setOutputCol("ngram_tokens")
trainingPipeline = Pipeline(stages=[document, sentenceDetector, token, ngrammer, chunk2Token])
from johnsnowlabs import *
# Define a pipeline for generating n-grams
document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
ngrammer = nlp.NGramGenerator() \
.setN(2) \
.setEnableCumulative(False) \
.setInputCols(["token"]) \
.setOutputCol("ngrams") \
.setDelimiter("_")
# Stage to convert n-gram CHUNKS to TOKEN type
chunk2Token = legal.Chunk2Token().setInputCols(["ngrams"]).setOutputCol("ngram_tokens")
trainingPipeline = Pipeline(stages=[document, sentenceDetector, token, ngrammer, chunk2Token])
// Define a pipeline for generating n-grams
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val token = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val ngrammer = new nlp.NGramGenerator()
.setN(2)
.setEnableCumulative(false)
.setInputCols("token")
.setOutputCol("ngrams")
.setDelimiter("_")
// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new medical.Chunk2Token().setInputCols("ngrams").setOutputCol("ngram_tokens")
val trainingPipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, ngrammer, chunk2Token)).fit(data)
val result = trainingPipeline.transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, false)
+----------------------------------------------------------------+
|col |
+----------------------------------------------------------------+
|{token, 3, 15, A_63-year-old, {sentence -> 0, chunk -> 0}, []} |
|{token, 5, 19, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 17, 28, man_presents, {sentence -> 0, chunk -> 2}, []} |
|{token, 21, 31, presents_to, {sentence -> 0, chunk -> 3}, []} |
|{token, 30, 35, to_the, {sentence -> 0, chunk -> 4}, []} |
+----------------------------------------------------------------+
// Define a pipeline for generating n-grams
val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val token = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val ngrammer = new nlp.NGramGenerator()
.setN(2)
.setEnableCumulative(false)
.setInputCols("token")
.setOutputCol("ngrams")
.setDelimiter("_")
// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new finance.Chunk2Token().setInputCols("ngrams").setOutputCol("ngram_tokens")
val trainingPipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, ngrammer, chunk2Token))
// Define a pipeline for generating n-grams
val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val token = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val ngrammer = new nlp.NGramGenerator()
.setN(2)
.setEnableCumulative(false)
.setInputCols("token")
.setOutputCol("ngrams")
.setDelimiter("_")
// Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new legal.Chunk2Token().setInputCols("ngrams").setOutputCol("ngram_tokens")
val trainingPipeline = new Pipeline().setStages(Array(document, sentenceDetector, token, ngrammer, chunk2Token))
ChunkConverter
Converts chunks from RegexMatcher to chunks with an entity in the metadata.
This annotator is useful when the user wants to merge entities identified by NER models with rule-based matches from the RegexMatcher annotator. In the following steps of the pipeline, all the identified entities can then be treated in a unified field.
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: CHUNK
Python API: ChunkConverter | Scala API: ChunkConverter |
Show Example
# Creating the pipeline
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models") \
.setInputCols("sentence","token","embeddings") \
.setOutputCol("ner")
ner_converter= NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
regex_matcher = RegexMatcher()\
.setInputCols('document')\
.setStrategy("MATCH_ALL")\
.setOutputCol("regex_matches")\
.setExternalRules(path='file:/dbfs/regex_rules.txt', delimiter=',')
chunkConverter = ChunkConverter()\
.setInputCols("regex_matches")\
.setOutputCol("regex_chunk")
merger= ChunkMergeApproach()\
.setInputCols(["regex_chunk", "ner_chunk"])\
.setOutputCol("merged_chunks")\
.setMergeOverlapping(True)\
.setChunkPrecedence("field")
pipeline= Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
regex_matcher,
chunkConverter,
merger
])
empty_df= spark.createDataFrame([[""]]).toDF("text")
model= pipeline.fit(empty_df)
lp_model = LightPipeline(model)
results = lp_model.fullAnnotate(sample_text)[0]  # sample_text holds the clinical note to annotate
# Displaying the results
import pandas as pd
chunk= []
merge= []
for result in list(results["merged_chunks"]):
merge.append(result.metadata["entity"])
chunk.append(result.result)
df_merge = pd.DataFrame({"chunk": chunk, "merged_entity": merge})
df_merge
| chunk | merged_entity |
|-----------------------------------------------:|---------------:|
| POSTOPERATIVE DIAGNOSIS: | SECTION_HEADER |
| Cervical lymphadenopathy | PROBLEM |
| PROCEDURE: | SECTION_HEADER |
| Excisional biopsy of right cervical lymph node | TEST |
| ANESTHESIA: | SECTION_HEADER |
| General endotracheal anesthesia | TREATMENT |
| Right cervical lymph node | PROBLEM |
| EBL: | SECTION_HEADER |
| COMPLICATIONS: | SECTION_HEADER |
| FINDINGS: | SECTION_HEADER |
| Enlarged level 2 lymph node | PROBLEM |
| ... | |
val sampleDataset = ResourceHelper.spark.createDataFrame(Seq(
(1, "My first sentence with the first rule. This is my second sentence with ceremonies rule.")
)).toDF("id", "text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val regexMatcher = new RegexMatcher()
.setExternalRules(ExternalResource("src/test/resources/regex-matcher/rules.txt", ReadAs.TEXT, Map("delimiter" -> ",")))
.setInputCols(Array("sentence"))
.setOutputCol("regex")
.setStrategy("MATCH_ALL")
val chunkConverter = new ChunkConverter().setInputCols("regex").setOutputCol("chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher,chunkConverter))
val results = pipeline.fit(sampleDataset).transform(sampleDataset)
results.select("chunk").show(truncate = false)
+------------------------------------------------------------------------------------------------+
|col |
+------------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> NAME, sentence -> 0, chunk -> 0, entity -> NAME], []] |
|[chunk, 71, 80, ceremonies, [identifier -> NAME, sentence -> 1, chunk -> 0, entity -> NAME], []]|
+------------------------------------------------------------------------------------------------+
ChunkEntityResolver
Returns a normalized entity for a particular trained ontology / curated dataset (e.g. ICD-10, RxNorm, SNOMED etc).
For available pretrained models please see the Models Hub.
Input Annotator Types: TOKEN, WORD_EMBEDDINGS
Output Annotator Type: ENTITY
Scala API: ChunkEntityResolverModel |
Show Example
from johnsnowlabs import *
# Using pretrained models for SNOMED
# First the prior steps of the pipeline are defined.
# Outputs of types TOKEN and WORD_EMBEDDINGS are needed.
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("word_embeddings")
icdo_ner = medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models") \
.setInputCols(["sentence", "token", "word_embeddings"]) \
.setOutputCol("icdo_ner")
icdo_chunk = nlp.NerConverter().setInputCols(["sentence","token","icdo_ner"]).setOutputCol("icdo_chunk").setWhiteList(["Cancer"])
icdo_chunk_embeddings = nlp.ChunkEmbeddings() \
.setInputCols(["icdo_chunk", "word_embeddings"]) \
.setOutputCol("icdo_chunk_embeddings")
icdo_chunk_resolver = medical.ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical", "en", "clinical/models") \
.setInputCols(["token","icdo_chunk_embeddings"]) \
.setOutputCol("tm_icdo_code")
clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token", "word_embeddings"]) \
.setOutputCol("ner")
ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
ner_chunk_tokenizer = nlp.ChunkTokenizer() \
.setInputCols(["ner_chunk"]) \
.setOutputCol("ner_token")
ner_chunk_embeddings = nlp.ChunkEmbeddings() \
.setInputCols(["ner_chunk", "word_embeddings"]) \
.setOutputCol("ner_chunk_embeddings")
# Definition of the SNOMED Resolution
ner_snomed_resolver = medical.ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models") \
.setInputCols(["ner_token","ner_chunk_embeddings"]).setOutputCol("snomed_result")
pipelineFull = Pipeline().setStages([
docAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
ner_chunk_embeddings,
ner_chunk_tokenizer,
ner_snomed_resolver,
icdo_ner,
icdo_chunk,
icdo_chunk_embeddings,
icdo_chunk_resolver
])
pipelineModelFull = pipelineFull.fit(data)
result = pipelineModelFull.transform(data).cache()
# Show results
result.selectExpr("explode(snomed_result)")
.selectExpr(
"col.metadata.target_text",
"col.metadata.resolved_text",
"col.metadata.confidence",
"col.metadata.all_k_results",
"col.metadata.all_k_resolutions")
.filter($"confidence" > 0.2).show(5)
+--------------------+--------------------+----------+--------------------+--------------------+
| target_text| resolved_text|confidence| all_k_results| all_k_resolutions|
+--------------------+--------------------+----------+--------------------+--------------------+
|hypercholesterolemia|Hypercholesterolemia| 0.2524|13644009:::267432...|Hypercholesterole...|
| CBC| Neocyte| 0.4980|259680000:::11573...|Neocyte:::Blood g...|
| CD38| Hypoviscosity| 0.2560|47872005:::370970...|Hypoviscosity:::E...|
| platelets| Increased platelets| 0.5267|6631009:::2596800...|Increased platele...|
| CD38| Hypoviscosity| 0.2560|47872005:::370970...|Hypoviscosity:::E...|
+--------------------+--------------------+----------+--------------------+--------------------+
from johnsnowlabs import *
// Using pretrained models for SNOMED
// First the prior steps of the pipeline are defined.
// Outputs of types TOKEN and WORD_EMBEDDINGS are needed.
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("word_embeddings")
val icdo_ner = medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "word_embeddings"))
.setOutputCol("icdo_ner")
val icdo_chunk = new nlp.NerConverter().setInputCols(Array("sentence","token","icdo_ner")).setOutputCol("icdo_chunk").setWhiteList("Cancer")
val icdo_chunk_embeddings = new nlp.ChunkEmbeddings()
.setInputCols(Array("icdo_chunk", "word_embeddings"))
.setOutputCol("icdo_chunk_embeddings")
val icdo_chunk_resolver = medical.ChunkEntityResolverModel.pretrained("chunkresolve_icdo_clinical", "en", "clinical/models")
.setInputCols(Array("token","icdo_chunk_embeddings"))
.setOutputCol("tm_icdo_code")
val clinical_ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "word_embeddings"))
.setOutputCol("ner")
val ner_converter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val ner_chunk_tokenizer = new nlp.ChunkTokenizer()
.setInputCols("ner_chunk")
.setOutputCol("ner_token")
val ner_chunk_embeddings = new nlp.ChunkEmbeddings()
.setInputCols(Array("ner_chunk", "word_embeddings"))
.setOutputCol("ner_chunk_embeddings")
// Definition of the SNOMED Resolution
val ner_snomed_resolver = medical.ChunkEntityResolverModel.pretrained("chunkresolve_snomed_findings_clinical","en","clinical/models")
.setInputCols(Array("ner_token","ner_chunk_embeddings")).setOutputCol("snomed_result")
val pipelineFull = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
ner_chunk_embeddings,
ner_chunk_tokenizer,
ner_snomed_resolver,
icdo_ner,
icdo_chunk,
icdo_chunk_embeddings,
icdo_chunk_resolver
))
val pipelineModelFull = pipelineFull.fit(data)
val result = pipelineModelFull.transform(data).cache()
// Show results
//
// result.selectExpr("explode(snomed_result)")
// .selectExpr(
// "col.metadata.target_text",
// "col.metadata.resolved_text",
// "col.metadata.confidence",
// "col.metadata.all_k_results",
// "col.metadata.all_k_resolutions")
// .filter($"confidence" > 0.2).show(5)
// +--------------------+--------------------+----------+--------------------+--------------------+
// | target_text| resolved_text|confidence| all_k_results| all_k_resolutions|
// +--------------------+--------------------+----------+--------------------+--------------------+
// |hypercholesterolemia|Hypercholesterolemia| 0.2524|13644009:::267432...|Hypercholesterole...|
// | CBC| Neocyte| 0.4980|259680000:::11573...|Neocyte:::Blood g...|
// | CD38| Hypoviscosity| 0.2560|47872005:::370970...|Hypoviscosity:::E...|
// | platelets| Increased platelets| 0.5267|6631009:::2596800...|Increased platele...|
// | CD38| Hypoviscosity| 0.2560|47872005:::370970...|Hypoviscosity:::E...|
// +--------------------+--------------------+----------+--------------------+--------------------+
//
Contains all the parameters and methods to train a ChunkEntityResolverModel. It transforms a dataset with two input annotations of types TOKEN and WORD_EMBEDDINGS, coming from e.g. the ChunkTokenizer and ChunkEmbeddings annotators, and returns the normalized entity for a particular trained ontology / curated dataset (e.g. ICD-10, RxNorm, SNOMED etc.).
To use pretrained models please use ChunkEntityResolverModel and see the Models Hub for available models.
Input Annotator Types: TOKEN, WORD_EMBEDDINGS
Output Annotator Type: ENTITY
Scala API: ChunkEntityResolverApproach |
Show Example
from johnsnowlabs import *
# Training a SNOMED model
# Define pre-processing pipeline for training data. It needs to consist of columns
# for the normalized training data and their labels.
document = nlp.DocumentAssembler() \
.setInputCol("normalized_text") \
.setOutputCol("document")
chunk = nlp.Doc2Chunk() \
.setInputCols(["document"]) \
.setOutputCol("chunk")
token = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models") \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings")
chunkEmb = nlp.ChunkEmbeddings() \
.setInputCols(["chunk", "embeddings"]) \
.setOutputCol("chunk_embeddings")
snomedTrainingPipeline = Pipeline().setStages([
document,
chunk,
token,
embeddings,
chunkEmb
])
snomedTrainingModel = snomedTrainingPipeline.fit(data)
snomedData = snomedTrainingModel.transform(data).cache()
# Then the Resolver can be trained with
snomedExtractor = medical.ChunkEntityResolverApproach() \
.setInputCols(["token", "chunk_embeddings"]) \
.setOutputCol("recognized") \
.setNeighbours(1000) \
.setAlternatives(25) \
.setNormalizedCol("normalized_text") \
.setLabelCol("label") \
.setEnableWmd(True).setEnableTfidf(True).setEnableJaccard(True) \
.setEnableSorensenDice(True).setEnableJaroWinkler(True).setEnableLevenshtein(True) \
.setDistanceWeights([1, 2, 2, 1, 1, 1]) \
.setAllDistancesMetadata(True) \
.setPoolingStrategy("MAX") \
.setThreshold(1e32)
model = snomedExtractor.fit(snomedData)
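Once trained, the resolver can be combined with the same pre-processing stages for inference. A minimal sketch, reusing the stages and the `data` DataFrame from above:
# Minimal sketch: plug the trained resolver into an inference pipeline
inferencePipeline = Pipeline().setStages([
    document,
    chunk,
    token,
    embeddings,
    chunkEmb,
    model
])
resolved = inferencePipeline.fit(data).transform(data)
resolved.selectExpr("explode(recognized)").show(5, truncate=False)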
from johnsnowlabs import *
// Training a SNOMED model
// Define pre-processing pipeline for training data. It needs to consist of columns
// for the normalized training data and their labels.
val document = new nlp.DocumentAssembler()
.setInputCol("normalized_text")
.setOutputCol("document")
val chunk = new nlp.Doc2Chunk()
.setInputCols("document")
.setOutputCol("chunk")
val token = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_healthcare_100d", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val chunkEmb = new nlp.ChunkEmbeddings()
.setInputCols(Array("chunk", "embeddings"))
.setOutputCol("chunk_embeddings")
val snomedTrainingPipeline = new Pipeline().setStages(Array(
document,
chunk,
token,
embeddings,
chunkEmb
))
val snomedTrainingModel = snomedTrainingPipeline.fit(data)
val snomedData = snomedTrainingModel.transform(data).cache()
// Then the Resolver can be trained with
val snomedExtractor = new medical.ChunkEntityResolverApproach()
.setInputCols(Array("token", "chunk_embeddings"))
.setOutputCol("recognized")
.setNeighbours(1000)
.setAlternatives(25)
.setNormalizedCol("normalized_text")
.setLabelCol("label")
.setEnableWmd(true).setEnableTfidf(true).setEnableJaccard(true)
.setEnableSorensenDice(true).setEnableJaroWinkler(true).setEnableLevenshtein(true)
.setDistanceWeights(Array(1, 2, 2, 1, 1, 1))
.setAllDistancesMetadata(true)
.setPoolingStrategy("MAX")
.setThreshold(1e32)
val model = snomedExtractor.fit(snomedData)
ChunkFilterer
Filters entities coming from CHUNK annotations. Filters can be set via a white list of terms or a regular expression.
White list criteria is enabled by default. To use regular expressions, criteria has to be set to regex (a regex sketch is shown after the examples below).
Input Annotator Types: DOCUMENT,CHUNK
Output Annotator Type: CHUNK
Python API: ChunkFilterer | Scala API: ChunkFilterer |
Show Example
from johnsnowlabs import *
# Filtering POS tags
# First pipeline stages to extract the POS tags are defined
data = spark.createDataFrame([["Has a past history of gastroenteritis and stomach pain, however patient ..."]]).toDF("text")
docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
posTagger = nlp.PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
chunker = nlp.Chunker() \
.setInputCols(["pos", "sentence"]) \
.setOutputCol("chunk") \
.setRegexParsers(["(<NN>)+"])
# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = medical.ChunkFilterer() \
.setInputCols(["sentence","chunk"]) \
.setOutputCol("filtered") \
.setCriteria("isin") \
.setWhiteList(["gastroenteritis"])
pipeline = Pipeline(stages=[
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter])
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(chunk)").show(truncate=False)
+---------------------------------------------------------------------------------+
|col |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []} |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []} |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []} |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []} |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []} |
+---------------------------------------------------------------------------------+
result.selectExpr("explode(filtered)").show(truncate=False)
+-------------------------------------------------------------------+
|col |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []} |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
from johnsnowlabs import *
# Filtering POS tags
# First pipeline stages to extract the POS tags are defined
docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
posTagger = nlp.PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
chunker = nlp.Chunker() \
.setInputCols(["pos", "sentence"]) \
.setOutputCol("chunk") \
.setRegexParsers(["(<NN>)+"])
# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = finance.ChunkFilterer() \
.setInputCols(["sentence","chunk"]) \
.setOutputCol("filtered") \
.setCriteria("isin") \
.setWhiteList(["gastroenteritis"])
pipeline = Pipeline(stages=[
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter])
result = pipeline.fit(data).transform(data)
from johnsnowlabs import *
# Filtering POS tags
# First pipeline stages to extract the POS tags are defined
docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
posTagger = nlp.PerceptronModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("pos")
chunker = nlp.Chunker() \
.setInputCols(["pos", "sentence"]) \
.setOutputCol("chunk") \
.setRegexParsers(["(<NN>)+"])
# Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
chunkerFilter = legal.ChunkFilterer() \
.setInputCols(["sentence","chunk"]) \
.setOutputCol("filtered") \
.setCriteria("isin") \
.setWhiteList(["gastroenteritis"])
pipeline = Pipeline(stages=[
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter])
result = pipeline.fit(data).transform(data)
from johnsnowlabs import *
// Filtering POS tags
// First pipeline stages to extract the POS tags are defined
val data = Seq("Has a past history of gastroenteritis and stomach pain, however patient ...").toDF("text")
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val posTagger = nlp.PerceptronModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val chunker = new nlp.Chunker()
.setInputCols(Array("pos", "sentence"))
.setOutputCol("chunk")
.setRegexParsers(Array("(<NN>)+"))
// Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
val chunkerFilter = new medical.ChunkFilterer()
.setInputCols(Array("sentence","chunk"))
.setOutputCol("filtered")
.setCriteria("isin")
.setWhiteList("gastroenteritis")
val pipeline = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter))
result.selectExpr("explode(chunk)").show(truncate=false)
+---------------------------------------------------------------------------------+
|col |
+---------------------------------------------------------------------------------+
|{chunk, 11, 17, history, {sentence -> 0, chunk -> 0}, []} |
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []} |
|{chunk, 42, 53, stomach pain, {sentence -> 0, chunk -> 2}, []} |
|{chunk, 64, 70, patient, {sentence -> 0, chunk -> 3}, []} |
|{chunk, 81, 110, stomach pain now.We don't care, {sentence -> 0, chunk -> 4}, []}|
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []} |
+---------------------------------------------------------------------------------+
result.selectExpr("explode(filtered)").show(truncate=false)
+-------------------------------------------------------------------+
|col |
+-------------------------------------------------------------------+
|{chunk, 22, 36, gastroenteritis, {sentence -> 0, chunk -> 1}, []} |
|{chunk, 118, 132, gastroenteritis, {sentence -> 0, chunk -> 5}, []}|
+-------------------------------------------------------------------+
from johnsnowlabs import *
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val posTagger = nlp.PerceptronModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val chunker = new nlp.Chunker()
.setInputCols(Array("pos", "sentence"))
.setOutputCol("chunk")
.setRegexParsers(Array("(<NN>)+"))
// Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
val chunkerFilter = new finance.ChunkFilterer()
.setInputCols(Array("sentence","chunk"))
.setOutputCol("filtered")
.setCriteria("isin")
.setWhiteList("gastroenteritis")
val pipeline = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter))
from johnsnowlabs import *
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val posTagger = nlp.PerceptronModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
val chunker = new nlp.Chunker()
.setInputCols(Array("pos", "sentence"))
.setOutputCol("chunk")
.setRegexParsers(Array("(<NN>)+"))
// Then the chunks can be filtered via a white list. Here only terms with "gastroenteritis" remain.
val chunkerFilter = new legal.ChunkFilterer()
.setInputCols(Array("sentence","chunk"))
.setOutputCol("filtered")
.setCriteria("isin")
.setWhiteList("gastroenteritis")
val pipeline = new Pipeline().setStages(Array(
docAssembler,
sentenceDetector,
tokenizer,
posTagger,
chunker,
chunkerFilter))
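The examples above filter with the default white-list criteria (isin). As noted in the description, filtering can also be done with regular expressions; a minimal sketch based on the medical Python example above, assuming the regex parameter accepts a list of patterns:
# Minimal sketch: filter chunks by regex instead of a white list (pattern is illustrative)
chunkerFilterRegex = medical.ChunkFilterer() \
    .setInputCols(["sentence", "chunk"]) \
    .setOutputCol("filtered") \
    .setCriteria("regex") \
    .setRegex(["gastro.*"])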
ChunkKeyPhraseExtraction
Chunk KeyPhrase Extraction uses Bert Sentence Embeddings to determine the most relevant key phrases describing a text. The input to the model consists of chunk annotations and a sentence or document annotation. The model compares the chunks against the corresponding sentences/documents and selects the chunks which are most representative of the broader text context (i.e. the document or the sentence they belong to). The key phrase candidates (i.e. the input chunks) can be generated in various ways, e.g. by NGramGenerator, TextMatcher or NerConverter. The model operates either at sentence level (selecting the most descriptive chunks from the sentence they belong to) or at document level. In the latter case, the key phrases are selected to represent all the input document annotations.
This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. Available models can be found at the Models Hub.
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: CHUNK
Python API: ChunkKeyPhraseExtraction | Scala API: ChunkKeyPhraseExtraction |
Show Example
from johnsnowlabs import *
documenter = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentencer = nlp.SentenceDetector() \
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("tokens") \
embeddings = nlp.WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "tokens"]) \
.setOutputCol("embeddings")
ner_tagger = medical.NerModel() \
.pretrained("ner_jsl_slim", "en", "clinical/models") \
.setInputCols(["sentences", "tokens", "embeddings"]) \
.setOutputCol("ner_tags")
ner_converter = nlp.NerConverter()\
.setInputCols("sentences", "tokens", "ner_tags")\
.setOutputCol("ner_chunks")
key_phrase_extractor = medical.ChunkKeyPhraseExtraction\
.pretrained()\
.setTopN(1)\
.setDocumentLevelProcessing(False)\
.setDivergence(0.4)\
.setInputCols(["sentences", "ner_chunks"])\
.setOutputCol("ner_chunk_key_phrases")
pipeline = sparknlp.base.Pipeline() \
.setStages([documenter, sentencer, tokenizer, embeddings, ner_tagger, ner_converter, key_phrase_extractor])
data = spark.createDataFrame([["Her Diabetes has become type 2 in the last year with her Diabetes.He complains of swelling in his right forearm."]]).toDF("text")
results = pipeline.fit(data).transform(data)
results\
.selectExpr("explode(ner_chunk_key_phrases) AS key_phrase")\
.selectExpr(
"key_phrase.result",
"key_phrase.metadata.entity",
"key_phrase.metadata.DocumentSimilarity",
"key_phrase.metadata.MMRScore")\
.show(truncate=False)
+-----------------------------+------------------+-------------------+
|result |DocumentSimilarity|MMRScore |
+-----------------------------+------------------+-------------------+
|gestational diabetes mellitus|0.7391447825527298|0.44348688715422274|
|28-year-old |0.4366776288430703|0.13577881610104517|
|type two diabetes mellitus |0.7323921930094919|0.085800103824974 |
+-----------------------------+------------------+-------------------+
from johnsnowlabs import *
documenter = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentencer = nlp.SentenceDetector() \
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("tokens")
embeddings = nlp.WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "tokens"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols("sentences", "tokens", "ner")\
.setOutputCol("ner_chunks")
key_phrase_extractor = finance.ChunkKeyPhraseExtraction\
.pretrained()\
.setTopN(1)\
.setDocumentLevelProcessing(False)\
.setDivergence(0.4)\
.setInputCols(["sentences", "ner_chunks"])\
.setOutputCol("ner_chunk_key_phrases")
pipeline = sparknlp.base.Pipeline() \
.setStages([documenter, sentencer, tokenizer, embeddings, ner_model, ner_converter, key_phrase_extractor])
from johnsnowlabs import *
documenter = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentencer = nlp.SentenceDetector() \
.setInputCols(["document"])\
.setOutputCol("sentences")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("tokens") \
embeddings = nlp.WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "tokens"]) \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols("sentences", "tokens", "ner")\
.setOutputCol("ner_chunks")
key_phrase_extractor = legal.ChunkKeyPhraseExtraction\
.pretrained()\
.setTopN(1)\
.setDocumentLevelProcessing(False)\
.setDivergence(0.4)\
.setInputCols(["sentences", "ner_chunks"])\
.setOutputCol("ner_chunk_key_phrases")
pipeline = sparknlp.base.Pipeline() \
.setStages([documenter, sentencer, tokenizer, embeddings, ner_model, ner_converter, key_phrase_extractor])
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("tokens")
val stopWordsCleaner = nlp.StopWordsCleaner.pretrained()
.setInputCols("tokens")
.setOutputCol("clean_tokens")
.setCaseSensitive(false)
val nGrams = new nlp.NGramGenerator()
.setInputCols(Array("clean_tokens"))
.setOutputCol("ngrams")
.setN(3)
val chunkKeyPhraseExtractor = medical.ChunkKeyPhraseExtraction
.pretrained()
.setTopN(2)
.setDivergence(0.7f)
.setInputCols(Array("document", "ngrams"))
.setOutputCol("key_phrases")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
stopWordsCleaner,
nGrams,
chunkKeyPhraseExtractor))
val sampleText = "Her Diabetes has become type 2 in the last year with her Diabetes." +
" He complains of swelling in his right forearm."
val testDataset = Seq("").toDS.toDF("text")
val result = pipeline.fit(emptyDataset).transform(testDataset)
result
.selectExpr("explode(key_phrases) AS key_phrase")
.selectExpr(
"key_phrase.result",
"key_phrase.metadata.DocumentSimilarity",
"key_phrase.metadata.MMRScore")
.show(truncate=false)
+--------------------------+-------------------+------------------+
|result |DocumentSimilarity |MMRScore |
+--------------------------+-------------------+------------------+
|complains swelling forearm|0.6325718954229369 |0.1897715761677257|
|type 2 year |0.40181028931546364|-0.189501077108947|
+--------------------------+-------------------+------------------+
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("tokens")
val stopWordsCleaner = nlp.StopWordsCleaner.pretrained()
.setInputCols("tokens")
.setOutputCol("clean_tokens")
.setCaseSensitive(false)
val nGrams = new nlp.NGramGenerator()
.setInputCols(Array("clean_tokens"))
.setOutputCol("ngrams")
.setN(3)
val chunkKeyPhraseExtractor = finance.ChunkKeyPhraseExtraction
.pretrained()
.setTopN(2)
.setDivergence(0.7f)
.setInputCols(Array("document", "ngrams"))
.setOutputCol("key_phrases")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
stopWordsCleaner,
nGrams,
chunkKeyPhraseExtractor))
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("tokens")
val stopWordsCleaner = nlp.StopWordsCleaner.pretrained()
.setInputCols("tokens")
.setOutputCol("clean_tokens")
.setCaseSensitive(false)
val nGrams = new nlp.NGramGenerator()
.setInputCols(Array("clean_tokens"))
.setOutputCol("ngrams")
.setN(3)
val chunkKeyPhraseExtractor = legal.ChunkKeyPhraseExtraction
.pretrained()
.setTopN(2)
.setDivergence(0.7f)
.setInputCols(Array("document", "ngrams"))
.setOutputCol("key_phrases")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
stopWordsCleaner,
nGrams,
chunkKeyPhraseExtractor))
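As mentioned in the description, the extractor can also operate at document level, selecting key phrases that represent the whole document rather than individual sentences. A minimal sketch of that configuration, reusing the columns from the medical Python pipeline above:
# Minimal sketch: document-level key phrase selection
key_phrase_extractor_doc = medical.ChunkKeyPhraseExtraction \
    .pretrained() \
    .setTopN(3) \
    .setDocumentLevelProcessing(True) \
    .setInputCols(["document", "ner_chunks"]) \
    .setOutputCol("doc_key_phrases")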
ChunkMapper
We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.
This is the AnnotatorModel of the ChunkMapper, which can be used to access pretrained models with the .pretrained() or .load() methods. To train a new model, check the documentation of the ChunkMapperApproach annotator.
The annotator also allows fuzzy matching, which can take into consideration parts of the tokens that can map even when the word order is different, char ngrams that can map even when there are typos, and fuzzy distance metrics (Jaccard, Levenshtein, etc.).
Example usage and more details can be found on Spark NLP Workshop repository accessible in GitHub, for example the notebook Healthcare Chunk Mapping.
Input Annotator Types: CHUNK
Output Annotator Type: LABEL_DEPENDENCY
Python API: ChunkMapperModel | Scala API: ChunkMapperModel |
Show Example
# Use `rxnorm_mapper` pretrained model to map entities with their corresponding RxNorm codes.
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("ner_chunk")
chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("rxnorm")\
.setRels(["rxnorm_code"])
mapper_pipeline = Pipeline().setStages([document_assembler, chunkerMapper])
empty_df = spark.createDataFrame([['']]).toDF('text')
mapper_model = mapper_pipeline.fit(empty_df)
mapper_lp = LightPipeline(mapper_model)
mapper_lp.fullAnnotate("metformin")
[{'ner_chunk': [Annotation(document, 0, 8, metformin, {})],
'rxnorm': [Annotation(labeled_dependency, 0, 8, 6809, {'entity': 'metformin', 'relation': 'rxnorm_code', 'all_relations': ''})]}]
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("ner_chunk")
val chunkerMapper = ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")
    .setInputCols(Array("ner_chunk"))
    .setOutputCol("rxnorm")
    .setRels(Array("rxnorm_code"))
val mapper_pipeline = new Pipeline().setStages(Array(document_assembler, chunkerMapper))
val empty_df = Seq("").toDF("text")
val mapper_model = mapper_pipeline.fit(empty_df)
val mapper_lp = new LightPipeline(mapper_model)
mapper_lp.fullAnnotate("metformin")
[{'ner_chunk': [Annotation(document, 0, 8, metformin, {})],
'rxnorm': [Annotation(labeled_dependency, 0, 8, 6809, {'entity': 'metformin', 'relation': 'rxnorm_code', 'all_relations': ''})]}]
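Besides .pretrained(), a mapper that was trained and saved locally can be loaded with .load(). A minimal sketch; the path is illustrative only:
# Minimal sketch: load a locally saved ChunkMapper model instead of a pretrained one
local_mapper = ChunkMapperModel.load("/path/to/saved_chunkmapper_model") \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("mappings")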
We can use ChunkMapper to map entities with their associated code/reference based on pre-defined dictionaries.
This is the AnnotatorApproach of the ChunkMapper, which can be used to train ChunkMapper models by giving a custom mapping dictionary. To use pretrained models, check the documentation of the ChunkMapperModel annotator.
The annotator also allows fuzzy matching, which can take into consideration parts of the tokens that can map even when the word order is different, char ngrams that can map even when there are typos, and fuzzy distance metrics (Jaccard, Levenshtein, etc.).
Example usage and more details can be found on Spark NLP Workshop repository accessible in GitHub, for example the notebook Healthcare Chunk Mapping.
Input Annotator Types: CHUNK
Output Annotator Type: LABEL_DEPENDENCY
Python API: ChunkMapperApproach | Scala API: ChunkMapperApproach |
Show Example
# First, create a dictionary in JSON format following this schema:
import json
data_set= {
"mappings": [
{
"key": "metformin",
"relations": [
{
"key": "action",
"values" : ["hypoglycemic", "Drugs Used In Diabetes"]
},
{
"key": "treatment",
"values" : ["diabetes", "t2dm"]
}]
}]
}
with open('sample_drug.json', 'w', encoding='utf-8') as f:
json.dump(data_set, f, ensure_ascii=False, indent=4)
# Create a pipeline
document_assembler = DocumentAssembler()\
.setInputCol('text')\
.setOutputCol('document')
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols("sentence")\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
#NER model to detect drug in the text
clinical_ner = MedicalNerModel.pretrained("ner_posology_small","en","clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")\
.setLabelCasing("upper")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setWhiteList(["DRUG"])
chunkerMapper = ChunkMapperApproach()\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setDictionary("sample_drug.json")\
.setRels(["action"]) #or treatment
pipeline = Pipeline(
stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
chunkerMapper,
]
)
# Train the model
text = ["The patient was given 1 unit of metformin daily."]
test_data = spark.createDataFrame([text]).toDF("text")
model = pipeline.fit(test_data)
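After fitting, the trained mapper can be queried the same way as the pretrained model above, for example through a LightPipeline. A minimal sketch:
# Minimal sketch: query the freshly trained mapper
lp = LightPipeline(model)
annotations = lp.fullAnnotate("The patient was given 1 unit of metformin daily.")[0]
annotations["mappings"]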
ChunkMapperFilterer
ChunkMapperFilterer is an annotator to be used after ChunkMapper that allows filtering chunks based on the results of the mapping, whether it was successful or failed.
Example usage and more details can be found on Spark NLP Workshop repository accessible in GitHub, for example the notebook Healthcare Chunk Mapping.
Input Annotator Types: CHUNK, LABEL_DEPENDENCY
Output Annotator Type: CHUNK
Python API: ChunkMapperFilterer | Scala API: ChunkMapperFilterer |
Show Example
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = (
SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
)
tokenizer = Tokenizer().setInputCols("sentence").setOutputCol("token")
word_embeddings = (
WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(["sentence", "token"])
.setOutputCol("embeddings")
)
ner_model = (
MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models")
.setInputCols(["sentence", "token", "embeddings"])
.setOutputCol("ner")
)
ner_converter = (
NerConverter().setInputCols("sentence", "token", "ner").setOutputCol("chunk")
)
chunkerMapper = (
ChunkMapperModel.pretrained("rxnorm_mapper", "en", "clinical/models")
.setInputCols(["chunk"])
.setOutputCol("RxNorm_Mapper")
.setRel("rxnorm_code")
)
cfModel = (
ChunkMapperFilterer()
.setInputCols(["chunk", "RxNorm_Mapper"])
.setOutputCol("chunks_fail")
.setReturnCriteria("fail")
)
chunk2doc = Chunk2Doc().setInputCols("chunks_fail").setOutputCol("doc_chunk")
sbert_embedder = (
BertSentenceEmbeddings.pretrained(
"sbiobert_base_cased_mli", "en", "clinical/models"
)
.setInputCols(["doc_chunk"])
.setOutputCol("sentence_embeddings")
.setCaseSensitive(False)
)
resolver = (
SentenceEntityResolverModel.pretrained(
"sbiobertresolve_rxnorm_augmented", "en", "clinical/models"
)
.setInputCols(["chunks_fail", "sentence_embeddings"])
.setOutputCol("resolver_code")
.setDistanceFunction("EUCLIDEAN")
)
resolverMerger = (
ResolverMerger()
.setInputCols(["resolver_code", "RxNorm_Mapper"])
.setOutputCol("RxNorm")
)
mapper_pipeline = Pipeline(
stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
        chunkerMapper,
cfModel,
chunk2doc,
sbert_embedder,
resolver,
resolverMerger,
]
)
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = mapper_pipeline.fit(empty_data)
samples = [
["The patient was given Adapin 10 MG, coumadn 5 mg"],
["The patient was given Avandia 4 mg, Tegretol, zitiga"],
]
result = model.transform(spark.createDataFrame(samples).toDF("text"))
result.selectExpr(
"chunk.result as chunk",
"RxNorm_Mapper.result as RxNorm_Mapper",
"chunks_fail.result as chunks_fail",
"resolver_code.result as resolver_code",
"RxNorm.result as RxNorm",
).show(truncate=False)
+--------------------------------+----------------------+--------------+-------------+------------------------+
|chunk                           |RxNorm_Mapper         |chunks_fail   |resolver_code|RxNorm                  |
+--------------------------------+----------------------+--------------+-------------+------------------------+
|[Adapin 10 MG, coumadn 5 mg]    |[1000049, NONE]       |[coumadn 5 mg]|[200883]     |[1000049, 200883]       |
|[Avandia 4 mg, Tegretol, zitiga]|[261242, 203029, NONE]|[zitiga]      |[220989]     |[261242, 203029, 220989]|
+--------------------------------+----------------------+--------------+-------------+------------------------+
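The pipeline above keeps the chunks whose mapping failed so they can be sent to the resolver. Setting the return criteria to "success" instead keeps only the chunks that were mapped; a minimal variant of the configuration above:
# Minimal sketch: keep only successfully mapped chunks
cfModel_success = ChunkMapperFilterer() \
    .setInputCols(["chunk", "RxNorm_Mapper"]) \
    .setOutputCol("chunks_success") \
    .setReturnCriteria("success")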
ChunkMerge
Merges entities coming from different CHUNK annotations.
Input Annotator Types: CHUNK, CHUNK
Output Annotator Type: CHUNK
Python API: ChunkMergeModel | Scala API: ChunkMergeModel |
Merges two chunk columns coming from two annotators (NER, ContextualParser or any other annotator producing chunks). The merge is performed by selecting one chunk from one of the columns according to certain criteria, based on the chunk indices in the source document (chunks with longer length and higher information content are kept from each source). Labels can be changed with setReplaceDictResource (see the sketch after the examples below).
Input Annotator Types: CHUNK, CHUNK
Output Annotator Type: CHUNK
Python API: ChunkMergeApproach | Scala API: ChunkMergeApproach |
Show Example
from johnsnowlabs import *
# Define a pipeline with 2 different NER models with a ChunkMergeApproach at the end
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
pipeline = Pipeline(stages=[
nlp.DocumentAssembler().setInputCol("text").setOutputCol("document"),
nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence"),
nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token"),
    nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(["sentence", "token"]).setOutputCol("embs"),
medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embs"]).setOutputCol("jsl_ner"),
nlp.NerConverter().setInputCols(["sentence", "token", "jsl_ner"]).setOutputCol("jsl_ner_chunk"),
medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embs"]).setOutputCol("bionlp_ner"),
nlp.NerConverter().setInputCols(["sentence", "token", "bionlp_ner"]) \
.setOutputCol("bionlp_ner_chunk"),
medical.ChunkMergeApproach().setInputCols(["jsl_ner_chunk", "bionlp_ner_chunk"]).setOutputCol("merged_chunk")
])
# Show results
result = pipeline.fit(data).transform(data).cache()
result.selectExpr("explode(merged_chunk) as a") \
.selectExpr("a.begin","a.end","a.result as chunk","a.metadata.entity as entity") \
.show(5, False)
+-----+---+-----------+---------+
|begin|end|chunk |entity |
+-----+---+-----------+---------+
|5 |15 |63-year-old|Age |
|17 |19 |man |Gender |
|64 |72 |recurrent |Modifier |
|98 |107|cellulitis |Diagnosis|
|110 |119|pneumonias |Diagnosis|
+-----+---+-----------+---------+
from johnsnowlabs import *
data = spark.createDataFrame([["Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"]]).toDF("text")
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("bert_embeddings")
fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
#.setLabelCasing("upper")
ner_converter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"
ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
.setInputCols(["sentence", "token", "bert_embeddings"]) \
.setOutputCol("ner_finner")
#.setLabelCasing("upper")
ner_converter_finner = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner_finner"]) \
.setOutputCol("ner_finner_chunk") \
.setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER
chunk_merge = finance.ChunkMergeApproach()\
.setInputCols("ner_finner_chunk", "ner_chunk")\
.setOutputCol("deid_merged_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
bert_embeddings,
fin_ner,
ner_converter,
ner_finner,
ner_converter_finner,
chunk_merge])
# Show results
from pyspark.sql import functions as F
result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result,
result.deid_merged_chunk.metadata)).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+---------------------+---------+
|chunk |ner_label|
+---------------------+---------+
|Jeffrey Preston Bezos|PERSON |
|founder |ROLE |
|CEO |ROLE |
|Amazon |PARTY |
+---------------------+---------+
from johnsnowlabs import *
data = spark.createDataFrame([["ENTIRE AGREEMENT. This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.
2THEMART.COM, INC.: I-ESCROW, INC.:
By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99 Date: 6/11/99 "]]).toDF("text")
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
#.setLabelCasing("upper")
ner_converter = legal.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ALIAS": "PARTY"})
ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner_signers")
#.setLabelCasing("upper")
ner_converter_signers = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner_signers"]) \
.setOutputCol("ner_signer_chunk")
chunk_merge = legal.ChunkMergeApproach()\
.setInputCols("ner_signer_chunk", "ner_chunk")\
.setOutputCol("deid_merged_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legal_ner,
ner_converter,
ner_signers,
ner_converter_signers,
chunk_merge])
# Show results
from pyspark.sql import functions as F
result = nlpPipeline.fit(data).transform(data).cache()
result.select(F.explode(F.arrays_zip(result.deid_merged_chunk.result,
result.deid_merged_chunk.metadata)).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)
+-----------------------+--------------+
|chunk |ner_label |
+-----------------------+--------------+
|ENTIRE AGREEMENT |DOC |
|INC |PARTY |
|J. Magliarditi |SIGNING_PERSON|
|Bajaj |SIGNING_PERSON|
|Dominic J. Magliarditi |SIGNING_PERSON|
|Sanjay Bajaj |SIGNING_PERSON|
|President |SIGNING_TITLE |
|VP Business Development|SIGNING_TITLE |
+-----------------------+--------------+
from johnsnowlabs import *
// Define a pipeline with 2 different NER models with a ChunkMergeApproach at the end
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val pipeline = new Pipeline().setStages(Array(
new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document"),
    new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence"),
    new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token"),
nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence","token")).setOutputCol("embs"),
medical.NerModel.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embs")).setOutputCol("jsl_ner"),
new nlp.NerConverter().setInputCols(Array("sentence", "token", "jsl_ner")).setOutputCol("jsl_ner_chunk"),
medical.NerModel.pretrained("ner_bionlp", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embs")).setOutputCol("bionlp_ner"),
new nlp.NerConverter().setInputCols(Array("sentence", "token", "bionlp_ner"))
.setOutputCol("bionlp_ner_chunk"),
new medical.ChunkMergeApproach().setInputCols(Array("jsl_ner_chunk", "bionlp_ner_chunk")).setOutputCol("merged_chunk")
))
// Show results
val result = pipeline.fit(data).transform(data).cache()
result.selectExpr("explode(merged_chunk) as a")
.selectExpr("a.begin","a.end","a.result as chunk","a.metadata.entity as entity")
.show(5, false)
+-----+---+-----------+---------+
|begin|end|chunk |entity |
+-----+---+-----------+---------+
|5 |15 |63-year-old|Age |
|17 |19 |man |Gender |
|64 |72 |recurrent |Modifier |
|98 |107|cellulitis |Diagnosis|
|110 |119|pneumonias |Diagnosis|
+-----+---+-----------+---------+
from johnsnowlabs import *
val data = Seq(("Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon")).toDF("text")
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCol("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCol("sentence")
.setOutputCol("token")
val embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("bert_embeddings")
val fin_ner = finance.NerModel.pretrained("finner_deid", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")
    //.setLabelCasing("upper")
val ner_converter = new finance.NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels(Map("ORG" -> "PARTY")) // Replace "ORG" entity as "PARTY"
val ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")
    .setInputCols(Array("sentence", "token", "bert_embeddings"))
    .setOutputCol("ner_finner")
    //.setLabelCasing("upper")
val ner_converter_finner = new nlp.NerConverter()
    .setInputCols(Array("sentence", "token", "ner_finner"))
    .setOutputCol("ner_finner_chunk")
    .setWhiteList(Array("ROLE")) // Just use "ROLE" entity from this NER
val chunk_merge = new finance.ChunkMergeApproach()
.setInputCols(Array("ner_finner_chunk", "ner_chunk"))
.setOutputCol("deid_merged_chunk")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
bert_embeddings,
fin_ner,
ner_converter,
ner_finner,
ner_converter_finner,
chunk_merge))
val model = nlpPipeline.fit(data)
from johnsnowlabs import *
val data = Seq(("ENTIRE AGREEMENT. This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.
2THEMART.COM, INC.: I-ESCROW, INC.:
By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99 Date: 6/11/99 ")).toDF("text")
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCol("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCol("sentence")
.setOutputCol("token")
val embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
    //.setLabelCasing("upper")
val ner_converter = new legal.NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    .setReplaceLabels(Map("ALIAS" -> "PARTY"))
val ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_signers")
    //.setLabelCasing("upper")
val ner_converter_signers = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner_signers"))
.setOutputCol("ner_signer_chunk")
val chunk_merge = new legal.ChunkMergeApproach()
.setInputCols(Array("ner_signer_chunk", "ner_chunk"))
.setOutputCol("deid_merged_chunk")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legal_ner,
ner_converter,
ner_signers,
ner_converter_signers,
chunk_merge))
val model = nlpPipeline.fit(data)
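As mentioned in the description, labels can also be harmonized while merging via setReplaceDictResource. A hedged sketch, reusing the chunk columns from the medical example above; the CSV layout and reader options are assumptions, not a confirmed file format:
# Minimal sketch: merge chunks and rename labels with a replacement dictionary
# replace_dict.csv (assumed layout: target label, label to replace), e.g.:
#   PARTY,ORG
#   PARTY,ALIAS
chunk_merge_replaced = medical.ChunkMergeApproach() \
    .setInputCols(["jsl_ner_chunk", "bionlp_ner_chunk"]) \
    .setOutputCol("merged_chunk") \
    .setReplaceDictResource("replace_dict.csv", "text", {"delimiter": ","})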
ChunkSentenceSplitter
The ChunkSentenceSplitter annotator can split documents into chunks according to separators given as CHUNK columns. It is useful when you need to apply different models or analyses to different sections of your document (for example, different headers, clauses, items, etc.). The given separator chunk can be the output from, for example, RegexMatcher or NerModel.
For detailed usage of this annotator, visit this notebook from our Spark NLP Workshop.
Input Annotator Types: DOCUMENT, CHUNK
Output Annotator Type: DOCUMENT
Python API: ChunkSentenceSplitter | Scala API: ChunkSentenceSplitter |
Show Example
# Defining the pipeline
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
tokenClassifier = (
MedicalBertForTokenClassifier.pretrained(
"bert_token_classifier_ner_jsl_slim", "en", "clinical/models"
)
.setInputCols("token", "document")
.setOutputCol("ner")
.setCaseSensitive(True)
)
ner_converter = (
NerConverter()
.setInputCols(["document", "token", "ner"])
.setOutputCol("ner_chunk")
.setWhiteList(["Header"])
)
chunkSentenceSplitter = (
ChunkSentenceSplitter()
.setInputCols("document", "ner_chunk")
.setOutputCol("paragraphs")
.setGroupBySentences(False)
)
pipeline = Pipeline(
stages=[
documentAssembler,
tokenizer,
tokenClassifier,
ner_converter,
chunkSentenceSplitter,
]
)
empty_df = spark.createDataFrame([[""]]).toDF("text")
pipeline_model = pipeline.fit(empty_df)
sentences = [
[
"""ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelioma.
PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma.
REVIEW OF SYSTEMS Right pleural effusion, firm nodules, diffuse scattered throughout the right pleura and diaphragmatic surface.
"""
]
]
df = spark.createDataFrame(sentences).toDF("text")
paragraphs = pipeline_model.transform(df)
paragraphs.selectExpr("explode(paragraphs) as result").selectExpr("result.result","result.metadata.entity", "result.metadata.splitter_chunk").show(truncate=80)
+--------------------------------------------------------------------------------+------+-------------------+
| result|entity| splitter_chunk|
+--------------------------------------------------------------------------------+------+-------------------+
|ADMISSION DIAGNOSIS Right pleural effusion and suspected malignant mesothelio...|Header|ADMISSION DIAGNOSIS|
|PRINCIPAL DIAGNOSIS Right pleural effusion, suspected malignant mesothelioma....|Header|PRINCIPAL DIAGNOSIS|
|REVIEW OF SYSTEMS Right pleural effusion, firm nodules, diffuse scattered thr...|Header| REVIEW OF SYSTEMS|
+--------------------------------------------------------------------------------+------+-------------------+
val data = Seq(text,text).toDS.toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("doc")
val regexMatcher = new RegexMatcher().setInputCols("doc").setOutputCol("chunks").setExternalRules("src/test/resources/chunker/title_regex.txt",",")
val chunkSentenceSplitter = new ChunkSentenceSplitter().setInputCols("chunks","doc").setOutputCol("paragraphs")
val pipeline = new Pipeline().setStages(Array(documentAssembler,regexMatcher,chunkSentenceSplitter))
val result = pipeline.fit(data).transform(data).select("paragraphs")
result.show(truncate = false)
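The Scala example above reads its separator rules from src/test/resources/chunker/title_regex.txt, which is not shown here. As a hypothetical sketch (the pattern and identifier are illustrative, not the shipped file), a rules file compatible with the "," delimiter passed to setExternalRules holds one regex and one identifier per line, for example:
[A-Z]{2,}( [A-Z]{2,})*,Header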
ContextualParser
Extracts entities from a document based on user-defined rules. Rule matching is based on a RegexMatcher defined in a JSON file. In this file, you define the regex you want to match along with the information that will be output in the metadata field. To instantiate a model, see ContextualParserApproach and its accompanying example.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: ContextualParserModel | Scala API: ContextualParserModel |
Creates a model that extracts entities from a document based on user-defined rules.
Rule matching is based on a RegexMatcher defined in a JSON file, which is set through the parameter setJsonPath(). In this JSON file, you define the regex you want to match along with the information that will be output in the metadata field. Additionally, a dictionary can be provided with setDictionary to map extracted entities to a unified representation. The first column of the dictionary file should be the representation, with the following columns listing the possible matches.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: ContextualParserApproach | Scala API: ContextualParserApproach |
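As a hedged illustration of the dictionary file described above (the values are hypothetical and assume a comma-separated layout), each line starts with the unified representation followed by its possible matches, which setDictionary then uses to normalize the extracted chunks:
female,F,f,Female,woman
male,M,m,Male,man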
Show Example
from johnsnowlabs import *
# An example JSON file `regex_token.json` can look like this:
#
# {
# "entity": "Stage",
# "ruleScope": "sentence",
# "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
# "matchScope": "token"
# }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
# Define the parser (json file needs to be provided)
data = spark.createDataFrame([["A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... "]]).toDF("text")
contextualParser = medical.ContextualParserApproach() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("entity") \
.setJsonPath("/path/to/regex_token.json") \
.setCaseSensitive(True) \
.setContextMatch(False)
pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
])
result = pipeline.fit(data).transform(data)
# Show Results
result.selectExpr("explode(entity)").show(5, truncate=False)
+-------------------------------------------------------------------------------------------------------------------------+
|col |
+-------------------------------------------------------------------------------------------------------------------------+
|{chunk, 32, 39, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []} |
|{chunk, 49, 50, T5, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []} |
|{chunk, 148, 156, cT4bcN2M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 1}, []}|
|{chunk, 189, 194, T?N3M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 2}, []} |
|{chunk, 316, 323, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 3}, []} |
+-------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import *
# An example JSON file `regex_token.json` can look like this:
#
# {
# "entity": "Stage",
# "ruleScope": "sentence",
# "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
# "matchScope": "token"
# }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
# Define the parser (json file needs to be provided)
contextualParser = finance.ContextualParserApproach() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("entity") \
.setJsonPath("/path/to/regex_token.json") \
.setCaseSensitive(True) \
.setContextMatch(False)
pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
])
from johnsnowlabs import *
# An example JSON file `regex_token.json` can look like this:
#
# {
# "entity": "Stage",
# "ruleScope": "sentence",
# "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
# "matchScope": "token"
# }
#
# Which means to extract the stage code on a sentence level.
# An example pipeline could then be defined like this
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
# Define the parser (json file needs to be provided)
contextualParser = legal.ContextualParserApproach() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("entity") \
.setJsonPath("/path/to/regex_token.json") \
.setCaseSensitive(True) \
.setContextMatch(False)
pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
])
from johnsnowlabs import *
// An example JSON file `regex_token.json` can look like this:
//
// {
// "entity": "Stage",
// "ruleScope": "sentence",
// "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
// "matchScope": "token"
// }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
// Define the parser (json file needs to be provided)
val data = Seq("A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... ").toDF("text")
val contextualParser = new medical.ContextualParserApproach()
.setInputCols(Array("sentence", "token"))
.setOutputCol("entity")
.setJsonPath("/path/to/regex_token.json")
.setCaseSensitive(true)
.setContextMatch(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
))
val result = pipeline.fit(data).transform(data)
// Show Results
//
// result.selectExpr("explode(entity)").show(5, truncate=false)
// +-------------------------------------------------------------------------------------------------------------------------+
// |col |
// +-------------------------------------------------------------------------------------------------------------------------+
// |{chunk, 32, 39, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []} |
// |{chunk, 49, 50, T5, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 0}, []} |
// |{chunk, 148, 156, cT4bcN2M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 1}, []}|
// |{chunk, 189, 194, T?N3M1, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 2}, []} |
// |{chunk, 316, 323, pT1bN0M0, {field -> Stage, normalized -> , confidenceValue -> 0.13, hits -> regex, sentence -> 3}, []} |
// +-------------------------------------------------------------------------------------------------------------------------+
//
from johnsnowlabs import *
// An example JSON file `regex_token.json` can look like this:
//
// {
// "entity": "Stage",
// "ruleScope": "sentence",
// "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
// "matchScope": "token"
// }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
// Define the parser (json file needs to be provided)
val data = Seq("A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... ").toDF("text")
val contextualParser = new finance.ContextualParserApproach()
.setInputCols(Array("sentence", "token"))
.setOutputCol("entity")
.setJsonPath("/path/to/regex_token.json")
.setCaseSensitive(true)
.setContextMatch(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
))
from johnsnowlabs import *
// An example JSON file `regex_token.json` can look like this:
//
// {
// "entity": "Stage",
// "ruleScope": "sentence",
// "regex": "[cpyrau]?[T][0-9X?][a-z^cpyrau]",
// "matchScope": "token"
// }
//
// Which means to extract the stage code on a sentence level.
// An example pipeline could then be defined like this
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
// Define the parser (json file needs to be provided)
val data = Seq("A patient has liver metastases pT1bN0M0 and the T5 primary site may be colon or... ").toDF("text")
val contextualParser = new legal.ContextualParserApproach()
.setInputCols(Array("sentence", "token"))
.setOutputCol("entity")
.setJsonPath("/path/to/regex_token.json")
.setCaseSensitive(true)
.setContextMatch(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
contextualParser
))
DateNormalizer
This annotator transforms date mentions to a common standard format: YYYY/MM/DD. It is useful when using data from different sources, sometimes from different countries, that have different formats to represent dates.
For relative dates (next year, past month, etc.), you can define an anchor date used to create the normalized date by setting the parameters anchorDateYear, anchorDateMonth, and anchorDateDay.
The resulting chunk date will contain metadata indicating whether the normalization was successful or not (True / False).
Input Annotator Types: CHUNK
Output Annotator Type: CHUNK
Python API: DateNormalizer | Scala API: DateNormalizer |
Show Example
from pyspark.sql.types import StringType
dates = [
"08/02/2018",
"11/2018",
"11/01/2018",
"12Mar2021",
"Jan 30, 2018",
"13.04.1999",
"3April 2020",
"next monday",
"today",
"next week",
]
df = spark.createDataFrame(dates, StringType()).toDF("original_date")
document_assembler = (
DocumentAssembler().setInputCol("original_date").setOutputCol("document")
)
doc2chunk = Doc2Chunk().setInputCols("document").setOutputCol("date_chunk")
date_normalizer = (
DateNormalizer()
.setInputCols("date_chunk")
.setOutputCol("date")
.setAnchorDateYear(2000)
.setAnchorDateMonth(3)
.setAnchorDateDay(15)
)
pipeline = Pipeline(stages=[document_assembler, doc2chunk, date_normalizer])
result = pipeline.fit(df).transform(df)
result.selectExpr(
"date.result as normalized_date",
"original_date",
"date.metadata[0].normalized as metadata",
).show()
+---------------+-------------+--------+
|normalized_date|original_date|metadata|
+---------------+-------------+--------+
| [2018/08/02]| 08/02/2018| true|
| [2018/11/DD]| 11/2018| true|
| [2018/11/01]| 11/01/2018| true|
| [2021/03/12]| 12Mar2021| true|
| [2018/01/30]| Jan 30, 2018| true|
| [1999/04/13]| 13.04.1999| true|
| [2020/04/03]| 3April 2020| true|
| [2000/03/20]| next monday| true|
| [2000/03/15]| today| true|
| [2000/03/22]| next week| true|
+---------------+-------------+--------+
val df = Seq(("08/02/2018"),("11/2018"),("11/01/2018"),("next monday"),("today"),("next week")).toDF("original_date")
val documentAssembler = new DocumentAssembler().setInputCol("original_date").setOutputCol("document")
val chunksDF = documentAssembler
.transform(df)
.mapAnnotationsCol[Seq[Annotation]]("document",
"chunk_date",
CHUNK,
(aa:Seq[Annotation]) =>
aa.map( ann => ann.copy(annotatorType = CHUNK)))
val dateNormalizerModel = new DateNormalizer()
.setInputCols("chunk_date")
.setOutputCol("date")
.setAnchorDateDay(15)
.setAnchorDateMonth(3)
.setAnchorDateYear(2000)
val dateDf = dateNormalizerModel.transform(chunksDF)
dateDf.select("chunk_date.result","original_date").show()
+-------------+-------------+
| result|original_date|
+-------------+-------------+
| [08/02/2018]| 08/02/2018|
| [11/2018]| 11/2018|
| [11/01/2018]| 11/01/2018|
|[next monday]| next monday|
| [today]| today|
| [next week]| next week|
+-------------+-------------+
DeIdentification
Deidentifies Input Annotations of types DOCUMENT, TOKEN and CHUNK, by either masking or obfuscating the given CHUNKS.
To create a configured DeIdentificationModel, please see the example of DeIdentification.
Input Annotator Types: DOCUMENT, TOKEN, CHUNK
Output Annotator Type: DOCUMENT
Python API: DeIdentificationModel | Scala API: DeIdentificationModel |
Show Example
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("bert_embeddings")
fin_ner = finance.NerModel.pretrained('finner_deid', "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
#.setLabelCasing("upper")
ner_converter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"
ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")\
.setInputCols(["sentence", "token", "bert_embeddings"]) \
.setOutputCol("ner_finner")
#.setLabelCasing("upper")
ner_converter_finner = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner_finner"]) \
.setOutputCol("ner_finner_chunk") \
.setWhiteList(['ROLE']) # Just use "ROLE" entity from this NER
chunk_merge = finance.ChunkMergeApproach()\
.setInputCols("ner_finner_chunk", "ner_chunk")\
.setOutputCol("deid_merged_chunk")
deidentification = finance.DeIdentification() \
.setInputCols(["sentence", "token", "deid_merged_chunk"]) \
.setOutputCol("deidentified") \
.setMode("mask")\
.setIgnoreRegex(True)
# Pipeline
data = spark.createDataFrame([
["Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon"]
]).toDF("text")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
bert_embeddings,
fin_ner,
ner_converter,
ner_finner,
ner_converter_finner,
chunk_merge,
deidentification])
result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
#.setLabelCasing("upper")
ner_converter = legal.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ALIAS": "PARTY"})
ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner_signers")
#.setLabelCasing("upper")
ner_converter_signers = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner_signers"]) \
.setOutputCol("ner_signer_chunk")
chunk_merge = legal.ChunkMergeApproach()\
.setInputCols("ner_signer_chunk", "ner_chunk")\
.setOutputCol("deid_merged_chunk")
deidentification = legal.DeIdentification() \
.setInputCols(["sentence", "token", "deid_merged_chunk"]) \
.setOutputCol("deidentified") \
.setMode("mask")\
.setIgnoreRegex(True)
# Pipeline
data = spark.createDataFrame([
["""ENTIRE AGREEMENT. This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.
2THEMART.COM, INC.: I-ESCROW, INC.:
By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99 Date: 6/11/99 """]
]).toDF("text")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legal_ner,
ner_converter,
ner_signers,
ner_converter_signers,
chunk_merge,
deidentification])
result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val bert_embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("bert_embeddings")
val fin_ner = finance.NerModel.pretrained("finner_deid", "en", "finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
//.setLabelCasing("upper")
val ner_converter = new finance.NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setReplaceLabels(Map("ORG" -> "PARTY")) // Replace "ORG" entity as "PARTY"
val ner_finner = finance.NerModel.pretrained("finner_org_per_role_date", "en", "finance/models")
.setInputCols(Array("sentence", "token", "bert_embeddings"))
.setOutputCol("ner_finner")
//.setLabelCasing("upper")
val ner_converter_finner = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner_finner"))
.setOutputCol("ner_finner_chunk")
.setWhiteList(Array("ROLE")) // Just use "ROLE" entity from this NER
val chunk_merge = new finance.ChunkMergeApproach()
.setInputCols(Array("ner_finner_chunk", "ner_chunk"))
.setOutputCol("deid_merged_chunk")
val deidentification = new finance.DeIdentification()
.setInputCols(Array("sentence", "token", "deid_merged_chunk"))
.setOutputCol("deidentified")
.setMode("mask")
.setIgnoreRegex(true)
// Pipeline
val data = Seq("Jeffrey Preston Bezos is an American entrepreneur, founder and CEO of Amazon").toDF("text")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
bert_embeddings,
fin_ner,
ner_converter,
ner_finner,
ner_converter_finner,
chunk_merge,
deidentification))
val result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
//.setLabelCasing("upper")
val ner_converter = new legal.NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setReplaceLabels(Map("ALIAS" -> "PARTY"))
val ner_signers = legal.NerModel.pretrained("legner_signers", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner_signers")
//.setLabelCasing("upper")
val ner_converter_signers = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner_signers"))
.setOutputCol("ner_signer_chunk")
val chunk_merge = new legal.ChunkMergeApproach()
.setInputCols(Array("ner_signer_chunk", "ner_chunk"))
.setOutputCol("deid_merged_chunk")
val deidentification = new legal.DeIdentification()
.setInputCols(Array("sentence", "token", "deid_merged_chunk"))
.setOutputCol("deidentified")
.setMode("mask")
.setIgnoreRegex(true)
// Pipeline
val data = Seq("""ENTIRE AGREEMENT. This Agreement contains the entire understanding of the parties hereto with respect to the transactions and matters contemplated hereby, supersedes all previous Agreements between i-Escrow and 2TheMart concerning the subject matter.
2THEMART.COM, INC.: I-ESCROW, INC.:
By:Dominic J. Magliarditi By:Sanjay Bajaj Name: Dominic J. Magliarditi Name: Sanjay Bajaj Title: President Title: VP Business Development Date: 6/21/99 Date: 6/11/99 """).toDF("text")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legal_ner,
ner_converter,
ner_signers,
ner_converter_signers,
chunk_merge,
deidentification))
val result = nlpPipeline.fit(data).transform(data)
Contains all the methods for training a DeIdentificationModel. This module can obfuscate or mask the entities that contain personal information. These can be set with a file of regex patterns with setRegexPatternsDictionary, where each line is a mapping of entity to regex.
DATE \d{4}
AID \d{6,7}
Additionally, obfuscation strings can be defined with setObfuscateRefFile, where each line is a mapping of string to entity. The format and separator can be specified with setRefFileFormat and setRefSep.
Dr. Gregory House#DOCTOR
01010101#MEDICALRECORD
Ideally this annotator works in conjunction with demographic Named Entity Recognizers that can be trained using TextMatchers, RegexMatchers, DateMatchers, NerCRFs or NerDLs.
Input Annotator Types: DOCUMENT, TOKEN, CHUNK
Output Annotator Type: DOCUMENT
Python API: DeIdentification | Scala API: DeIdentification |
Show Example
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(True)
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
# Ner entities
clinical_sensitive_entities = medical.NerModel \
.pretrained("ner_deid_enriched", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
nerConverter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
# Deidentification
deIdentification = (
    medical.DeIdentification()
    .setInputCols(["ner_chunk", "token", "sentence"])
    .setOutputCol("dei")
    # file with custom regex patterns for custom entities
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
    # file with custom obfuscator names for the entities
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
    .setRefFileFormat("csv")
    .setRefSep("#")
    .setMode("obfuscate")
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"])
    .setObfuscateDate(True)
    .setDateTag("DATE")
    .setDays(5)
    .setObfuscateRefSource("file")
)
# Pipeline
data = spark.createDataFrame([
["# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."]
]).toDF("text")
pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_sensitive_entities,
nerConverter,
deIdentification
])
result = pipeline.fit(data).transform(data)
# Show Results
result.select("dei.result").show(truncate = False)
+--------------------------------------------------------------------------------------------------+
|result |
+--------------------------------------------------------------------------------------------------+
|[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
+--------------------------------------------------------------------------------------------------+
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(True)
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
# Ner entities
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
nerConverter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_con")
# Deidentification
deIdentification = (
    finance.DeIdentification()
    .setInputCols(["ner_chunk", "token", "sentence"])
    .setOutputCol("dei")
    # file with custom regex patterns for custom entities
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
    # file with custom obfuscator names for the entities
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
    .setRefFileFormat("csv")
    .setRefSep("#")
    .setMode("obfuscate")
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"])
    .setObfuscateDate(True)
    .setDateTag("DATE")
    .setDays(5)
    .setObfuscateRefSource("file")
)
# Pipeline
pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
nerConverter,
deIdentification
])
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(True)
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel \
.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
# Ner entities
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
nerConverter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_con")
# Deidentification
deIdentification = (
    legal.DeIdentification()
    .setInputCols(["ner_chunk", "token", "sentence"])
    .setOutputCol("dei")
    # file with custom regex patterns for custom entities
    .setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
    # file with custom obfuscator names for the entities
    .setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
    .setRefFileFormat("csv")
    .setRefSep("#")
    .setMode("obfuscate")
    .setDateFormats(["MM/dd/yy", "yyyy-MM-dd"])
    .setObfuscateDate(True)
    .setDateTag("DATE")
    .setDays(5)
    .setObfuscateRefSource("file")
)
# Pipeline
pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
nerConverter,
deIdentification
])
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
.setUseAbbreviations(true)
val tokenizer = new nlp.Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
// Ner entities
val clinical_sensitive_entities = medical.NerModel.pretrained("ner_deid_enriched", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings")).setOutputCol("ner")
val nerConverter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_con")
// Deidentification
val deIdentification = new medical.DeIdentification()
.setInputCols(Array("ner_chunk", "token", "sentence"))
.setOutputCol("dei")
// file with custom regex patterns for custom entities
.setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
// file with custom obfuscator names for the entities
.setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
.setRefFileFormat("csv")
.setRefSep("#")
.setMode("obfuscate")
.setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
.setObfuscateDate(true)
.setDateTag("DATE")
.setDays(5)
.setObfuscateRefSource("file")
// Pipeline
val data = Seq(
"# 7194334 Date : 01/13/93 PCP : Oliveira , 25 years-old , Record date : 2079-11-09."
).toDF("text")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_sensitive_entities,
nerConverter,
deIdentification
))
val result = pipeline.fit(data).transform(data)
result.select("dei.result").show(truncate = false)
// Show Results
//
// result.select("dei.result").show(truncate = false)
// +--------------------------------------------------------------------------------------------------+
// |result |
// +--------------------------------------------------------------------------------------------------+
// |[# 01010101 Date : 01/18/93 PCP : Dr. Gregory House , <AGE> years-old , Record date : 2079-11-14.]|
// +--------------------------------------------------------------------------------------------------+
//
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
.setUseAbbreviations(true)
val tokenizer = new nlp.Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
// Ner entities
val ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_con")
// Deidentification
val deIdentification = new finance.DeIdentification()
.setInputCols(Array("ner_chunk", "token", "sentence"))
.setOutputCol("dei")
// file with custom regex patterns for custom entities
.setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
// file with custom obfuscator names for the entities
.setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
.setRefFileFormat("csv")
.setRefSep("#")
.setMode("obfuscate")
.setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
.setObfuscateDate(true)
.setDateTag("DATE")
.setDays(5)
.setObfuscateRefSource("file")
// Pipeline
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
nerConverter,
deIdentification
))
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols(Array("document"))
.setOutputCol("sentence")
.setUseAbbreviations(true)
val tokenizer = new nlp.Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
// Ner entities
val ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val nerConverter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_con")
// Deidentification
val deIdentification = new legal.DeIdentification()
.setInputCols(Array("ner_chunk", "token", "sentence"))
.setOutputCol("dei")
// file with custom regex patterns for custom entities
.setRegexPatternsDictionary("path/to/dic_regex_patterns_main_categories.txt")
// file with custom obfuscator names for the entities
.setObfuscateRefFile("path/to/obfuscate_fixed_entities.txt")
.setRefFileFormat("csv")
.setRefSep("#")
.setMode("obfuscate")
.setDateFormats(Array("MM/dd/yy","yyyy-MM-dd"))
.setObfuscateDate(true)
.setDateTag("DATE")
.setDays(5)
.setObfuscateRefSource("file")
// Pipeline
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
nerConverter,
deIdentification
))
Doc2ChunkInternal
Converts DOCUMENT, TOKEN typed annotations into CHUNK type with the contents of a chunkCol. Chunk text must be contained within the input DOCUMENT. May be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.
For more extended examples on document pre-processing see the Spark NLP Workshop.
Input Annotator Types: DOCUMENT, TOKEN
Output Annotator Type: CHUNK
Python API: Doc2ChunkInternal | Scala API: Doc2ChunkInternal |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
chunkAssembler = (
Doc2ChunkInternal()
.setInputCols("document", "token")
.setChunkCol("target")
.setOutputCol("chunk")
.setIsArray(True)
)
data = spark.createDataFrame(
[
[
"Spark NLP is an open-source text processing library for advanced natural language processing.",
["Spark NLP", "text processing library", "natural language processing"],
]
]
).toDF("text", "target")
pipeline = (
Pipeline().setStages([documentAssembler, tokenizer, chunkAssembler]).fit(data)
)
result = pipeline.transform(data)
result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result |annotatorType |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
DocumentHashCoder
This annotator can replace dates in a column of DOCUMENT type according to the hash code of any other column. It uses the hash of the specified column and creates a new document column containing the day-shift information. The DeIdentification annotator can then deidentify the document using the shifted date information.
If the specified column contains strings that can be parsed to integers, those numbers are used to shift the dates accordingly.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: DocumentHashCoder | Scala API: DocumentHashCoder |
Show Example
import pandas as pd
data = pd.DataFrame(
{'patientID' : ['A001', 'A001',
'A003', 'A003'],
'text' : ['Chris Brown was discharged on 10/02/2022',
'Mark White was discharged on 10/04/2022',
'John was discharged on 15/03/2022',
'John Moore was discharged on 15/12/2022'
],
'dateshift' : ['10', '10',
'30', '30']
}
)
my_input_df = spark.createDataFrame(data)
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
documentHasher = DocumentHashCoder()\
.setInputCols("document")\
.setOutputCol("document2")\
.setDateShiftColumn("dateshift")
tokenizer = Tokenizer()\
.setInputCols(["document2"])\
.setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["document2", "token"])\
.setOutputCol("word_embeddings")
clinical_ner = MedicalNerModel\
.pretrained("ner_deid_subentity_augmented", "en", "clinical/models")\
.setInputCols(["document2","token", "word_embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["document2", "token", "ner"])\
.setOutputCol("ner_chunk")
de_identification = DeIdentification() \
.setInputCols(["ner_chunk", "token", "document2"]) \
.setOutputCol("deid_text") \
.setMode("obfuscate") \
.setObfuscateDate(True) \
.setDateTag("DATE") \
.setLanguage("en") \
.setObfuscateRefSource('faker') \
.setUseShiftDays(True)
pipeline_col = Pipeline().setStages([
documentAssembler,
documentHasher,
tokenizer,
embeddings,
clinical_ner,
ner_converter,
de_identification
])
empty_data = spark.createDataFrame([["", "", ""]]).toDF("patientID","text", "dateshift")
pipeline_col_model = pipeline_col.fit(empty_data)
output = pipeline_col_model.transform(my_input_df)
output.select('text', 'dateshift', 'deid_text.result').show(truncate = False)
+----------------------------------------+---------+----------------------------------------------+
|text                                    |dateshift|result                                        |
+----------------------------------------+---------+----------------------------------------------+
|Chris Brown was discharged on 10/02/2022|10       |[Ellender Manual was discharged on 20/02/2022]|
|Mark White was discharged on 10/04/2022 |10       |[Errol Bang was discharged on 20/04/2022]     |
|John was discharged on 15/03/2022       |30       |[Ariel Null was discharged on 14/04/2022]     |
|John Moore was discharged on 15/12/2022 |30       |[Jean Cotton was discharged on 14/01/2023]    |
+----------------------------------------+---------+----------------------------------------------+
DocumentLogRegClassifier
Classifies documents with a Logarithmic Regression algorithm. Currently there are no pretrained models available. Please see DocumentLogRegClassifierApproach to train your own model.
Please check out the Models Hub for available models in the future.
Input Annotator Types: TOKEN
Output Annotator Type: CATEGORY
Python API: DocumentLogRegClassifierModel | Scala API: DocumentLogRegClassifierModel |
Trains a model to classify documents with a Logarithmic Regression algorithm. Training data requires columns for text and their label. The result is a trained DocumentLogRegClassifierModel.
Input Annotator Types: TOKEN
Output Annotator Type: CATEGORY
Python API: DocumentLogRegClassifierApproach | Scala API: DocumentLogRegClassifierApproach |
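The example pipelines below end with pipeline.fit(trainingData) on a DataFrame that is assumed to exist already. A minimal sketch of such a frame (the texts and labels are hypothetical; the label column name matches the setLabelCol("category") calls below):
trainingData = spark.createDataFrame([
    ("The patient presents with severe chest pain.", "cardiology"),
    ("MRI shows a small lesion in the left temporal lobe.", "neurology")
], ["text", "category"])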
Show Example
from johnsnowlabs import *
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = nlp.Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized")
stopwords_cleaner = nlp.StopWordsCleaner() \
.setInputCols(["normalized"]) \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
stemmer = nlp.Stemmer() \
.setInputCols(["cleanTokens"]) \
.setOutputCol("stem")
# Define the document classifier and fit training data to it
logreg = medical.DocumentLogRegClassifierApproach() \
.setInputCols(["stem"]) \
.setLabelCol("category") \
.setOutputCol("prediction")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
])
model = pipeline.fit(trainingData)
from johnsnowlabs import *
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = nlp.Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized")
stopwords_cleaner = nlp.StopWordsCleaner() \
.setInputCols(["normalized"]) \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
stemmer = nlp.Stemmer() \
.setInputCols(["cleanTokens"]) \
.setOutputCol("stem")
# Define the document classifier and fit training data to it
logreg = finance.DocumentLogRegClassifierApproach() \
.setInputCols(["stem"]) \
.setLabelCol("category") \
.setOutputCol("prediction")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
])
model = pipeline.fit(trainingData)
from johnsnowlabs import *
# Define pipeline stages to prepare the data
document_assembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
tokenizer = nlp.Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")
normalizer = nlp.Normalizer() \
.setInputCols(["token"]) \
.setOutputCol("normalized")
stopwords_cleaner = nlp.StopWordsCleaner() \
.setInputCols(["normalized"]) \
.setOutputCol("cleanTokens") \
.setCaseSensitive(False)
stemmer = nlp.Stemmer() \
.setInputCols(["cleanTokens"]) \
.setOutputCol("stem")
# Define the document classifier and fit training data to it
logreg = legal.DocumentLogRegClassifierApproach() \
.setInputCols(["stem"]) \
.setLabelCol("category") \
.setOutputCol("prediction")
pipeline = Pipeline(stages=[
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
])
model = pipeline.fit(trainingData)
from johnsnowlabs import *
// Define pipeline stages to prepare the data
val document_assembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new nlp.Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
val stopwords_cleaner = new nlp.StopWordsCleaner()
.setInputCols("normalized")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val stemmer = new nlp.Stemmer()
.setInputCols("cleanTokens")
.setOutputCol("stem")
// Define the document classifier and fit training data to it
val logreg = new medical.DocumentLogRegClassifierApproach()
.setInputCols("stem")
.setLabelCol("category")
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
))
val model = pipeline.fit(trainingData)
from johnsnowlabs import *
// Define pipeline stages to prepare the data
val document_assembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new nlp.Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
val stopwords_cleaner = new nlp.StopWordsCleaner()
.setInputCols("normalized")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val stemmer = new nlp.Stemmer()
.setInputCols("cleanTokens")
.setOutputCol("stem")
// Define the document classifier and fit training data to it
val logreg = new finance.DocumentLogRegClassifierApproach()
.setInputCols("stem")
.setLabelCol("category")
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
))
val model = pipeline.fit(trainingData)
from johnsnowlabs import *
// Define pipeline stages to prepare the data
val document_assembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new nlp.Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val normalizer = new nlp.Normalizer()
.setInputCols("token")
.setOutputCol("normalized")
val stopwords_cleaner = new nlp.StopWordsCleaner()
.setInputCols("normalized")
.setOutputCol("cleanTokens")
.setCaseSensitive(false)
val stemmer = new nlp.Stemmer()
.setInputCols("cleanTokens")
.setOutputCol("stem")
// Define the document classifier and fit training data to it
val logreg = new legal.DocumentLogRegClassifierApproach()
.setInputCols("stem")
.setLabelCol("category")
.setOutputCol("prediction")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
tokenizer,
normalizer,
stopwords_cleaner,
stemmer,
logreg
))
val model = pipeline.fit(trainingData)
DrugNormalizer
Annotator which normalizes raw text from clinical documents, e.g. scraped web pages or xml documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can remove unwanted characters following a specific policy, and can apply lowercase normalization.
See Spark NLP Workshop for more examples of usage.
Input Annotator Types: DOCUMENT
Output Annotator Type: DOCUMENT
Python API: DrugNormalizer | Scala API: DrugNormalizer |
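In addition to the default behaviour used in the examples below, lowercasing and the removal policy mentioned above can be configured on the annotator. A minimal sketch, assuming the setLowercase and setPolicy setters (the policy value shown is illustrative):
# Hypothetical configuration of DrugNormalizer beyond the defaults
drugNormalizer = medical.DrugNormalizer() \
    .setInputCols(["document"]) \
    .setOutputCol("document_normalized") \
    .setLowercase(True) \
    .setPolicy("all")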
Show Example
from johnsnowlabs import *
data = spark.createDataFrame([
["Sodium Chloride/Potassium Chloride 13bag"],
["interferon alfa-2b 10 million unit ( 1 ml ) injec"],
["aspirin 10 meq/ 5 ml oral sol"]
]).toDF("text")
document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
drugNormalizer = medical.DrugNormalizer().setInputCols(["document"]).setOutputCol("document_normalized")
trainingPipeline = Pipeline(stages=[document, drugNormalizer])
result = trainingPipeline.fit(data).transform(data)
result.selectExpr("explode(document_normalized.result) as normalized_text").show(truncate=False)
+----------------------------------------------------+
|normalized_text |
+----------------------------------------------------+
|Sodium Chloride / Potassium Chloride 13 bag |
|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|aspirin 2 meq/ml oral solution |
+----------------------------------------------------+
from johnsnowlabs import *
document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
drugNormalizer = finance.DrugNormalizer().setInputCols(["document"]).setOutputCol("document_normalized")
trainingPipeline = Pipeline(stages=[document, drugNormalizer])
from johnsnowlabs import *
document = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
drugNormalizer = legal.DrugNormalizer().setInputCols(["document"]).setOutputCol("document_normalized")
trainingPipeline = Pipeline(stages=[document, drugNormalizer])
from johnsnowlabs import *
val data = Seq(
("Sodium Chloride/Potassium Chloride 13bag"),
("interferon alfa-2b 10 million unit ( 1 ml ) injec"),
("aspirin 10 meq/ 5 ml oral sol")
).toDF("text")
val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val drugNormalizer = new medical.DrugNormalizer().setInputCols("document").setOutputCol("document_normalized")
val trainingPipeline = new Pipeline().setStages(Array(document, drugNormalizer))
val result = trainingPipeline.fit(data).transform(data)
result.selectExpr("explode(document_normalized.result) as normalized_text").show(false)
+----------------------------------------------------+
|normalized_text |
+----------------------------------------------------+
|Sodium Chloride / Potassium Chloride 13 bag |
|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|aspirin 2 meq/ml oral solution |
+----------------------------------------------------+
from johnsnowlabs import *
val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val drugNormalizer = new finance.DrugNormalizer().setInputCols("document").setOutputCol("document_normalized")
val trainingPipeline = new Pipeline().setStages(Array(document, drugNormalizer))
from johnsnowlabs import *
val document = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val drugNormalizer = new legal.DrugNormalizer().setInputCols("document").setOutputCol("document_normalized")
val trainingPipeline = new Pipeline().setStages(Array(document, drugNormalizer))
EntityChunkEmbeddings
Weighted average embeddings of multiple named entity chunk annotations.
Entity Chunk Embeddings uses BERT Sentence embeddings to compute a weighted average vector representation of related entity chunks. The input to the model consists of chunks of recognized named entities. One or more entities are selected as target entities, and for each of them a list of related entities is specified (if empty, all other entities are assumed to be related).
The model looks for chunks of the target entities and then tries to pair each target entity (e.g. DRUG) with other related entities (e.g. DOSAGE, STRENGTH, FORM, etc). The criterion for pairing a target entity with another related entity is that they appear in the same sentence and the maximal syntactic distance is below a predefined threshold.
The relationship between target and related entities is one-to-many, meaning that if there are multiple instances of the same target entity within a sentence, the model will map a related entity (e.g. DOSAGE) to at most one of the instances of the target entity. For example, if there is a sentence “The patient was given 125 mg of paracetamol and metformin”, the model will pair “125 mg” to “paracetamol”, but not to “metformin”.
The output of the model is the average embedding of the chunks of each of the target entities and their related entities. It is possible to specify a particular weight for each entity type.
An entity can be defined both as a target entity and as a related entity for some other target entity. For example, we may want to compute the embeddings of SYMPTOMs and their related entities, as well as the embeddings of DRUGs and their related entities, one of which is also SYMPTOM. In such cases, it is possible to use the TARGET_ENTITY:RELATED_ENTITY notation to specify the weight of a related entity (e.g. “DRUG:SYMPTOM” to set the weight of SYMPTOM when it appears as a related entity to the target entity DRUG). The relative weights of entities for particular entity chunk embeddings are available in the annotations metadata.
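As a minimal sketch of this notation (the weight values are illustrative, not recommended settings), the weights could be passed like this, with SYMPTOM weighted differently when it is its own target entity versus when it is a related entity of DRUG:
# Hypothetical weights using the TARGET_ENTITY:RELATED_ENTITY notation
drug_chunk_embeddings = EntityChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") \
    .setInputCols(["ner_chunks", "dependencies"]) \
    .setOutputCol("drug_chunk_embeddings") \
    .setTargetEntities({"DRUG": [], "SYMPTOM": []}) \
    .setEntityWeights({"DRUG": 0.8, "DRUG:SYMPTOM": 0.2, "SYMPTOM": 1.0})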
This model is a subclass of BertSentenceEmbeddings and shares all parameters with it. It can load any pretrained BertSentenceEmbeddings model. The default model is "sbiobert_base_cased_mli" from clinical/models.
Other available models can be found at Models Hub.
Input Annotator Types: DEPENDENCY, CHUNK
Output Annotator Type: SENTENCE_EMBEDDINGS
Python API: EntityChunkEmbeddingsModel | Scala API: EntityChunkEmbeddingsModel |
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp_jsl.common import *
from sparknlp.annotator import *
from sparknlp.training import *
import sparknlp_jsl
from sparknlp_jsl.base import *
from sparknlp_jsl.annotator import *
from pyspark.ml import Pipeline
documenter = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("documents")
sentence_detector = SentenceDetector() \
.setInputCols("documents") \
.setOutputCol("sentences")
tokenizer = Tokenizer() \
.setInputCols("sentences") \
.setOutputCol("tokens")
embeddings = WordEmbeddingsModel() \
.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentences", "tokens"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel()\
.pretrained("ner_posology_large", "en", "clinical/models")\
.setInputCols(["sentences", "tokens", "embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols("sentences", "tokens", "ner")\
.setOutputCol("ner_chunks")
pos_tager = PerceptronModel()\
.pretrained("pos_clinical", "en", "clinical/models")\
.setInputCols("sentences", "tokens")\
.setOutputCol("pos_tags")
dependency_parser = DependencyParserModel()\
.pretrained("dependency_conllu", "en")\
.setInputCols(["sentences", "pos_tags", "tokens"])\
.setOutputCol("dependencies")
drug_chunk_embeddings = EntityChunkEmbeddings()\
.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunks", "dependencies"])\
.setOutputCol("drug_chunk_embeddings")\
.setMaxSyntacticDistance(3)\
.setTargetEntities({"DRUG": []})\
.setEntityWeights({"DRUG": 0.8, "STRENGTH": 0.2, "DOSAGE": 0.2, "FORM": 0.5})
sampleData = "The parient was given metformin 125 mg, 250 mg of coumadin and then one pill paracetamol"
data = SparkContextForTest.spark.createDataFrame([[sampleData]]).toDF("text")
pipeline = Pipeline().setStages([
documenter,
sentence_detector,
tokenizer,
embeddings,
ner_model,
ner_converter,
pos_tager,
dependency_parser,
drug_chunk_embeddings])
results = pipeline.fit(data).transform(data)
results = results \
.selectExpr("explode(drug_chunk_embeddings) AS drug_chunk") \
.selectExpr("drug_chunk.result", "slice(drug_chunk.embeddings, 1, 5) AS drug_embedding") \
.cache()
results.show(truncate=False)
+-----------------------------+-----------------------------------------------------------------+
| result| drug_embedding|
+-----------------------------+-----------------------------------------------------------------+
|metformin 125 mg |[-0.267413, 0.07614058, -0.5620966, 0.83838946, 0.8911504] |
|250 mg coumadin |[0.22319649, -0.07094894, -0.6885556, 0.79176235, 0.82672405] |
|one pill paracetamol |[-0.10939768, -0.29242, -0.3574444, 0.3981813, 0.79609615] |
+-----------------------------+-----------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Tokenizer, WordEmbeddingsModel}
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverterInternal}
import com.johnsnowlabs.nlp.annotators.embeddings.EntityChunkEmbeddings
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentences")
val tokenizer = new Tokenizer()
.setInputCols("sentences")
.setOutputCol("tokens")
val wordEmbeddings = WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens"))
.setOutputCol("word_embeddings")
val nerModel = MedicalNerModel
.pretrained("ner_posology_large", "en", "clinical/models")
.setInputCols(Array("sentences", "tokens", "word_embeddings"))
.setOutputCol("ner")
val nerConverter = new NerConverterInternal()
.setInputCols("sentences", "tokens", "ner")
.setOutputCol("ner_chunks")
val posTager = PerceptronModel
.pretrained("pos_clinical", "en", "clinical/models")
.setInputCols("sentences", "tokens")
.setOutputCol("pos_tags")
val dependencyParser = DependencyParserModel
.pretrained("dependency_conllu", "en")
.setInputCols(Array("sentences", "pos_tags", "tokens"))
.setOutputCol("dependencies")
val drugChunkEmbeddings = EntityChunkEmbeddings
.pretrained("sbiobert_base_cased_mli","en","clinical/models")
.setInputCols(Array("ner_chunks", "dependencies"))
.setOutputCol("drug_chunk_embeddings")
.setMaxSyntacticDistance(3)
.setTargetEntities(Map("DRUG" -> List()))
.setEntityWeights(Map[String, Float]("DRUG" -> 0.8f, "STRENGTH" -> 0.2f, "DOSAGE" -> 0.2f, "FORM" -> 0.5f))
val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
wordEmbeddings,
nerModel,
nerConverter,
posTager,
dependencyParser,
drugChunkEmbeddings))
val sampleText = "The patient was given metformin 125 mg, 250 mg of coumadin and then one pill paracetamol."
val testDataset = Seq("").toDS.toDF("text")
val result = pipeline.fit(emptyDataset).transform(testDataset)
result
.selectExpr("explode(drug_chunk_embeddings) AS drug_chunk")
.selectExpr("drug_chunk.result", "slice(drug_chunk.embeddings, 1, 5) AS drugEmbedding")
.show(truncate=false)
+-----------------------------+-----------------------------------------------------------------+
| result| drugEmbedding|
+-----------------------------+-----------------------------------------------------------------+
|metformin 125 mg |[-0.267413, 0.07614058, -0.5620966, 0.83838946, 0.8911504] |
|250 mg coumadin |[0.22319649, -0.07094894, -0.6885556, 0.79176235, 0.82672405] |
|one pill paracetamol |[-0.10939768, -0.29242, -0.3574444, 0.3981813, 0.79609615] |
+-----------------------------+----------------------------------------------------------------+
FeaturesAssembler
The FeaturesAssembler is used to collect features from different columns. It can collect features from single-value columns (anything which can be cast to a float; if the cast fails, the value is set to 0), array columns, or Spark NLP annotations (if the annotation is an embedding, it takes the embedding, otherwise it tries to cast the result field). The output of the transformer is a FEATURE_VECTOR annotation (the numeric vector is in the embeddings field).
Input Annotator Types: NONE
Output Annotator Type: "feature_vector"
Python API: FeaturesAssembler | Scala API: FeaturesAssembler |
Show Example
from johnsnowlabs import *
features_asm = medical.FeaturesAssembler() \
.setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
.setOutputCol("features")
gen_clf = medical.GenericClassifierApproach() \
.setLabelColumn("target") \
.setInputCols(["features"]) \
.setOutputCol("prediction") \
.setModelFile("/path/to/graph_file.pb") \
.setEpochsNumber(50) \
.setBatchSize(100) \
.setFeatureScaling("zscore") \
.setLearningRate(0.001) \
.setFixImbalance(True) \
.setOutputLogsPath("logs") \
.setValidationSplit(0.2) # keep 20% of the data for validation purposes
pipeline = Pipeline(stages=[
features_asm,
gen_clf
])
clf_model = pipeline.fit(data)
from johnsnowlabs import *
features_asm = finance.FeaturesAssembler() \
.setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
.setOutputCol("features")
gen_clf = finance.GenericClassifierApproach() \
.setLabelColumn("target") \
.setInputCols(["features"]) \
.setOutputCol("prediction") \
.setModelFile("/path/to/graph_file.pb") \
.setEpochsNumber(50) \
.setBatchSize(100) \
.setFeatureScaling("zscore") \
.setLearningRate(0.001) \
.setFixImbalance(True) \
.setOutputLogsPath("logs") \
.setValidationSplit(0.2) # keep 20% of the data for validation purposes
pipeline = Pipeline(stages=[
features_asm,
gen_clf
])
clf_model = pipeline.fit(data)
from johnsnowlabs import *
features_asm = legal.FeaturesAssembler() \
.setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
.setOutputCol("features")
gen_clf = legal.GenericClassifierApproach() \
.setLabelColumn("target") \
.setInputCols(["features"]) \
.setOutputCol("prediction") \
.setModelFile("/path/to/graph_file.pb") \
.setEpochsNumber(50) \
.setBatchSize(100) \
.setFeatureScaling("zscore") \
.setLearningRate(0.001) \
.setFixImbalance(True) \
.setOutputLogsPath("logs") \
.setValidationSplit(0.2) # keep 20% of the data for validation purposes
pipeline = Pipeline(stages=[
features_asm,
gen_clf
])
clf_model = pipeline.fit(data)
from johnsnowlabs import *
val features_asm = new medical.FeaturesAssembler()
.setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
.setOutputCol("features")
val gen_clf = new medical.GenericClassifierApproach()
.setLabelColumn("target")
.setInputCols("features")
.setOutputCol("prediction")
.setModelFile("/path/to/graph_file.pb")
.setEpochsNumber(50)
.setBatchSize(100)
.setFeatureScaling("zscore")
.setLearningRate(0.001f)
.setFixImbalance(true)
.setOutputLogsPath("logs")
.setValidationSplit(0.2f) // keep 20% of the data for validation purposes
val pipeline = new Pipeline().setStages(Array(
features_asm,
gen_clf
))
val clf_model = pipeline.fit(data)
from johnsnowlabs import *
val features_asm = new finance.FeaturesAssembler()
.setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
.setOutputCol("features")
val gen_clf = new finance.GenericClassifierApproach()
.setLabelColumn("target")
.setInputCols("features")
.setOutputCol("prediction")
.setModelFile("/path/to/graph_file.pb")
.setEpochsNumber(50)
.setBatchSize(100)
.setFeatureScaling("zscore")
.setLearningRate(0.001f)
.setFixImbalance(true)
.setOutputLogsPath("logs")
.setValidationSplit(0.2f) // keep 20% of the data for validation purposes
val pipeline = new Pipeline().setStages(Array(
features_asm,
gen_clf
))
val clf_model = pipeline.fit(data)
from johnsnowlabs import *
val features_asm = new legal.FeaturesAssembler()
.setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
.setOutputCol("features")
val gen_clf = new legal.GenericClassifierApproach()
.setLabelColumn("target")
.setInputCols("features")
.setOutputCol("prediction")
.setModelFile("/path/to/graph_file.pb")
.setEpochsNumber(50)
.setBatchSize(100)
.setFeatureScaling("zscore")
.setLearningRate(0.001f)
.setFixImbalance(true)
.setOutputLogsPath("logs")
.setValidationSplit(0.2f) // keep 20% of the data for validation purposes
val pipeline = new Pipeline().setStages(Array(
features_asm,
gen_clf
))
val clf_model = pipeline.fit(data)
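The training pipelines above stop at fit; to see what the FeaturesAssembler actually produces, the following minimal Python sketch (using a hypothetical toy DataFrame and column names) assembles three numeric columns and prints the resulting FEATURE_VECTOR, whose numeric values are stored in the embeddings field as described above.
from johnsnowlabs import *
from pyspark.ml import Pipeline
# Hypothetical toy DataFrame with three numeric feature columns.
data = spark.createDataFrame([(0.5, 1.0, 3.0), (1.5, 0.0, 2.0)], ["feature_1", "feature_2", "feature_3"])
features_asm = medical.FeaturesAssembler() \
    .setInputCols(["feature_1", "feature_2", "feature_3"]) \
    .setOutputCol("features")
assembled = Pipeline(stages=[features_asm]).fit(data).transform(data)
# The numeric vector lives in the embeddings field of the FEATURE_VECTOR annotation.
assembled.selectExpr("explode(features) AS f") \
    .selectExpr("f.embeddings AS feature_vector") \
    .show(truncate=False)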
GenericClassifier
Creates a generic single-label classifier which uses pre-generated TensorFlow graphs. The model operates on FEATURE_VECTOR annotations, which can be produced with the FeaturesAssembler, so a FeaturesAssembler stage is required to create its input.
Input Annotator Types: FEATURE_VECTOR
Output Annotator Type: CATEGORY
Python API: GenericClassifierModel | Scala API: GenericClassifierModel |
Trains a TensorFlow model for generic classification of feature vectors. It takes FEATURE_VECTOR annotations from
FeaturesAssembler
as input, classifies them and outputs CATEGORY annotations.
Please see the Parameters section for required training parameters.
For a more extensive example please see the Spark NLP Workshop.
Input Annotator Types: FEATURE_VECTOR
Output Annotator Type: CATEGORY
Python API: GenericClassifierApproach | Scala API: GenericClassifierApproach |
Show Example
from johnsnowlabs import *
features_asm = medical.FeaturesAssembler() \
.setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
.setOutputCol("features")
gen_clf = medical.GenericClassifierApproach() \
.setLabelColumn("target") \
.setInputCols(["features"]) \
.setOutputCol("prediction") \
.setModelFile("/path/to/graph_file.pb") \
.setEpochsNumber(50) \
.setBatchSize(100) \
.setFeatureScaling("zscore") \
.setLearningRate(0.001) \
.setFixImbalance(True) \
.setOutputLogsPath("logs") \
.setValidationSplit(0.2) # keep 20% of the data for validation purposes
pipeline = Pipeline().setStages([
features_asm,
gen_clf
])
clf_model = pipeline.fit(data)
from johnsnowlabs import *
features_asm = finance.FeaturesAssembler() \
.setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
.setOutputCol("features")
gen_clf = finance.GenericClassifierApproach() \
.setLabelColumn("target") \
.setInputCols(["features"]) \
.setOutputCol("prediction") \
.setModelFile("/path/to/graph_file.pb") \
.setEpochsNumber(50) \
.setBatchSize(100) \
.setFeatureScaling("zscore") \
.setLearningRate(0.001) \
.setFixImbalance(True) \
.setOutputLogsPath("logs") \
.setValidationSplit(0.2) # keep 20% of the data for validation purposes
pipeline = Pipeline().setStages([
features_asm,
gen_clf
])
clf_model = pipeline.fit(data)
from johnsnowlabs import *
features_asm = legal.FeaturesAssembler() \
.setInputCols(["feature_1", "feature_2", "...", "feature_n"]) \
.setOutputCol("features")
gen_clf = legal.GenericClassifierApproach() \
.setLabelColumn("target") \
.setInputCols(["features"]) \
.setOutputCol("prediction") \
.setModelFile("/path/to/graph_file.pb") \
.setEpochsNumber(50) \
.setBatchSize(100) \
.setFeatureScaling("zscore") \
.setLearningRate(0.001) \
.setFixImbalance(True) \
.setOutputLogsPath("logs") \
.setValidationSplit(0.2) # keep 20% of the data for validation purposes
pipeline = Pipeline().setStages([
features_asm,
gen_clf
])
clf_model = pipeline.fit(data)
from johnsnowlabs import *
val features_asm = new medical.FeaturesAssembler()
.setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
.setOutputCol("features")
val gen_clf = new medical.GenericClassifierApproach()
.setLabelColumn("target")
.setInputCols("features")
.setOutputCol("prediction")
.setModelFile("/path/to/graph_file.pb")
.setEpochsNumber(50)
.setBatchSize(100)
.setFeatureScaling("zscore")
.setLearningRate(0.001f)
.setFixImbalance(true)
.setOutputLogsPath("logs")
.setValidationSplit(0.2f) // keep 20% of the data for validation purposes
val pipeline = new Pipeline().setStages(Array(
features_asm,
gen_clf
))
val clf_model = pipeline.fit(data)
from johnsnowlabs import *
val features_asm = new finance.FeaturesAssembler()
.setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
.setOutputCol("features")
val gen_clf = new finance.GenericClassifierApproach()
.setLabelColumn("target")
.setInputCols("features")
.setOutputCol("prediction")
.setModelFile("/path/to/graph_file.pb")
.setEpochsNumber(50)
.setBatchSize(100)
.setFeatureScaling("zscore")
.setLearningRate(0.001f)
.setFixImbalance(true)
.setOutputLogsPath("logs")
.setValidationSplit(0.2f) // keep 20% of the data for validation purposes
val pipeline = new Pipeline().setStages(Array(
features_asm,
gen_clf
))
val clf_model = pipeline.fit(data)
from johnsnowlabs import *
val features_asm = new legal.FeaturesAssembler()
.setInputCols(Array("feature_1", "feature_2", "...", "feature_n"))
.setOutputCol("features")
val gen_clf = new legal.GenericClassifierApproach()
.setLabelColumn("target")
.setInputCols("features")
.setOutputCol("prediction")
.setModelFile("/path/to/graph_file.pb")
.setEpochsNumber(50)
.setBatchSize(100)
.setFeatureScaling("zscore")
.setLearningRate(0.001f)
.setFixImbalance(true)
.setOutputLogsPath("logs")
.setValidationSplit(0.2f) // keep 20% of the data for validation purposes
val pipeline = new Pipeline().setStages(Array(
features_asm,
gen_clf
))
val clf_model = pipeline.fit(data)
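The classifier examples above also stop at fitting. A hedged sketch of inference, assuming the clf_model and data objects from the Python example above and that the predicted label is returned in the result field of the CATEGORY annotation:
# Apply the fitted classifier and read the predicted label per document.
predictions = clf_model.transform(data)
predictions.selectExpr("explode(prediction) AS pred") \
    .selectExpr("pred.result AS predicted_label", "pred.metadata AS meta") \
    .show(truncate=False)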
IOBTagger
Merges token tags and NER labels from chunks in the specified format. For example, the output columns of NerConverter and Tokenizer can be used as inputs for the merge.
Input Annotator Types: TOKEN, CHUNK
Output Annotator Type: NAMED_ENTITY
Python API: IOBTagger | Scala API: IOBTagger |
Show Example
from johnsnowlabs import *
# Pipeline stages are defined where NER is done. NER is converted to chunks.
data = spark.createDataFrame([["A 63-year-old man presents to the hospital ..."]]).toDF("text")
docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(["sentence", "token"]).setOutputCol("embs")
nerModel = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models").setInputCols(["sentence", "token", "embs"]).setOutputCol("ner")
nerConverter = nlp.NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")
# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = medical.IOBTagger().setInputCols(["token", "ner_chunk"]).setOutputCol("ner_label")
pipeline = Pipeline(stages=[docAssembler, sentenceDetector, tokenizer, embeddings, nerModel, nerConverter, iobTagger])
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(ner_label) as a") \
.selectExpr("a.begin","a.end","a.result as chunk","a.metadata.word as word") \
.where("chunk!='O'").show(5, False)
+-----+---+-----------+-----------+
|begin|end|chunk |word |
+-----+---+-----------+-----------+
|5 |15 |B-Age |63-year-old|
|17 |19 |B-Gender |man |
|64 |72 |B-Modifier |recurrent |
|98 |107|B-Diagnosis|cellulitis |
|110 |119|B-Diagnosis|pneumonias |
+-----+---+-----------+-----------+
from johnsnowlabs import *
# Pipeline stages are defined where NER is done. NER is converted to chunks.
docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(["sentence", "token"]).setOutputCol("embs")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models").setInputCols(["sentence", "token", "embs"]).setOutputCol("ner")
nerConverter = nlp.NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")
# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = finance.IOBTagger().setInputCols(["token", "ner_chunk"]).setOutputCol("ner_label")
pipeline = Pipeline(stages=[docAssembler, sentenceDetector, tokenizer, embeddings, ner_model, nerConverter, iobTagger])
from johnsnowlabs import *
# Pipeline stages are defined where NER is done. NER is converted to chunks.
docAssembler = nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
sentenceDetector = nlp.SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = nlp.Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(["sentence", "token"]).setOutputCol("embs")
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models").setInputCols(["sentence", "token", "embs"]).setOutputCol("ner")
nerConverter = nlp.NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")
# Define the IOB tagger, which needs tokens and chunks as input. Show results.
iobTagger = legal.IOBTagger().setInputCols(["token", "ner_chunk"]).setOutputCol("ner_label")
pipeline = Pipeline(stages=[docAssembler, sentenceDetector, tokenizer, embeddings, ner_model, nerConverter, iobTagger])
from johnsnowlabs import *
// Pipeline stages are defined where NER is done. NER is converted to chunks.
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence", "token")).setOutputCol("embs")
val nerModel = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models").setInputCols(Array("sentence", "token", "embs")).setOutputCol("ner")
val nerConverter = new nlp.NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
// Define the IOB tagger, which needs tokens and chunks as input. Show results.
val iobTagger = new medical.IOBTagger().setInputCols(Array("token", "ner_chunk")).setOutputCol("ner_label")
val pipeline = new Pipeline().setStages(Array(docAssembler, sentenceDetector, tokenizer, embeddings, nerModel, nerConverter, iobTagger))
val result = pipeline.fit(data).transform(data)
result.selectExpr("explode(ner_label) as a")
.selectExpr("a.begin","a.end","a.result as chunk","a.metadata.word as word")
.where("chunk!='O'").show(5, false)
+-----+---+-----------+-----------+
|begin|end|chunk |word |
+-----+---+-----------+-----------+
|5 |15 |B-Age |63-year-old|
|17 |19 |B-Gender |man |
|64 |72 |B-Modifier |recurrent |
|98 |107|B-Diagnosis|cellulitis |
|110 |119|B-Diagnosis|pneumonias |
+-----+---+-----------+-----------+
from johnsnowlabs import *
// Pipeline stages are defined where NER is done. NER is converted to chunks.
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence", "token")).setOutputCol("embs")
val ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models").setInputCols(Array("sentence", "token", "embs")).setOutputCol("ner")
val nerConverter = new nlp.NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
// Define the IOB tagger, which needs tokens and chunks as input. Show results.
val iobTagger = new finance.IOBTagger().setInputCols(Array("token", "ner_chunk")).setOutputCol("ner_label")
val pipeline = new Pipeline().setStages(Array(docAssembler, sentenceDetector, tokenizer, embeddings, ner_model, nerConverter, iobTagger))
from johnsnowlabs import *
// Pipeline stages are defined where NER is done. NER is converted to chunks.
val docAssembler = new nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector().setInputCols("document").setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer().setInputCols("sentence").setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models").setInputCols(Array("sentence", "token")).setOutputCol("embs")
val ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models").setInputCols(Array("sentence", "token", "embs")).setOutputCol("ner")
val nerConverter = new nlp.NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
// Define the IOB tagger, which needs tokens and chunks as input. Show results.
val iobTagger = new legal.IOBTagger().setInputCols(Array("token", "ner_chunk")).setOutputCol("ner_label")
val pipeline = new Pipeline().setStages(Array(docAssembler, sentenceDetector, tokenizer, embeddings, ner_model, nerConverter, iobTagger))
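The example output above hides tokens tagged O; to inspect the full IOB sequence for every token, a minimal sketch (assuming the result DataFrame produced by the medical Python pipeline above):
# Show the IOB tag of every token, including tokens outside any entity ("O").
result.selectExpr("explode(ner_label) AS a") \
    .selectExpr("a.metadata.word AS word", "a.result AS iob_tag") \
    .show(20, truncate=False)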
NerChunker
Extracts phrases that fit into a known pattern using the NER tags. Useful for entity groups with neighboring tokens when there is no pretrained NER model that covers them directly. A regex needs to be provided to extract the tokens between entities.
Input Annotator Types: DOCUMENT, NAMED_ENTITY
Output Annotator Type: CHUNK
Python API: NerChunker | Scala API: NerChunker |
Show Example
from johnsnowlabs import *
# Defining pipeline stages for NER
data= spark.createDataFrame([["She has cystic cyst on her kidney."]]).toDF("text")
documentAssembler= nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector= nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(False)
tokenizer= nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence","token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
ner = medical.NerModel.pretrained("ner_radiology", "en", "clinical/models") \
.setInputCols(["sentence","token","embeddings"]) \
.setOutputCol("ner") \
.setIncludeConfidence(True)
# Define the NerChunker to combine the tagged entities into chunks
chunker = medical.NerChunker() \
.setInputCols(["sentence","ner"]) \
.setOutputCol("ner_chunk") \
.setRegexParsers(["<ImagingFindings>.*<BodyPart>"])
pipeline= Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner,
chunker
])
result = pipeline.fit(data).transform(data)
# Show results:
result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")
.selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=False)
+------+-----------------+
|word |ner |
+------+-----------------+
|She |O |
|has |O |
|cystic|B-ImagingFindings|
|cyst |I-ImagingFindings|
|on |O |
|her |O |
|kidney|B-BodyPart |
|. |O |
+------+-----------------+
result.select("ner_chunk.result").show(truncate=False)
+---------------------------+
|result |
+---------------------------+
|[cystic cyst on her kidney]|
+---------------------------+
from johnsnowlabs import *
# Defining pipeline stages for NER
documentAssembler= nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector= nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(False)
tokenizer= nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence","token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
# Define the NerChunker to combine the tagged entities into chunks
chunker = finance.NerChunker() \
.setInputCols(["sentence","ner"]) \
.setOutputCol("ner_chunk") \
.setRegexParsers(["<ImagingFindings>.*<BodyPart>"])
pipeline= Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
chunker
])
from johnsnowlabs import *
# Defining pipeline stages for NER
documentAssembler= nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector= nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence") \
.setUseAbbreviations(False)
tokenizer= nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["sentence","token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(False)
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
# Define the NerChunker to combine the tagged entities into chunks
chunker = legal.NerChunker() \
.setInputCols(["sentence","ner"]) \
.setOutputCol("ner_chunk") \
.setRegexParsers(["<ImagingFindings>.*<BodyPart>"])
pipeline= Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
chunker
])
from johnsnowlabs import *
// Defining pipeline stages for NER
val data= Seq("She has cystic cyst on her kidney.").toDF("text")
val documentAssembler=new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector=new nlp.SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setUseAbbreviations(false)
val tokenizer=new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
.setCaseSensitive(false)
val ner = medical.NerModel.pretrained("ner_radiology", "en", "clinical/models")
.setInputCols(Array("sentence","token","embeddings"))
.setOutputCol("ner")
.setIncludeConfidence(true)
// Define the NerChunker to combine the tagged entities into chunks
val chunker = new medical.NerChunker()
.setInputCols(Array("sentence","ner"))
.setOutputCol("ner_chunk")
.setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))
val pipeline=new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner,
chunker
))
val result = pipeline.fit(data).transform(data)
// Show results:
//
// result.selectExpr("explode(arrays_zip(ner.metadata , ner.result))")
// .selectExpr("col['0'].word as word" , "col['1'] as ner").show(truncate=false)
// +------+-----------------+
// |word |ner |
// +------+-----------------+
// |She |O |
// |has |O |
// |cystic|B-ImagingFindings|
// |cyst |I-ImagingFindings|
// |on |O |
// |her |O |
// |kidney|B-BodyPart |
// |. |O |
// +------+-----------------+
// result.select("ner_chunk.result").show(truncate=false)
// +---------------------------+
// |result |
// +---------------------------+
// |[cystic cyst on her kidney]|
// +---------------------------+
//
from johnsnowlabs import *
// Defining pipeline stages for NER
val documentAssembler=new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector=new nlp.SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setUseAbbreviations(false)
val tokenizer=new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
.setCaseSensitive(false)
val ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
// Define the NerChunker to combine the tagged entities into chunks
val chunker = new finance.NerChunker()
.setInputCols(Array("sentence","ner"))
.setOutputCol("ner_chunk")
.setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))
val pipeline=new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
chunker
))
from johnsnowlabs import *
// Defining pipeline stages for NER
val documentAssembler=new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector=new nlp.SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
.setUseAbbreviations(false)
val tokenizer=new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence","token"))
.setOutputCol("embeddings")
.setCaseSensitive(false)
val ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
// Define the NerChunker to combine the tagged entities into chunks
val chunker = new legal.NerChunker()
.setInputCols(Array("sentence","ner"))
.setOutputCol("ner_chunk")
.setRegexParsers(Array("<ImagingFindings>.*<BodyPart>"))
val pipeline=new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
chunker
))
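Each matched chunk is a regular CHUNK annotation, so its character offsets are also available. A minimal sketch for inspecting them, assuming the result DataFrame from the medical Python example above:
# Inspect each matched chunk together with its character offsets in the text.
result.selectExpr("explode(ner_chunk) AS chunk") \
    .selectExpr("chunk.begin", "chunk.end", "chunk.result AS matched_phrase") \
    .show(truncate=False)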
NerConverterInternal
Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label. Chunks with no associated entity (tagged “O”) are filtered out.
This licensed annotator adds extra functionality to the open-source version by adding the following parameters: blackList, greedyMode, threshold, and ignoreStopWords, which are not available in the NerConverter annotator.
See also Inside–outside–beginning (tagging) for more information.
Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY
Output Annotator Type: CHUNK
Python API: NerConverterInternal | Scala API: NerConverterInternal |
Show Example
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("jsl_ner")
jsl_ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "jsl_ner"]) \
.setOutputCol("jsl_ner_chunk")
jsl_ner_converter_internal = medical.NerConverterInternal()\
.setInputCols(["sentence","token","jsl_ner"])\
.setOutputCol("replaced_ner_chunk")\
.setReplaceDictResource("replace_dict.csv","text", {"delimiter":","})
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
jsl_ner,
jsl_ner_converter,
jsl_ner_converter_internal
])
result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
#.setCustomBounds(["\n\n"])
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
fin_ner = finance.NerModel.pretrained("finner_deid", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
#.setLabelCasing("upper")
ner_converter = finance.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ORG": "PARTY"}) # Replace "ORG" entity as "PARTY"
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
fin_ner,
ner_converter])
result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
#.setCustomBounds(["\n\n"])
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
legal_ner = legal.NerModel.pretrained("legner_contract_doc_parties", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
#.setLabelCasing("upper")
ner_converter = legal.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")\
.setReplaceLabels({"ALIAS": "PARTY"}) # "ALIAS" are secondary names of companies, so let's extract them also as PARTY
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legal_ner,
ner_converter])
result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = nlp.SentenceDetectorDLModel
.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = nlp.WordEmbeddingsModel
.pretrained("embeddings_clinical", "en","clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val jsl_ner = medical.NerModel
.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token","embeddings"))
.setOutputCol("jsl_ner")
val jsl_ner_converter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "jsl_ner"))
.setOutputCol("jsl_ner_chunk")
val jsl_ner_converter_internal = new medical.NerConverterInternal()
.setInputCols(Array("sentence", "token", "jsl_ner"))
.setOutputCol("replaced_ner_chunk")
.setReplaceDictResource("replace_dict.csv","text", {"delimiter":","})
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
jsl_ner,
jsl_ner_converter,
jsl_ner_converter_internal
))
val result = pipeline.fit(data).transform(data)
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = nlp.SentenceDetectorDLModel
.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = nlp.RoBertaEmbeddings
.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val fin_ner = finance.NerModel
.pretrained("finner_deid", "en", "finance/models")
.setInputCols(Array("sentence", "token","embeddings"))
.setOutputCol("ner")
val ner_converter = new finance.NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setReplaceLabels({"ORG": "PARTY"})
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
fin_ner,
ner_converter
))
val result = pipeline.fit(data).transform(data)
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = nlp.SentenceDetectorDLModel
.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = nlp.RoBertaEmbeddings
.pretrained("roberta_embeddings_legal_roberta_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val legal_ner = legal.NerModel
.pretrained("legner_contract_doc_parties", "en", "legal/models")
.setInputCols(Array("sentence", "token","embeddings"))
.setOutputCol("ner")
val ner_converter = new legal.NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setReplaceLabels({"ALIAS": "PARTY"})
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
legal_ner,
ner_converter
))
val result = pipeline.fit(data).transform(data)
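The examples above focus on label replacement; the licensed-only parameters listed in the description (blackList, greedyMode, threshold, ignoreStopWords) can be sketched as follows. This is a hedged illustration: the setter names are assumed from the parameter names, and the entity label and threshold value are hypothetical.
# Hypothetical entity label and confidence threshold, for illustration only.
ner_converter_filtered = medical.NerConverterInternal() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("ner_chunk") \
    .setBlackList(["Gender"]) \
    .setThreshold(0.7)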
NerDisambiguator
Instantiated / pretrained model of the NerDisambiguator. Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms.
Input Annotator Types: CHUNK, SENTENCE_EMBEDDINGS
Output Annotator Type: DISAMBIGUATION
Python API: NerDisambiguatorModel | Scala API: NerDisambiguatorModel |
Links words of interest, such as names of persons, locations and companies, from an input text document to a corresponding unique entity in a target Knowledge Base (KB). Words of interest are called Named Entities (NEs), mentions, or surface forms. The model needs extracted CHUNKS and SENTENCE_EMBEDDINGS type input from e.g. SentenceEmbeddings and NerConverter.
Input Annotator Types: CHUNK, SENTENCE_EMBEDDINGS
Output Annotator Type: DISAMBIGUATION
Python API: NerDisambiguator | Scala API: NerDisambiguator |
Show Example
from johnsnowlabs import *
# Extracting Person identities
# First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.
data = spark.createDataFrame([["The show also had a contestant named Donald Trump who later defeated Christina Aguilera ..."]]) \
.toDF("text")
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
.setInputCols(["sentence","embeddings"]) \
.setOutputCol("sentence_embeddings")
ner_model = nlp.NerDLModel.pretrained() \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk") \
.setWhiteList(["PER"])
# Then the extracted entities can be disambiguated.
disambiguator = medical.NerDisambiguator() \
.setS3KnowledgeBaseName("i-per") \
.setInputCols(["ner_chunk", "sentence_embeddings"]) \
.setOutputCol("disambiguation") \
.setNumFirstChars(5)
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
sentence_embeddings,
ner_model,
ner_converter,
disambiguator])
model = nlpPipeline.fit(data)
result = model.transform(data)
# Show results
result.selectExpr("explode(disambiguation)")
.selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, False)
+------------------+------------------------------------------------------------------------------------------------------------------------+
|chunk |result |
+------------------+------------------------------------------------------------------------------------------------------------------------+
|Donald Trump      |https://en.wikipedia.org/?curid=4848272, https://en.wikipedia.org/?curid=31698421, https://en.wikipedia.org/?curid=55907961|
|Christina Aguilera|https://en.wikipedia.org/?curid=144171, https://en.wikipedia.org/?curid=6636454                                            |
+------------------+------------------------------------------------------------------------------------------------------------------------+
from johnsnowlabs import *
# Extracting Person identities
# First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
.setInputCols(["sentence","embeddings"]) \
.setOutputCol("sentence_embeddings")
ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")\
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk") \
.setWhiteList(["PER"])
# Then the extracted entities can be disambiguated.
disambiguator = finance.NerDisambiguator() \
.setInputCols(["ner_chunk", "sentence_embeddings"]) \
.setOutputCol("disambiguation") \
.setNumFirstChars(5)
# Optionally: .setS3KnowledgeBaseName("i-per")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
sentence_embeddings,
ner_model,
ner_converter,
disambiguator])
from johnsnowlabs import *
# Extracting Person identities
# First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.
documentAssembler = nlp.DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer() \
.setInputCols(["sentence"]) \
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained() \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
sentence_embeddings = nlp.SentenceEmbeddings() \
.setInputCols(["sentence","embeddings"]) \
.setOutputCol("sentence_embeddings")
ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk") \
.setWhiteList(["PER"])
# Then the extracted entities can be disambiguated.
disambiguator = legal.NerDisambiguator() \
.setInputCols(["ner_chunk", "sentence_embeddings"]) \
.setOutputCol("disambiguation") \
.setNumFirstChars(5)
# Optionally: .setS3KnowledgeBaseName("i-per")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
sentence_embeddings,
ner_model,
ner_converter,
disambiguator])
from johnsnowlabs import *
// Extracting Person identities
// First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.
val data = Seq("The show also had a contestant named Donald Trump who later defeated Christina Aguilera ...")
.toDF("text")
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = nlp.WordEmbeddingsModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val sentence_embeddings = new nlp.SentenceEmbeddings()
.setInputCols(Array("sentence","embeddings"))
.setOutputCol("sentence_embeddings")
val ner_model = nlp.NerDLModel.pretrained()
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList("PER")
// Then the extracted entities can be disambiguated.
val disambiguator = new medical.NerDisambiguator()
.setS3KnowledgeBaseName("i-per")
.setInputCols(Array("ner_chunk", "sentence_embeddings"))
.setOutputCol("disambiguation")
.setNumFirstChars(5)
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
sentence_embeddings,
ner_model,
ner_converter,
disambiguator))
val model = nlpPipeline.fit(data)
val result = model.transform(data)
// Show results
//
// result.selectExpr("explode(disambiguation)")
// .selectExpr("col.metadata.chunk as chunk", "col.result as result").show(5, false)
// +------------------+------------------------------------------------------------------------------------------------------------------------+
// |chunk |result |
// +------------------+------------------------------------------------------------------------------------------------------------------------+
// |Donald Trump |https://en.wikipedia.org/?curid=4848272, https://en.wikipedia.org/?curid=31698421, https://en.wikipedia.org/?curid=55907961|
// |Christina Aguilera|https://en.wikipedia.org/?curid=144171, https://en.wikipedia.org/?curid=6636454 |
// +------------------+------------------------------------------------------------------------------------------------------------------------+
//
from johnsnowlabs import *
// Extracting Person identities
// First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.
val data = Seq("The show also had a contestant named Donald Trump who later defeated Christina Aguilera ...")
.toDF("text")
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = nlp.WordEmbeddingsModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val sentence_embeddings = new nlp.SentenceEmbeddings()
.setInputCols(Array("sentence","embeddings"))
.setOutputCol("sentence_embeddings")
val ner_model = finance.NerModel.pretrained("finner_orgs_prods_alias","en","finance/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList("PER")
// Then the extracted entities can be disambiguated.
val disambiguator = new finance.NerDisambiguator()
// .setS3KnowledgeBaseName("i-per")
.setInputCols(Array("ner_chunk", "sentence_embeddings"))
.setOutputCol("disambiguation")
.setNumFirstChars(5)
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
sentence_embeddings,
ner_model,
ner_converter,
disambiguator))
from johnsnowlabs import *
// Extracting Person identities
// First define pipeline stages that extract entities and embeddings. Entities are filtered for PER type entities.
val data = Seq("The show also had a contestant named Donald Trump who later defeated Christina Aguilera ...")
.toDF("text")
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = new nlp.SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = nlp.WordEmbeddingsModel.pretrained()
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val sentence_embeddings = new nlp.SentenceEmbeddings()
.setInputCols(Array("sentence","embeddings"))
.setOutputCol("sentence_embeddings")
val ner_model = legal.NerModel.pretrained("legner_orgs_prods_alias", "en", "legal/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
.setWhiteList("PER")
// Then the extracted entities can be disambiguated.
val disambiguator = new legal.NerDisambiguator()
// .setS3KnowledgeBaseName("i-per")
.setInputCols(Array("ner_chunk", "sentence_embeddings"))
.setOutputCol("disambiguation")
.setNumFirstChars(5)
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
sentence_embeddings,
ner_model,
ner_converter,
disambiguator))
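Each disambiguation result is a comma-separated list of candidate links, as shown above; a minimal sketch for splitting it into individual candidates (assuming the result DataFrame from the medical Python example above):
from pyspark.sql import functions as F
# Split the comma-separated disambiguation result into individual candidate links.
result.selectExpr("explode(disambiguation) AS d") \
    .select(F.col("d.metadata.chunk").alias("chunk"),
            F.split(F.col("d.result"), ", ").alias("candidate_links")) \
    .show(truncate=False)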
NerModel
This Named Entity recognition annotator is a generic NER model based on Neural Networks.
Pretrained models can be loaded with pretrained of the companion object:
val nerModel = medical.NerModel.pretrained()
.setInputCols("sentence", "token", "embeddings")
.setOutputCol("ner")
The default model is "ner_clinical", if no name is provided.
For available pretrained models please see the Models Hub. Additionally, pretrained pipelines are available for this module, see Pipelines.
Note that some pretrained models require specific types of embeddings, depending on which they were trained on. For example, the default model "ner_clinical" requires the clinical WordEmbeddings "embeddings_clinical".
For extended examples of usage, see the Spark NLP Workshop (sections starting with Training a Clinical NER).
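For reference, the same pretrained load in Python (a minimal sketch using the default model named above):
ner_model = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")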
Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS
Output Annotator Type: NAMED_ENTITY
Python API: MedicalNerModel | Scala API: MedicalNerModel |
Show Example
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
jsl_ner = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("jsl_ner")
jsl_ner_converter = nlp.NerConverter() \
.setInputCols(["sentence", "token", "jsl_ner"]) \
.setOutputCol("jsl_ner_chunk")
jsl_ner_pipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
jsl_ner,
jsl_ner_converter])
result = jsl_ner_pipeline.fit(data).transform(data)
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.BertEmbeddings.pretrained("bert_embeddings_sec_bert_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = finance.NerModel.pretrained("finner_headers", "en", "finance/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import *
documentAssembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_embeddings_legal_roberta_base","en") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("embeddings")
ner_model = legal.NerModel.pretrained("legner_headers", "en", "legal/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
ner_converter = nlp.NerConverter()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
ner_model,
ner_converter])
result = nlpPipeline.fit(data).transform(data)
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = nlp.SentenceDetectorDLModel
.pretrained("sentence_detector_dl_healthcare","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = nlp.WordEmbeddingsModel
.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val jsl_ner = medical.NerModel
.pretrained("ner_jsl", "en", "clinical/models")
.setInputCols(Array("sentence", "token","embeddings"))
.setOutputCol("jsl_ner")
val jsl_ner_converter = new nlp.NerConverter()
.setInputCols(Array("sentence", "token", "jsl_ner"))
.setOutputCol("jsl_ner_chunk")
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
jsl_ner,
jsl_ner_converter
))
val result = pipeline.fit(data).transform(data)
from johnsnowlabs import *
val documentAssembler = new nlp.DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = nlp.SentenceDetectorDLModel
.pretrained("sentence_detector_dl","xx")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new nlp.Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val embeddings = nlp.BertEmbeddings
.pretrained("bert_embeddings_sec_bert_base", "en")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = finance.NerModel
.pretrained