package annotators
Type Members
- class Chunk2Token extends AnnotatorModel[Chunk2Token] with HasSimpleAnnotate[Chunk2Token]
A feature transformer that converts the input array of strings (annotatorType CHUNK) into an array of chunk-based tokens (annotatorType TOKEN). When the input is empty, an empty array is returned.
This annotator is especially convenient when using NGramGenerator annotations as inputs to WordEmbeddingsModel.
Example
Define a pipeline for generating n-grams
val data = Seq(("A 63-year-old man presents to the hospital ...")).toDF("text")
val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new SentenceDetector().setInputCols("document").setOutputCol("sentence")
val token = new Tokenizer().setInputCols("sentence").setOutputCol("token")
val ngrammer = new NGramGenerator()
  .setN(2)
  .setEnableCumulative(false)
  .setInputCols("token")
  .setOutputCol("ngrams")
  .setDelimiter("_")
Stage to convert n-gram CHUNKS to TOKEN type
val chunk2Token = new Chunk2Token().setInputCols("ngrams").setOutputCol("ngram_tokens")
val trainingPipeline = new Pipeline()
  .setStages(Array(document, sentenceDetector, token, ngrammer, chunk2Token))
  .fit(data)
val result = trainingPipeline.transform(data).cache()
result.selectExpr("explode(ngram_tokens)").show(5, false)
+----------------------------------------------------------------+
|col                                                             |
+----------------------------------------------------------------+
|{token, 3, 15, A_63-year-old, {sentence -> 0, chunk -> 0}, []}  |
|{token, 5, 19, 63-year-old_man, {sentence -> 0, chunk -> 1}, []}|
|{token, 17, 28, man_presents, {sentence -> 0, chunk -> 2}, []}  |
|{token, 21, 31, presents_to, {sentence -> 0, chunk -> 3}, []}   |
|{token, 30, 35, to_the, {sentence -> 0, chunk -> 4}, []}        |
+----------------------------------------------------------------+
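Conceptually, the CHUNK-to-TOKEN conversion described above can be sketched in plain Scala. The `Annotation` case class below is a simplified stand-in for Spark NLP's annotation structure, for illustration only; the real annotator's internals may differ.

```scala
// Toy sketch: each chunk keeps its span, text, and metadata; only the
// annotator type changes from "chunk" to "token". Empty input -> empty output.
case class Annotation(annotatorType: String,
                      begin: Int,
                      end: Int,
                      result: String,
                      metadata: Map[String, String])

object Chunk2TokenSketch {
  def annotate(chunks: Seq[Annotation]): Seq[Annotation] =
    chunks.map(c => c.copy(annotatorType = "token"))

  def main(args: Array[String]): Unit = {
    val chunks = Seq(
      Annotation("chunk", 3, 15, "A_63-year-old", Map("sentence" -> "0", "chunk" -> "0")))
    val tokens = annotate(chunks)
    println(tokens.head.annotatorType)
    assert(annotate(Nil).isEmpty) // empty input returns an empty array
  }
}
```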
- class DocumentFiltererByClassifier extends AnnotatorModel[DocumentFiltererByClassifier] with HasSimpleAnnotate[DocumentFiltererByClassifier] with WhiteAndBlackListParams
Filters documents by the results of classifier annotators. Documents are filtered against a white list and a black list: the white list contains the classifier results that are allowed to pass the filter, and the black list contains the classifier results that are not. By default the filter is case sensitive; if caseSensitive is set to false, matching is case-insensitive.
Example
val text = """British Department of Health confirms first two cases of in UK.
  |So my trip to visit my australian exchange student just got canceled because of Coronavirus.
  |I wish everyone to be safe at home and stop pandemic.""".stripMargin
val documentAssembler = new DocumentAssembler()
  .setInputCol("text").setOutputCol("document")
val sentenceDetector = new SentenceDetector()
  .setInputCols("document").setOutputCol("sentence")
val tokenizer = new Tokenizer()
  .setInputCols("sentence").setOutputCol("token")
val medicalBFSC = MedicalBertForSequenceClassification
  .pretrained("bert_sequence_classifier_covid_sentiment", "en", "clinical/models")
  .setInputCols("sentence", "token").setOutputCol("classifier")
val documentFilterer = new DocumentFiltererByClassifier()
  .setInputCols("sentence", "classifier").setOutputCol("filteredDocuments")
  .setWhiteList("positive")
  .setCaseSensitive(false)

import spark.implicits._
val textDF = Seq(text).toDF("text")
val result = new Pipeline().setStages(Array(
  documentAssembler, sentenceDetector, tokenizer, medicalBFSC, documentFilterer
)).fit(textDF).transform(textDF)

result.selectExpr("explode(filteredDocuments)").show(false)
+------------------------------------------------------------------------------------------------+
|col                                                                                             |
+------------------------------------------------------------------------------------------------+
|{document, 159, 211, I wish everyone to be safe at home and stop pandemic., {sentence -> 2}, []}|
+------------------------------------------------------------------------------------------------+
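The white-list/black-list/case-sensitivity semantics described above can be sketched as a small pure-Scala predicate. This is an illustrative toy, not the library's internals; names and the empty-white-list behavior (allow all) are assumptions for the sketch.

```scala
// Toy sketch of white/black-list filtering: a document passes when its
// classifier label is on the white list (or the white list is empty) and
// is not on the black list. Case folding is applied when caseSensitive = false.
object DocumentFilterSketch {
  def keep(label: String,
           whiteList: Set[String],
           blackList: Set[String],
           caseSensitive: Boolean): Boolean = {
    val norm: String => String = if (caseSensitive) identity else _.toLowerCase
    val l = norm(label)
    val white = whiteList.map(norm)
    val black = blackList.map(norm)
    (white.isEmpty || white.contains(l)) && !black.contains(l)
  }

  def main(args: Array[String]): Unit = {
    // "Positive" matches the white-list entry "positive" only case-insensitively
    assert(keep("Positive", Set("positive"), Set.empty, caseSensitive = false))
    assert(!keep("Positive", Set("positive"), Set.empty, caseSensitive = true))
    // black-listed labels are dropped
    assert(!keep("negative", Set.empty, Set("negative"), caseSensitive = false))
  }
}
```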
- class DrugNormalizer extends AnnotatorModel[DrugNormalizer] with HasSimpleAnnotate[DrugNormalizer] with CheckLicense
Annotator which normalizes raw text from clinical documents, e.g. scraped web pages or XML documents, from document-type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can remove unwanted characters according to a specific removal policy, and can apply lowercase normalization.
See Spark NLP Workshop for more examples of usage.
Example
val data = Seq(
  ("Sodium Chloride/Potassium Chloride 13bag"),
  ("interferon alfa-2b 10 million unit ( 1 ml ) injec"),
  ("aspirin 10 meq/ 5 ml oral sol")
).toDF("text")
val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val drugNormalizer = new DrugNormalizer().setInputCols("document").setOutputCol("document_normalized")
val trainingPipeline = new Pipeline().setStages(Array(document, drugNormalizer))
val result = trainingPipeline.fit(data).transform(data)
result.selectExpr("explode(document_normalized.result) as normalized_text").show(false)
+----------------------------------------------------+
|normalized_text                                     |
+----------------------------------------------------+
|Sodium Chloride / Potassium Chloride 13 bag         |
|interferon alfa - 2b 10000000 unt ( 1 ml ) injection|
|aspirin 2 meq/ml oral solution                      |
+----------------------------------------------------+
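One normalization rule visible in the output above, splitting a number fused to its unit ("13bag" becomes "13 bag"), can be sketched with a regex in plain Scala. This is a toy illustration of a single rule; the real annotator applies many more transformations (abbreviation expansion, dosage rewriting, dirty-character removal), and its actual rule set is not shown here.

```scala
// Toy sketch of one drug-text normalization rule: insert a space between
// a run of digits and the letters fused to it, e.g. "13bag" -> "13 bag".
object DrugNormalizerSketch {
  private val NumberUnit = raw"(\d+)([a-zA-Z]+)".r

  def splitNumberUnit(text: String): String =
    NumberUnit.replaceAllIn(text, m => s"${m.group(1)} ${m.group(2)}")

  def main(args: Array[String]): Unit = {
    println(splitNumberUnit("Sodium Chloride 13bag"))
    assert(splitNumberUnit("Sodium Chloride 13bag") == "Sodium Chloride 13 bag")
    // text without fused number-unit pairs is left untouched
    assert(splitNumberUnit("oral sol") == "oral sol")
  }
}
```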
- class Flattener extends Transformer with DefaultParamsWritable
Converts annotation results into exploded and flattened format. It is useful for extracting the results from Spark NLP pipelines. The Flattener outputs annotation values as String columns.
Example
val dataSet = Seq("GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast." +
  " He is alert and oriented and answering questions appropriately.\nHEENT: Sclerae showed mild arcus senilis in the right." +
  " Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear." +
  "\nNECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, " +
  "supraclavicular, or axillary areas.\nABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, " +
  "although I do not appreciate a true spleen with inspiration.\nEXTREMITIES: There is some edema, but no cyanosis and "
).toDS.toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new SentenceDetector().setInputCols(Array("document")).setOutputCol("sentence")
val tokenizer = new Tokenizer().setInputCols(Array("sentence")).setOutputCol("token")
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("sentence", "token")).setOutputCol("embeddings")
val clinicalNer = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings")).setOutputCol("ner")
val nerConverter = new NerConverter().setInputCols(Array("sentence", "token", "ner")).setOutputCol("ner_chunk")
val clinicalAssertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models")
  .setInputCols(Array("sentence", "ner_chunk", "embeddings")).setOutputCol("assertion")
  .setEntityAssertionCaseSensitive(false)
val flattener = new Flattener()
  .setInputCols("sentence", "ner_chunk", "assertion")
  .setExplodeSelectedFields(Map(
    "ner_chunk" -> Array("result", "metadata.entity"),
    "assertion" -> Array("result", "metadata.confidence")))
val pipeline = new Pipeline().setStages(Array(
  documentAssembler, sentenceDetector, tokenizer, wordEmbeddings,
  clinicalNer, nerConverter, clinicalAssertion, flattener))
val result = pipeline.fit(dataSet).transform(dataSet)
result.show(false)
+----------------------------------+-------------------------+----------------+-----------------------------+
|ner_chunk_result                  |ner_chunk_metadata_entity|assertion_result|assertion_metadata_confidence|
+----------------------------------+-------------------------+----------------+-----------------------------+
|distress                          |Symptom                  |Absent          |1.0                          |
|arcus senilis                     |Disease_Syndrome_Disorder|Past            |1.0                          |
|jugular venous pressure distention|Symptom                  |Absent          |1.0                          |
|adenopathy                        |Symptom                  |Absent          |1.0                          |
|tender                            |Symptom                  |Absent          |1.0                          |
|fullness                          |Symptom                  |Possible        |0.9999                       |
|edema                             |Symptom                  |Present         |1.0                          |
|cyanosis                          |VS_Finding               |Absent          |1.0                          |
+----------------------------------+-------------------------+----------------+-----------------------------+
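The output column names in the table above follow a recognizable pattern: the input column name joined to the selected field, with dots in nested fields replaced by underscores (e.g. `metadata.entity` becomes `ner_chunk_metadata_entity`). A minimal sketch of that naming convention, as a hypothetical helper rather than the transformer's actual implementation:

```scala
// Toy sketch of the Flattener's apparent column-naming convention:
// "<inputCol>_<field>" with "." in nested field paths replaced by "_".
object FlattenerSketch {
  def columnName(inputCol: String, field: String): String =
    s"${inputCol}_${field.replace(".", "_")}"

  def main(args: Array[String]): Unit = {
    println(columnName("ner_chunk", "metadata.entity"))
    assert(columnName("ner_chunk", "result") == "ner_chunk_result")
    assert(columnName("assertion", "metadata.confidence") == "assertion_metadata_confidence")
  }
}
```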
Value Members
- object DocumentFiltererByClassifier extends DefaultParamsReadable[DocumentFiltererByClassifier] with Serializable
This is the companion object of DocumentFiltererByClassifier. Please refer to that class for the documentation.
- object DrugNormalizer extends DefaultParamsReadable[DrugNormalizer] with Serializable
This is the companion object of DrugNormalizer. Please refer to that class for the documentation.
- object Flattener extends DefaultParamsReadable[Flattener] with Serializable
This is the companion object of Flattener. Please refer to that class for the documentation.