Medical Spell Checker

Description

Contextual Spell Checker is a sequence-to-sequence model that detects and corrects spelling errors in input text. It uses a Levenshtein automaton to generate candidate corrections and a neural language model to rank them. The model was trained on a dataset drawn from several sources: MTSamples, i2b2 clinical notes, and PubMed. It comes fully pretrained and ready to use, but you can still customize it without retraining a new model from scratch by providing custom definitions for the word classes it was trained on, namely Dates, Numbers, Ages, Units, and Medications.
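The two-stage idea described above (generate candidates by edit distance, then rank them by context) can be illustrated with a minimal, self-contained sketch. This is not the model's implementation: a brute-force distance filter stands in for the Levenshtein automaton, a tiny hand-written bigram table stands in for the neural language model, and every name here (`VOCAB`, `BIGRAMS`, `candidates`, `correct`) is hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Toy vocabulary and bigram counts (assumed values, for illustration only).
VOCAB: List[str] = ["no", "cute", "acute", "cut", "care", "distress"]
BIGRAMS: Dict[Tuple[str, str], int] = {
    ("no", "acute"): 40, ("acute", "distress"): 90,
    ("no", "cut"): 3, ("no", "cute"): 1,
}


def candidates(token: str, vocab: List[str], max_dist: int = 1) -> List[str]:
    """Stage 1: keep vocabulary words within max_dist edits of the token."""
    return [w for w in vocab if edit_distance(token.lower(), w) <= max_dist]


def correct(prev_word: str, token: str, next_word: str) -> str:
    """Stage 2: pick the candidate that best fits its left and right context."""
    def score(w: str) -> int:
        return BIGRAMS.get((prev_word, w), 0) + BIGRAMS.get((w, next_word), 0)

    options = candidates(token, VOCAB) or [token]  # fall back to the input token
    return max(options, key=score)


print(correct("no", "cute", "distress"))
```

Note that "cute" is itself a valid word; only the surrounding context makes "acute" the better choice, which is why a context-aware ranker is needed on top of candidate generation.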


How to use

To use this model, set up a pipeline and feed it tokens.

# Assumes an active SparkSession named `spark` with the licensed
# Spark NLP library loaded.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher, LightPipeline
from sparknlp.annotator import RecursiveTokenizer, ContextSpellCheckerModel

# Turn raw text into document annotations
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into tokens, peeling off the listed prefixes and suffixes
tokenizer = RecursiveTokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")\
    .setPrefixes(["\"", "(", "[", "\n"])\
    .setSuffixes([".", ",", "?", ")", "!", "'s"])

# Load the pretrained clinical spell checker
spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_clinical', 'en', 'clinical/models')\
    .setInputCols("token")\
    .setOutputCol("checked")

# Convert annotations to plain strings
finisher = Finisher()\
    .setInputCols("checked")

pipeline = Pipeline(
    stages = [
        documentAssembler,
        tokenizer,
        spellModel,
        finisher
    ])

# Fit on an empty DataFrame; all stages are pretrained or rule-based
empty_ds = spark.createDataFrame([[""]]).toDF("text")
lp = LightPipeline(pipeline.fit(empty_ds))

example = ["Witth the hell of phisical terapy the patient was imbulated and on posoperative, the impatient tolerating a post curgical soft diet.",
           "With paint wel controlled on orall pain medications, she was discharged too reihabilitation facilitay.",
           "She is to also call the ofice if she has any ever greater than 101, or leeding form the surgical wounds.",
           "Abdomen is sort, nontender, and nonintended.",
           "Patient not showing pain or any wealth problems.",
           "No cute distress"]
lp.annotate(example)

import nlu

# `example` is the list of sentences defined above
nlu.load("en.spell.clinical").predict(example)

Results


[{'checked': ['With',
'the',
'help',
'of',
'physical',
'therapy',
'the',
'patient',
'was',
'ambulated',
'and',
'on',
'postoperative',
',',
'the',
'patient',
'tolerating',
'a',
'post',
'surgical',
'soft',
'diet',
'.']},
{'checked': ['With',
'pain',
'well',
'controlled',
'on',
'oral',
'pain',
'medications',
',',
'she',
'was',
'discharged',
'to',
'rehabilitation',
'facility',
'.']},
{'checked': ['She',
'is',
'to',
'also',
'call',
'the',
'office',
'if',
'she',
'has',
'any',
'fever',
'greater',
'than',
'101',
',',
'or',
'bleeding',
'from',
'the',
'surgical',
'wounds',
'.']},
{'checked': ['Abdomen',
'is',
'soft',
',',
'nontender',
',',
'and',
'nondistended',
'.']},
{'checked': ['Patient',
'not',
'showing',
'pain',
'or',
'any',
'health',
'problems',
'.']},
{'checked': ['No', 'acute', 'distress']}]
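The output above is a list of token lists, so a small post-processing step is needed to get readable sentences back. The following is a minimal sketch in plain Python, not part of Spark NLP; the `detokenize` helper and its spacing rules are assumptions chosen to mirror the tokenizer's prefixes and suffixes.

```python
from typing import Dict, List

# Tokens that attach to the previous token (the tokenizer's suffixes)
NO_SPACE_BEFORE = {".", ",", "?", ")", "!", "'s"}
# Tokens that attach to the following token (opening brackets)
NO_SPACE_AFTER = {"(", "["}


def detokenize(tokens: List[str]) -> str:
    """Join tokens with spaces, except around attached punctuation."""
    pieces: List[str] = []
    for tok in tokens:
        attach = tok in NO_SPACE_BEFORE or (bool(pieces) and pieces[-1] in NO_SPACE_AFTER)
        if pieces and not attach:
            pieces.append(" ")
        pieces.append(tok)
    return "".join(pieces)


# Two entries copied from the results above
results: List[Dict[str, List[str]]] = [
    {"checked": ["Abdomen", "is", "soft", ",", "nontender", ",",
                 "and", "nondistended", "."]},
    {"checked": ["No", "acute", "distress"]},
]

sentences = [detokenize(r["checked"]) for r in results]
for s in sentences:
    print(s)
```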

Model Information

Model Name: spellcheck_clinical
Compatibility: Spark NLP 2.7.2+
License: Licensed
Edition: Official
Input Labels: [token]
Language: en

Data Source

MTSamples, an augmented version of the i2b2 clinical notes, and PubMed.