Spell Checker for Drug Names (Norvig)

Description

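This model corrects spelling mistakes in drug names using the Symmetric Delete spelling correction algorithm. Instead of generating every possible deletion, insertion, substitution, and transposition of a misspelled token, the algorithm generates only deletions and matches them against precomputed deletions of the dictionary terms, which reduces the cost of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance.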
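
The idea behind the delete-only lookup can be illustrated with a minimal pure-Python sketch. This is not the model's actual implementation: the function names (`deletes`, `build_index`, `osa_distance`, `correct`), the four-term vocabulary, and the tie-breaking rule are assumptions made for illustration only, and the sketch ignores word frequencies.

```python
def deletes(word, max_distance):
    """Return the word plus every string made by deleting up to max_distance characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def build_index(vocabulary, max_distance=2):
    """Map every delete variant of every dictionary term back to the terms that produce it."""
    index = {}
    for term in vocabulary:
        for variant in deletes(term, max_distance):
            index.setdefault(variant, set()).add(term)
    return index


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[rows - 1][cols - 1]


def correct(word, index, max_distance=2):
    """Return the closest dictionary term within max_distance, or the word unchanged."""
    best = None
    for variant in deletes(word, max_distance):
        for term in index.get(variant, ()):
            # A shared delete variant only proposes a candidate; verify the true distance.
            dist = osa_distance(word, term)
            if dist <= max_distance and (best is None or (dist, term) < best):
                best = (dist, term)
    return best[1] if best else word


# Hypothetical toy vocabulary; the real model ships its own drug-name dictionary.
vocabulary = ["Ambrosia", "Odactra", "Grastek", "Lastacaft"]
index = build_index(vocabulary)

for token in ["Amrosia", "Oactra", "Grastk", "take"]:
    print(token, "->", correct(token, index))
```

The index is built once from the dictionary, so each lookup only has to enumerate the deletions of the input token rather than all of its possible edits.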

How to use

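# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, NorvigSweetingModel

# Assumes an active Spark session (`spark`) started with Healthcare NLP.

# Wrap the raw text column in a document annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into tokens
tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# Load the pretrained drug-name spell checker
spell = NorvigSweetingModel.pretrained("spellcheck_drug_norvig", "en", "clinical/models")\
    .setInputCols(["token"])\
    .setOutputCol("corrected_token")

pipeline = Pipeline(
    stages = [
        documentAssembler,
        tokenizer,
        spell
        ])

# No stage needs training data, so the pipeline is fitted on an empty DataFrame
model = pipeline.fit(spark.createDataFrame([['']]).toDF('text'))

text = "You have to take Amrosia artemisiifoli, Oactra and a bit of Grastk and lastacaf"
test_df = spark.createDataFrame([[text]]).toDF("text")
result = model.transform(test_df)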
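
// Scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel
import org.apache.spark.ml.Pipeline
import spark.implicits._  // needed for Seq(...).toDS; assumes an active SparkSession named `spark`

// Wrap the raw text column in a document annotation
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Split each document into tokens
val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

// Load the pretrained drug-name spell checker
val spell = NorvigSweetingModel.pretrained("spellcheck_drug_norvig", "en", "clinical/models")
    .setInputCols(Array("token"))
    .setOutputCol("corrected_token")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, spell))

val data = Seq("You have to take Amrosia artemisiifoli, Oactra and a bit of Grastk and lastacaf").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)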

Results

Original Text: 
You have to take Amrosia artemisiifoli , Oactra and a bit of Grastk and lastacaf  

Corrected Text: 
You have to take Ambrosia artemisiifolia , Odactra and a bit of Grastek and lastacaft
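
To see which tokens the model changed, the original and corrected token lists can be compared position by position. The sketch below is an illustration and not part of the model's API: `changed_tokens` is a hypothetical helper, and it assumes the two lists are aligned one-to-one, as they are in the example above. In a Spark session the lists would typically be collected from the `token.result` and `corrected_token.result` columns of the transformed DataFrame; here they are taken from the text above so the snippet runs without Spark.

```python
def changed_tokens(original, corrected):
    """Return (original, corrected) pairs for every position where the tokens differ."""
    if len(original) != len(corrected):
        raise ValueError("token lists must have the same length")
    return [(o, c) for o, c in zip(original, corrected) if o != c]


# Token lists copied from the Results section above (whitespace-separated).
original = "You have to take Amrosia artemisiifoli , Oactra and a bit of Grastk and lastacaf".split()
corrected = "You have to take Ambrosia artemisiifolia , Odactra and a bit of Grastek and lastacaft".split()

for before, after in changed_tokens(original, corrected):
    print(f"{before} -> {after}")
```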

Model Information

Model Name: spellcheck_drug_norvig
Compatibility: Healthcare NLP 4.4.0+
License: Licensed
Edition: Official
Input Labels: [token]
Output Labels: [spell]
Language: en
Size: 4.5 MB
Case sensitive: true