Typo Detector

Description

This model was imported from Hugging Face (link) and trained on the NeuSpell corpus to detect typos. It leverages DistilBERT embeddings with DistilBertForTokenClassification for token-level classification, labeling typo tokens as PO.

Predicted Entities

PO


How to use

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetectorDLModel,
    Tokenizer,
    DistilBertForTokenClassification,
    NerConverter,
)
from pyspark.ml import Pipeline

spark = sparknlp.start()

documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")

tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")

tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "en")\
.setInputCols(["sentence", "token"])\
.setOutputCol("ner")

ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])
text = """He had also stgruggled with addiction during his tine in Congress."""
data = spark.createDataFrame([[text]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)

Results

+------------+---------+
|chunk       |ner_label|
+------------+---------+
|stgruggled  |PO       |
|tine        |PO       |
+------------+---------+
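The table above is a flattened view of the `ner_chunk` annotations. A minimal pure-Python sketch of that flattening step is shown below; it uses hard-coded dicts that imitate the Spark NLP annotation layout (`result` text plus `metadata["entity"]`) instead of a live Spark session, and the `chunks_to_pairs` helper is a hypothetical name, not part of the library.

```python
# Sketch: flatten chunk annotations into (chunk, ner_label) pairs.
# In a real pipeline, such annotations would come from the `ner_chunk`
# column of the transformed DataFrame.

def chunks_to_pairs(annotations):
    """Return (chunk_text, ner_label) tuples from chunk-style annotation dicts."""
    return [(a["result"], a["metadata"]["entity"]) for a in annotations]

sample = [
    {"result": "stgruggled", "metadata": {"entity": "PO"}},
    {"result": "tine", "metadata": {"entity": "PO"}},
]

print(chunks_to_pairs(sample))
# → [('stgruggled', 'PO'), ('tine', 'PO')]
```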

Model Information

Model Name: distilbert_token_classifier_typo_detector
Compatibility: Spark NLP 3.3.4+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: en
Size: 244.1 MB
Case sensitive: true
Max sentence length: 256

Data Source

https://github.com/neuspell/neuspell

Benchmarking

label        precision  recall    f1-score  support
micro-avg    0.992332   0.985997  0.989154  416054.0
macro-avg    0.992332   0.985997  0.989154  416054.0
weighted-avg 0.992332   0.985997  0.989154  416054.0