Description
This model was imported from Hugging Face and was trained on the NeuSpell corpus to detect typos. It uses DistilBERT embeddings with DistilBertForTokenClassification, treating typo detection as an NER-style token classification task. Tokens identified as typos are labeled PO.
Predicted Entities
PO
How to use
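import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetectorDLModel, Tokenizer,
    DistilBertForTokenClassification, NerConverter,
)
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP (skip if `spark` already exists)
spark = sparknlp.start()

# Wrap the raw text column into Spark NLP documents
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")
# Tag each token; typo tokens receive the PO label
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "en")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")
# Merge tagged tokens into chunks
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])
text = """He had also stgruggled with addiction during his tine in Congress."""
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)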
Results
+------------+---------+
|chunk       |ner_label|
+------------+---------+
|stgruggled  |PO       |
|tine        |PO       |
+------------+---------+
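As a rough illustration of how the detected chunks might be used downstream, the sketch below marks each one in the original sentence. It is plain Python and does not touch Spark: the sentence and the (chunk, label) pairs are copied by hand from the example above, and `mark_typos` is a hypothetical helper, not part of Spark NLP. In a real pipeline the chunk annotations carry begin and end offsets, which would be more robust than the string search assumed here.

```python
# Sentence and detected chunks copied from the example above (hard-coded assumption).
text = "He had also stgruggled with addiction during his tine in Congress."
detected = [("stgruggled", "PO"), ("tine", "PO")]


def mark_typos(sentence, chunks):
    """Wrap each detected chunk in [[...]], scanning the sentence left to right."""
    pieces = []
    cursor = 0
    for chunk, _label in chunks:
        start = sentence.find(chunk, cursor)
        if start == -1:
            continue  # chunk not found after the cursor; leave the text as-is
        pieces.append(sentence[cursor:start])
        pieces.append("[[" + chunk + "]]")
        cursor = start + len(chunk)
    pieces.append(sentence[cursor:])
    return "".join(pieces)


marked = mark_typos(text, detected)
print(marked)
```

The chunks are assumed to be listed in the order they appear in the sentence, which matches the Results table.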
Model Information
Model Name: | distilbert_token_classifier_typo_detector |
Compatibility: | Spark NLP 3.3.4+ |
License: | Open Source |
Edition: | Official |
Input Labels: | [sentence, token] |
Output Labels: | [ner] |
Language: | en |
Size: | 244.1 MB |
Case sensitive: | true |
Max sentence length: | 256 |
Data Source
https://github.com/neuspell/neuspell
Benchmarking
label         precision  recall    f1-score  support
micro-avg     0.992332   0.985997  0.989154  416054.0
macro-avg     0.992332   0.985997  0.989154  416054.0
weighted-avg  0.992332   0.985997  0.989154  416054.0
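As a quick sanity check on the figures above, the f1-score should be the harmonic mean of precision and recall. The sketch below recomputes it from the reported values; the variable names are illustrative only. The three averages being identical is consistent with there being a single label (PO), though the source does not state this.

```python
# Values copied from the benchmarking table above.
precision = 0.992332
recall = 0.985997
reported_f1 = 0.989154

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"recomputed f1 = {f1:.6f}, reported f1 = {reported_f1:.6f}")
```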