Sentence Detection in Somali Text

Description

SentenceDetectorDL (SDDL) is based on a general-purpose neural network model for sentence boundary detection, the task of identifying where each sentence in a text begins and ends. Many natural language processing tasks, such as part-of-speech tagging, dependency parsing, named entity recognition, and machine translation, take a sentence as their input unit, so accurate boundaries matter for everything downstream. This pretrained model detects sentences in Somali text.
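
To make the task concrete, here is a minimal rule-based baseline in plain Python. It is not how SDDL works: it only splits after '.', '?' or '!' followed by whitespace, and the function name naive_split is a hypothetical helper introduced for illustration. The sample is taken from the example text used on this page.

```python
import re

def naive_split(text):
    """Rule-based baseline (not SDDL): split after '.', '?' or '!' plus whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

# Excerpt from the Somali example text on this page
sample = (
    "Haddaba, sidee u hagaajin kartaa xirfadahaaga akhriska? "
    "Jawaabta su'aashan dhab ahaantii waa su'aal kale: "
    "Waa maxay isticmaalka xirfadaha akhriska?"
)

for sentence in naive_split(sample):
    print(sentence)
```

The baseline yields two pieces for this excerpt because it never breaks at the colon, whereas the model output reported under Results splits the same span into three sentences. Boundary decisions of that kind are where a learned detector can differ from fixed punctuation rules.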

How to use

# Python
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import SentenceDetectorDLModel
from pyspark.ml import PipelineModel

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Convert raw text into document annotations
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Load the pretrained Somali sentence detector
sentencerDL = SentenceDetectorDLModel\
    .pretrained("sentence_detector_dl", "so")\
    .setInputCols(["document"])\
    .setOutputCol("sentences")

# LightPipeline annotates plain strings without building a DataFrame
sd_model = LightPipeline(PipelineModel(stages=[documenter, sentencerDL]))
sd_model.fullAnnotate("""Raadinta il weyn oo ka mid ah cutubyada akhriska Ingiriisiga? Waxaad timid meeshii saxda ahayd Sida laga soo xigtay daraasad dhowaan la sameeyay, caadadii wax -akhriska ee dhallinyarada maanta ayaa si degdeg ah hoos ugu dhacaysa. Waxay diiradda saari karin cutubka akhriska Ingiriisiga ee la siiyay wax ka badan dhowr ilbiriqsi! Sidoo kale, akhrintu waxay ahayd qayb muhiim ah oo ka mid ah dhammaan imtixaannada tartanka. Haddaba, sidee u hagaajin kartaa xirfadahaaga akhriska? Jawaabta su'aashan dhab ahaantii waa su'aal kale: Waa maxay isticmaalka xirfadaha akhriska? Ujeeddada ugu weyn ee wax -akhrisku waa 'macno samayn'.""")



// Scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLModel
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark

// Convert raw text into document annotations
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Load the pretrained Somali sentence detector
val model = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "so")
    .setInputCols(Array("document"))
    .setOutputCol("sentences")

val pipeline = new Pipeline().setStages(Array(documenter, model))
val data = Seq("Raadinta il weyn oo ka mid ah cutubyada akhriska Ingiriisiga? Waxaad timid meeshii saxda ahayd Sida laga soo xigtay daraasad dhowaan la sameeyay, caadadii wax -akhriska ee dhallinyarada maanta ayaa si degdeg ah hoos ugu dhacaysa. Waxay diiradda saari karin cutubka akhriska Ingiriisiga ee la siiyay wax ka badan dhowr ilbiriqsi! Sidoo kale, akhrintu waxay ahayd qayb muhiim ah oo ka mid ah dhammaan imtixaannada tartanka. Haddaba, sidee u hagaajin kartaa xirfadahaaga akhriska? Jawaabta su'aashan dhab ahaantii waa su'aal kale: Waa maxay isticmaalka xirfadaha akhriska? Ujeeddada ugu weyn ee wax -akhrisku waa 'macno samayn'.").toDF("text")
val result = pipeline.fit(data).transform(data)

# Python (NLU)
import nlu

nlu.load('so.sentence_detector').predict("Raadinta il weyn oo ka mid ah cutubyada akhriska Ingiriisiga? Waxaad timid meeshii saxda ahayd Sida laga soo xigtay daraasad dhowaan la sameeyay, caadadii wax -akhriska ee dhallinyarada maanta ayaa si degdeg ah hoos ugu dhacaysa. Waxay diiradda saari karin cutubka akhriska Ingiriisiga ee la siiyay wax ka badan dhowr ilbiriqsi! Sidoo kale, akhrintu waxay ahayd qayb muhiim ah oo ka mid ah dhammaan imtixaannada tartanka. Haddaba, sidee u hagaajin kartaa xirfadahaaga akhriska? Jawaabta su'aashan dhab ahaantii waa su'aal kale: Waa maxay isticmaalka xirfadaha akhriska? Ujeeddada ugu weyn ee wax -akhrisku waa 'macno samayn'.", output_level='sentence')

Results

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[Raadinta il weyn oo ka mid ah cutubyada akhriska Ingiriisiga?]                                                                                                          |
|[Waxaad timid meeshii saxda ahayd Sida laga soo xigtay daraasad dhowaan la sameeyay, caadadii wax -akhriska ee dhallinyarada maanta ayaa si degdeg ah hoos ugu dhacaysa.]|
|[Waxay diiradda saari karin cutubka akhriska Ingiriisiga ee la siiyay wax ka badan dhowr ilbiriqsi!]                                                                     |
|[Sidoo kale, akhrintu waxay ahayd qayb muhiim ah oo ka mid ah dhammaan imtixaannada tartanka.]                                                                           |
|[Haddaba, sidee u hagaajin kartaa xirfadahaaga akhriska?]                                                                                                                |
|[Jawaabta su'aashan dhab ahaantii waa su'aal kale:]                                                                                                                      |
|[Waa maxay isticmaalka xirfadaha akhriska?]                                                                                                                              |
|[Ujeeddada ugu weyn ee wax -akhrisku waa 'macno samayn'.]                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


Model Information

Model Name: sentence_detector_dl
Compatibility: Spark NLP 3.2.0+
License: Open Source
Edition: Official
Input Labels: [document]
Output Labels: [sentences]
Language: so

Benchmarking

Accuracy:      0.98
Recall:        1.00
Precision:     0.96
F1:            0.98
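
As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the helper name f1_score is introduced here for illustration and is not part of Spark NLP.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported in the Benchmarking section
precision = 0.96
recall = 1.00

print(round(f1_score(precision, recall), 2))  # prints 0.98
```

The result, 2 * 0.96 * 1.00 / 1.96 ≈ 0.98, agrees with the reported F1.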