Named Entity Recognition for Thai (GloVe 840B 300d)

Description

This model annotates named entities in text and can be used to find features such as the names of people, places, and organizations. The model does not read words directly. Instead, it reads word embeddings, which represent words as points in a vector space such that semantically similar words lie closer together.

This model takes the pre-trained glove_840B_300 embeddings from the WordEmbeddings annotator as input, so be sure to use the same embeddings in the pipeline.
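The idea that similar words sit closer together in embedding space can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration and are not taken from glove_840B_300 (whose vectors have 300 dimensions). Cosine similarity is used here as one common closeness measure, which is an assumption rather than something this model card specifies.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for this example only.
toy_embeddings = {
    "paris": [0.9, 0.1, 0.2],
    "london": [0.8, 0.2, 0.3],
    "banana": [0.1, 0.9, 0.7],
}

sim_cities = cosine_similarity(toy_embeddings["paris"], toy_embeddings["london"])
sim_mixed = cosine_similarity(toy_embeddings["paris"], toy_embeddings["banana"])

# The two city vectors point in nearly the same direction,
# so their similarity is higher than that of the unrelated pair.
print(sim_cities > sim_mixed)  # True
```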

Predicted Entities

BRN (brands), DES (designations: positions or job titles), DTM (dates and times), LOC (locations), MEA (measurements), NAME (names), NUM (numbers), ORG (organizations), PER (persons), TRM (terminology), TTL (titles).

How to use

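# Python
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Wrap the raw text column in Spark NLP's document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Thai has no spaces between words, so a segmenter model produces the tokens
word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th") \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Must be the same embeddings the NER model was trained with
embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Input columns follow the model's input labels: sentence, token, embeddings
ner = NerDLModel.pretrained("ner_lst20_glove_840B_300d", "th") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")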
...
pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter])
example = spark.createDataFrame([['Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส']], ["text"])
result = pipeline.fit(example).transform(example)

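// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for toDF; assumes an active SparkSession named spark

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_best", "th")
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// Must be the same embeddings the NER model was trained with
val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

// Input columns follow the model's input labels: sentence, token, embeddings
val ner = NerDLModel.pretrained("ner_lst20_glove_840B_300d", "th")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")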
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter))
val data = Seq("Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส").toDF("text")
val result = pipeline.fit(data).transform(data)

import nlu

text = ["""Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส"""]
ner_df = nlu.load('th.ner.lst20.glove_840B_300D').predict(text, output_level='token')
ner_df

Results

+----------+-----+
|token     |ner  |
+----------+-----+
|Mona      |B_PER|
|Lisa      |E_PER|
|เป็น       |O    |
|ภาพวาด    |O    |
|สีน้ำมัน    |O    |
|ใน        |O    |
|ศตวรรษ    |O    |
|ที่         |O    |
|16        |O    |
|ที่        |O    |
|สร้าง      |O    |
|โดย       |O    |
|Leonardo  |B_PER|
|จัด        |O    |
|ขึ้น        |O    |
|ที่         |O    |
|พิพิธภัณฑ์  |O    |
|ลูฟร์       |O    |
|ใน        |O    |
|ปารีส      |O    |
+----------+-----+
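The token-level tags in the table can be merged into whole entities. The sketch below is a plain-Python illustration, not the Spark NLP converter stage, and it assumes that the B_, I_, and E_ prefixes mark the beginning, inside, and end of an entity span and that O marks tokens outside any entity. The helper name `group_entities` and its `sep` parameter are hypothetical. Tokens are joined with a space, which suits the Latin-script names in this example; Thai tokens would normally be joined with an empty separator.

```python
def group_entities(tagged_tokens, sep=" "):
    """Merge (token, tag) pairs into (text, label) entity spans."""
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Close the span being built, if any, and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens:
            entities.append((sep.join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in tagged_tokens:
        if tag == "O":
            flush()
            continue
        prefix, _, label = tag.partition("_")
        # A B_ tag or a change of label starts a new span.
        if prefix == "B" or label != current_label:
            flush()
            current_label = label
        current_tokens.append(token)
        # An E_ tag closes the span immediately.
        if prefix == "E":
            flush()
    flush()
    return entities

# A few rows copied from the results table above.
rows = [
    ("Mona", "B_PER"), ("Lisa", "E_PER"), ("เป็น", "O"),
    ("โดย", "O"), ("Leonardo", "B_PER"), ("จัด", "O"),
]
print(group_entities(rows))  # [('Mona Lisa', 'PER'), ('Leonardo', 'PER')]
```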

Model Information

Model Name:	ner_lst20_glove_840B_300d
Type:	ner
Compatibility:	Spark NLP 2.7.0+
License:	Open Source
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	th

Data Source

The model was trained on the LST20 Corpus from the National Electronics and Computer Technology Center (NECTEC).

Benchmarking

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| B_BRN        | 0.26      | 0.23   | 0.24     | 47      |
| B_DES        | 0.92      | 0.89   | 0.91     | 1176    |
| B_DTM        | 0.83      | 0.81   | 0.82     | 1329    |
| B_LOC        | 0.75      | 0.70   | 0.72     | 2344    |
| B_MEA        | 0.76      | 0.80   | 0.78     | 3155    |
| B_NUM        | 0.71      | 0.59   | 0.64     | 1240    |
| B_ORG        | 0.79      | 0.78   | 0.78     | 4248    |
| B_PER        | 0.92      | 0.92   | 0.92     | 3269    |
| B_TRM        | 0.87      | 0.77   | 0.81     | 128     |
| B_TTL        | 0.97      | 0.98   | 0.98     | 1379    |
| E_BRN        | 0.86      | 0.75   | 0.80     | 8       |
| E_DES        | 0.94      | 0.82   | 0.88     | 198     |
| E_DTM        | 0.80      | 0.79   | 0.80     | 1151    |
| E_LOC        | 0.71      | 0.70   | 0.71     | 851     |
| E_MEA        | 0.69      | 0.77   | 0.73     | 830     |
| E_NUM        | 0.80      | 0.61   | 0.69     | 79      |
| E_ORG        | 0.80      | 0.76   | 0.78     | 2090    |
| E_PER        | 0.93      | 0.96   | 0.94     | 1586    |
| E_TRM        | 0.33      | 0.17   | 0.22     | 12      |
| I_BRN        | 0.75      | 0.60   | 0.67     | 5       |
| I_DES        | 0.79      | 0.63   | 0.70     | 204     |
| I_DTM        | 0.92      | 0.86   | 0.89     | 2969    |
| I_LOC        | 0.47      | 0.46   | 0.47     | 462     |
| I_MEA        | 0.64      | 0.74   | 0.69     | 935     |
| I_NUM        | 0.87      | 0.71   | 0.78     | 115     |
| I_ORG        | 0.81      | 0.75   | 0.78     | 3015    |
| I_PER        | 0.93      | 0.95   | 0.94     | 1604    |
| I_TRM        | 0.40      | 0.13   | 0.20     | 15      |
| I_TTL        | 0.67      | 0.50   | 0.57     | 4       |
| accuracy     |           |        | 0.95     | 207278  |
| macro avg    | 0.76      | 0.71   | 0.73     | 207278  |
| weighted avg | 0.95      | 0.95   | 0.95     | 207278  |
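As a reading aid for the table, the sketch below assumes the usual definitions: F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. It uses only four rows copied from the table, so the two averages it computes will not match the table's own averages, which cover all classes. Because the table values are rounded to two decimals, recomputed F1 scores are compared with a small tolerance.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported f1, support) for four rows of the table.
subset = {
    "B_BRN": (0.26, 0.23, 0.24, 47),
    "B_PER": (0.92, 0.92, 0.92, 3269),
    "E_TRM": (0.33, 0.17, 0.22, 12),
    "I_TTL": (0.67, 0.50, 0.57, 4),
}

# Recompute F1 from precision and recall and compare with the reported value.
for label, (p, r, reported, _) in subset.items():
    print(label, round(f1_score(p, r), 2), reported)

# Macro average: every class counts equally.
macro_f1 = sum(f1 for _, _, f1, _ in subset.values()) / len(subset)

# Weighted average: frequent classes dominate.
total_support = sum(n for _, _, _, n in subset.values())
weighted_f1 = sum(f1 * n for _, _, f1, n in subset.values()) / total_support

print(weighted_f1 > macro_f1)  # True
```

On this subset the weighted average is pulled toward the frequent B_PER class, while the macro average is dragged down by the rare, low-scoring classes. The same effect is a plausible reason for the gap between the macro and weighted rows of the full table.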
