Part of Speech for Indonesian

Description

This model was imported from Hugging Face. It was fine-tuned on indonlu's POSP dataset for Indonesian part-of-speech tagging, and it uses RoBERTa embeddings with RobertaForTokenClassification.

Predicted Entities

PPO, KUA, ADV, PRN, VBI, PAR, VBP, NNP, UNS, VBT, VBL, NNO, ADJ, PRR, PRK, CCN, $$$, ADK, ART, CSN, NUM, SYM, INT, NEG, PRI, VBE

How to use

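# Python
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Wrap the raw text column in a document annotation
documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

# Split documents into sentences with the multilingual ("xx") detector
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
       .setInputCols(["document"])\
       .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

# Load the Indonesian POS tagger; tags are written to the "ner" column
tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_pos_tagger", "id")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("ner")

# Convert token-level tags into chunks with their labels
ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

# No stage needs training, so fitting on an empty DataFrame is enough
empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)
text = """Budi sedang pergi ke pasar."""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))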
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._  // required for .toDS; assumes a SparkSession named `spark`

val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
       .setInputCols(Array("document"))
       .setOutputCol("sentence")

val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")

// Load the Indonesian POS tagger; tags are written to the "ner" column
val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_pos_tagger", "id")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
      .setInputCols(Array("sentence", "token", "ner"))
      .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val example = Seq("Budi sedang pergi ke pasar.").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)
# NLU (Python)
import nlu
nlu.load("id.ner.pos").predict("""Budi sedang pergi ke pasar.""")

Results

+------+---------+
|chunk |ner_label|
+------+---------+
|Budi  |NNO      |
|sedang|ADK      |
|pergi |VBI      |
|ke    |PPO      |
|pasar |NNO      |
|.     |SYM      |
+------+---------+
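
Once the tags are collected from Spark, they are ordinary (token, tag) pairs. The following is a minimal pure-Python sketch of simple post-processing, assuming the pairs have already been pulled out of the result DataFrame; the pairs are copied from the table above, and the helper names `tag_counts` and `tokens_with_tag` are hypothetical, not part of Spark NLP.

```python
from collections import Counter

# (token, tag) pairs copied from the Results table above.
tagged = [
    ("Budi", "NNO"),
    ("sedang", "ADK"),
    ("pergi", "VBI"),
    ("ke", "PPO"),
    ("pasar", "NNO"),
    (".", "SYM"),
]


def tag_counts(pairs):
    """Count how often each tag occurs (hypothetical helper)."""
    return Counter(tag for _, tag in pairs)


def tokens_with_tag(pairs, wanted):
    """Return the tokens carrying the given tag, in sentence order."""
    return [token for token, tag in pairs if tag == wanted]


counts = tag_counts(tagged)
nno_tokens = tokens_with_tag(tagged, "NNO")

print(counts.most_common(1))  # [('NNO', 2)]
print(nno_tokens)             # ['Budi', 'pasar']
```

Keeping this step outside Spark is only reasonable for small outputs; for large corpora the same grouping would normally be done on the DataFrame itself.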

Model Information

Model Name: roberta_token_classifier_pos_tagger
Compatibility: Spark NLP 3.3.4+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: id
Size: 466.2 MB
Case sensitive: true
Max sentence length: 256

Data Source

https://huggingface.co/w11wo/indonesian-roberta-base-posp-tagger

Benchmarking

  metric      score
      F1     0.8893
Accuracy     0.9399
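
The card does not state how the F1 score was averaged or which evaluation split was used. As a rough sketch of how these two kinds of metric are commonly computed from token-level tags, the following assumes macro averaging (an assumption, not confirmed by the source). The gold and predicted tags are toy values for illustration only and are not taken from the benchmark; the function names are hypothetical.

```python
def accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over all tags seen in gold or pred.

    Macro averaging is an assumption; the model card does not say
    which averaging its F1 figure uses.
    """
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data for illustration only; one of six tokens is mis-tagged.
gold = ["NNO", "ADK", "VBI", "PPO", "NNO", "SYM"]
pred = ["NNO", "ADK", "VBI", "PPO", "NNP", "SYM"]

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")
```

On this toy input a single error costs more in macro F1 than in accuracy, because both the missed tag and the wrongly predicted tag lower their per-tag scores. A gap in the same direction appears in the figures above, though the card does not explain its cause.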