Detect Movie Entities - MIT Movie Complex (ner_mit_movie_complex_bert_base_cased)

Description

This NER model was trained on the complex queries dataset of the MIT Movie Corpus to detect movie trivia entities. It was trained with BertEmbeddings (bert_base_cased), so the same embeddings must be used at inference time.

Predicted Entities

  • Actor
  • Award
  • Character_Name
  • Director
  • Genre
  • Opinion
  • Origin
  • Plot
  • Quote
  • Relationship
  • Soundtrack
  • Year


How to use

# Python
import pandas as pd
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import BertEmbeddings, NerConverter, NerDLModel, Tokenizer
from sparknlp.base import DocumentAssembler

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# The NER model expects the same embeddings it was trained with
embeddings = BertEmbeddings \
    .pretrained('bert_base_cased', 'en') \
    .setInputCols(['document', 'token']) \
    .setOutputCol('embeddings')

ner_model = NerDLModel.pretrained('ner_mit_movie_complex_bert_base_cased', 'en') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

# Merges token-level tags into entity chunks
ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

example = spark.createDataFrame(pd.DataFrame({'text': ['My name is John!']}))
result = pipeline.fit(example).transform(example)

// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes an active SparkSession named spark

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_base_cased", "en")
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val ner_model = NerDLModel.pretrained("ner_mit_movie_complex_bert_base_cased", "en")
  .setInputCols("document", "token", "embeddings")
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols("document", "token", "ner")
  .setOutputCol("entities")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, ner_model, ner_converter))

val data = Seq("My name is John!").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)

# NLU
import nlu

text = ["My name is John!"]

ner_df = nlu.load('en.ner.ner_mit_movie_complex_bert_base_cased').predict(text, output_level='token')
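
The 'ner' column holds one tag per token, while the 'entities' column produced by NerConverter holds merged chunks. The sketch below illustrates that grouping step in plain Python. It is not the library's implementation: it assumes the model emits IOB-style tags such as B-Director and I-Director, and the function name group_iob_tags, the example tokens, and the hand-written tags are hypothetical, not real model output.

```python
from typing import List, Optional, Tuple


def group_iob_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB tags (B-X, I-X, O) into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the chunk being built, if there is one.
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        starts_chunk = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_chunk:
            # A B- tag, or an I- tag whose label does not continue the
            # open chunk, begins a new chunk (lenient handling).
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the open chunk.
            current_tokens.append(token)
        else:
            # "O" (outside) ends any open chunk.
            flush()
    flush()
    return chunks


# Hand-written tags for illustration only, not actual model predictions.
tokens = ["a", "movie", "directed", "by", "Steven", "Spielberg", "from", "1993"]
tags = ["O", "O", "O", "O", "B-Director", "I-Director", "O", "B-Year"]

chunks = group_iob_tags(tokens, tags)
print(chunks)  # [('Steven Spielberg', 'Director'), ('1993', 'Year')]
```

With the real pipeline, the merged chunks are already available in the 'entities' column, so this grouping does not need to be done by hand.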

Model Information

Model Name: ner_mit_movie_complex_bert_base_cased
Type: ner
Compatibility: Spark NLP 3.1.3+
License: Open Source
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en

Data Source

https://groups.csail.mit.edu/sls/downloads/movie/

Benchmarking

processed 15904 tokens with 2278 phrases; found: 2292 phrases; correct: 1664.
accuracy:  88.81%; (non-O)
accuracy:  88.78%; precision:  72.60%; recall:  73.05%; FB1:  72.82
Actor: precision:  96.46%; recall:  94.97%; FB1:  95.71  509
Award: precision:  63.64%; recall:  61.76%; FB1:  62.69  33
Character_Name: precision:  61.62%; recall:  68.54%; FB1:  64.89  99
Director: precision:  83.43%; recall:  84.36%; FB1:  83.89  181
Genre: precision:  74.07%; recall:  73.62%; FB1:  73.85  324
Opinion: precision:  39.18%; recall:  46.91%; FB1:  42.70  97
Origin: precision:  35.37%; recall:  40.85%; FB1:  37.91  82
Plot: precision:  53.95%; recall:  53.60%; FB1:  53.77  621
Quote: precision:  64.29%; recall:  39.13%; FB1:  48.65  14
Relationship: precision:  48.00%; recall:  50.00%; FB1:  48.98  50
Soundtrack: precision:  80.00%; recall:  57.14%; FB1:  66.67  5
Year: precision:  94.22%; recall:  93.88%; FB1:  94.05  277
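
The report above follows the conlleval output layout, where the unlabeled trailing number on each entity line is the count of phrases the model predicted for that type. As a sanity check, the overall precision, recall, and FB1 can be reproduced from the three counts on the first line. The helper name prf1 below is hypothetical; the numbers are copied from the report.

```python
from typing import Tuple


def prf1(correct: int, found: int, gold: int) -> Tuple[float, float, float]:
    """Phrase-level precision, recall, and F1 from raw counts."""
    precision = correct / found  # share of predicted phrases that are right
    recall = correct / gold      # share of gold phrases that were recovered
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the first line of the report:
# 2278 gold phrases, 2292 predicted phrases, 1664 correct.
precision, recall, f1 = prf1(correct=1664, found=2292, gold=2278)
print(f"precision: {precision:.2%}; recall: {recall:.2%}; FB1: {100 * f1:.2f}")
# precision: 72.60%; recall: 73.05%; FB1: 72.82

# Per-entity predicted-phrase counts (last column of each entity line).
found_per_type = {
    "Actor": 509, "Award": 33, "Character_Name": 99, "Director": 181,
    "Genre": 324, "Opinion": 97, "Origin": 82, "Plot": 621,
    "Quote": 14, "Relationship": 50, "Soundtrack": 5, "Year": 277,
}
total_found = sum(found_per_type.values())
print(total_found)  # 2292, matching "found: 2292 phrases"
```

Because the overall scores are micro-averaged over phrases, frequent types such as Plot (621 predicted phrases) weigh far more heavily than rare ones such as Soundtrack (5).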