Question Pair Classifier

Description

Identifies whether two questions are semantically equivalent (near-duplicates) or different.

Predicted Entities

almost_same, not_same.


How to use

  • The model was trained with sent_electra_large_uncased embeddings, so the same embeddings must be used in the prediction pipeline.

  • The two questions must be marked with "q1" and "q2" in the text. The input text should have the following format: text = "q1: What is your name? q2: Who are you?"
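The input format described above can be produced with a small helper. This is a minimal sketch: the function name format_question_pair is hypothetical and not part of Spark NLP.

```python
def format_question_pair(q1: str, q2: str) -> str:
    """Join two questions into the "q1: ... q2: ..." text format.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    # Strip leading and trailing whitespace so the markers stay cleanly separated.
    return f"q1: {q1.strip()} q2: {q2.strip()}"


text = format_question_pair("What is your name?", "Who are you?")
print(text)  # q1: What is your name? q2: Who are you?
```

The resulting string can be passed as the text input of the pipeline.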

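# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLModel

# Assumes an active Spark NLP session named `spark`, e.g. spark = sparknlp.start()

# Wrap the raw text column into a document annotation
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Use the same sentence embeddings the classifier was trained with
embeddings = BertSentenceEmbeddings.pretrained("sent_electra_large_uncased", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

document_classifier = ClassifierDLModel.pretrained("classifierdl_electra_questionpair", "en") \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

nlpPipeline = Pipeline(stages=[document, embeddings, document_classifier])

# Fit on an empty DataFrame: every stage is pretrained, so no training data is needed
light_pipeline = LightPipeline(nlpPipeline.fit(spark.createDataFrame([['']]).toDF("text")))

result_1 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie do you like most?")
print(result_1["class"])

result_2 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie genre would you like to watch?")
print(result_2["class"])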
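
// Scala
import com.johnsnowlabs.nlp.{DocumentAssembler, LightPipeline}
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
import org.apache.spark.ml.Pipeline
import spark.implicits._ // assumes an active SparkSession named `spark`

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Use the same sentence embeddings the classifier was trained with
val embeddings = BertSentenceEmbeddings.pretrained("sent_electra_large_uncased", "en")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val document_classifier = ClassifierDLModel.pretrained("classifierdl_electra_questionpair", "en")
  .setInputCols(Array("document", "sentence_embeddings"))
  .setOutputCol("class")

val nlpPipeline = new Pipeline().setStages(Array(document, embeddings, document_classifier))

// Fit on an empty DataFrame: every stage is pretrained, so no training data is needed
val light_pipeline = new LightPipeline(nlpPipeline.fit(Seq("").toDF("text")))

val result_1 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie do you like most?")

val result_2 = light_pipeline.annotate("q1: What is your favorite movie? q2: Which movie genre would you like to watch?")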

# NLU (Python)
import nlu
nlu.load("en.classify.questionpair").predict("""q1: What is your favorite movie? q2: Which movie genre would you like to watch?""")

Results

['almost_same']
['not_same']
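
Each printed result is the list stored under the "class" key of the dictionary returned by annotate. A minimal sketch of turning that list into a boolean follows. The helper name is_duplicate is hypothetical, and the dictionaries are hand-written stand-ins shaped like the results above, not live model output.

```python
def is_duplicate(result: dict) -> bool:
    """Return True when the first predicted label is "almost_same".

    Hypothetical helper; expects a dict with a "class" list, as printed above.
    """
    labels = result.get("class", [])
    # An empty or missing label list is treated as "not a duplicate".
    return bool(labels) and labels[0] == "almost_same"


# Hand-written stand-ins for the two results shown above
mock_same = {"class": ["almost_same"]}
mock_different = {"class": ["not_same"]}

print(is_duplicate(mock_same), is_duplicate(mock_different))  # True False
```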

Model Information

Model Name: classifierdl_electra_questionpair
Compatibility: Spark NLP 3.1.3+
License: Open Source
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: en

Data Source

A custom dataset based on the Quora question pairs release: https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

Benchmarking

              precision    recall  f1-score   support

 almost_same       0.85      0.91      0.88     29652
    not_same       0.90      0.84      0.87     29634

    accuracy                           0.88     59286
   macro avg       0.88      0.88      0.88     59286
weighted avg       0.88      0.88      0.88     59286
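
The f1-score column can be checked against the precision and recall columns with the usual definition F1 = 2PR / (P + R). The sketch below redoes that arithmetic from the rounded values in the table, so it is only a consistency check and not a re-evaluation of the model. The names f1_score and rows are illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) copied from the table above
rows = {
    "almost_same": (0.85, 0.91, 29652),
    "not_same": (0.90, 0.84, 29634),
}

# Recompute the per-class F1 and the total support
f1 = {label: f1_score(p, r) for label, (p, r, _) in rows.items()}
total_support = sum(n for _, _, n in rows.values())

for label, value in f1.items():
    print(f"{label}: {value:.2f}")
print("total support:", total_support)
```

Both recomputed per-class scores agree with the table to two decimals, and the supports sum to the reported total of 59286.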