Self Report Age Classifier (BioBERT - Reddit)

Description

This model is a BioBERT-based classifier that detects whether a social media forum (Reddit) post contains a self-reported exact age.

Predicted Entities

self_report_age, no_report


How to use

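# Python
# Assumes an active Spark session named `spark` with Spark NLP for Healthcare loaded.
from pyspark.ml import Pipeline
from pyspark.sql.types import StringType
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from sparknlp_jsl.annotator import MedicalBertForSequenceClassification

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')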

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_exact_age_reddit", "en", "clinical/models")\
    .setInputCols(["document",'token'])\
    .setOutputCol("class")

pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame(["Is it bad for a 19 year old it's been getting worser.",
                              "I was about 10. So not quite as young as you but young."], StringType()).toDF("text")
                              
result = pipeline.fit(data).transform(data)

result.select("text", "class.result").show(truncate=False)

// Scala
val documenter = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_exact_age_reddit", "en", "clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("class")

val pipeline = new Pipeline().setStages(Array(documenter, tokenizer, sequenceClassifier))

// Each post is one row; toDS needs spark.implicits._ in scope.
import spark.implicits._
val data = Seq("Is it bad for a 19 year old it's been getting worser.",
               "I was about 10. So not quite as young as you but young.").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

# NLU (Python)
import nlu
nlu.load("en.classify.exact_age").predict("""I was about 10. So not quite as young as you but young.""")

Results

+-------------------------------------------------------+-----------------+
|text                                                   |result           |
+-------------------------------------------------------+-----------------+
|Is it bad for a 19 year old it's been getting worser.  |[self_report_age]|
|I was about 10. So not quite as young as you but young.|[no_report]      |
+-------------------------------------------------------+-----------------+
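
The result column holds a list with one label per document. The following is a minimal sketch, in plain Python, of how such output might be filtered downstream. It assumes the rows have already been collected to the driver and are represented as (text, labels) tuples; the two rows are copied from the table above, and the helper name posts_with_self_reported_age is hypothetical, not part of any library.

```python
from typing import List, Tuple

# Rows copied from the results table: (text, class.result).
# Each result is a list holding a single predicted label.
rows: List[Tuple[str, List[str]]] = [
    ("Is it bad for a 19 year old it's been getting worser.", ["self_report_age"]),
    ("I was about 10. So not quite as young as you but young.", ["no_report"]),
]


def posts_with_self_reported_age(rows: List[Tuple[str, List[str]]]) -> List[str]:
    """Return the texts whose first predicted label is self_report_age."""
    return [
        text
        for text, labels in rows
        if labels and labels[0] == "self_report_age"
    ]


selected = posts_with_self_reported_age(rows)
print(selected)
```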

Model Information

Model Name: bert_sequence_classifier_exact_age_reddit
Compatibility: Healthcare NLP 4.0.0+
License: Licensed
Edition: Official
Input Labels: [document, token]
Output Labels: [class]
Language: en
Size: 406.5 MB
Case sensitive: true
Max sentence length: 128

References

The dataset is disease-specific and consists of posts collected via a series of keywords associated with dry eye disease.

Benchmarking

          label  precision    recall  f1-score   support
      no_report     0.9324    0.9577    0.9449      1325
self_report_age     0.9124    0.8637    0.8874       675
       accuracy     -         -         0.9260      2000
      macro-avg     0.9224    0.9107    0.9161      2000
   weighted-avg     0.9256    0.9260    0.9255      2000
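
The derived rows of the table can be cross-checked from the per-class rows. The sketch below is an illustration rather than the evaluation script used for this model: it recomputes F1 as the harmonic mean of precision and recall, the macro average as the unweighted mean over classes, the weighted average using support as weights, and accuracy as the support-weighted recall. The variable names are chosen here for illustration.

```python
# Per-class rows copied from the benchmark table: (precision, recall, support).
per_class = {
    "no_report": (0.9324, 0.9577, 1325),
    "self_report_age": (0.9124, 0.8637, 675),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total = sum(support for _, _, support in per_class.values())
f1_scores = {label: f1(p, r) for label, (p, r, _) in per_class.items()}

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(
    f1_scores[label] * support for label, (_, _, support) in per_class.items()
) / total

# Accuracy equals the support-weighted recall, since recall * support
# is the number of correctly classified examples in each class.
accuracy = sum(r * support for _, r, support in per_class.values()) / total

for label, score in f1_scores.items():
    print(f"{label:>15}  f1={score:.4f}")
print(f"macro_f1={macro_f1:.4f}  weighted_f1={weighted_f1:.4f}  accuracy={accuracy:.4f}")
```

Because the table values are rounded to four decimals, the recomputed figures should agree with the table only up to rounding.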