Classifier for Genders - BIOBERT

Description

This model classifies the gender of the patient in the clinical document.

Predicted Entities

`Female`, `Male`, `Unknown`.


How to use

To classify your text, use this model as part of an NLP pipeline with the following stages: DocumentAssembler, Tokenizer, BertEmbeddings (biobert_pubmed_base_cased), SentenceEmbeddings, ClassifierDLModel.

...
biobert_embeddings = BertEmbeddings.pretrained('biobert_pubmed_base_cased') \
    .setInputCols(["document","token"])\
    .setOutputCol("bert_embeddings")

sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "bert_embeddings"]) \
    .setOutputCol("sentence_bert_embeddings") \
    .setPoolingStrategy("AVERAGE")

gender_classifier = ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \
    .setInputCols(["document", "sentence_bert_embeddings"]) \
    .setOutputCol("gender")

nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, biobert_embeddings, sentence_embeddings, gender_classifier])

light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

annotations = light_pipeline.fullAnnotate("""social history: shows that  does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""")
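
To read the predicted label out of `annotations`, a minimal sketch follows. It assumes that `fullAnnotate` returns a list with one dictionary per input text, keyed by output column name, whose values are lists of annotation objects exposing a `result` attribute. `FakeAnnotation` and `predicted_labels` are hypothetical names introduced here so the snippet runs without a Spark session; with a real pipeline you would pass `annotations` to the helper directly.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in for a Spark NLP annotation (result + metadata only)."""
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def predicted_labels(annotations: List[dict], output_col: str = "gender") -> List[List[str]]:
    """Collect the `result` string of every annotation in `output_col`, per input text."""
    return [[ann.result for ann in doc.get(output_col, [])] for doc in annotations]


# Illustrative data only: shaped like the assumed fullAnnotate output, not real model output.
example = [{"gender": [FakeAnnotation(result="Female")]}]
print(predicted_labels(example))
```

The helper returns one list of labels per input text, so a single call to `fullAnnotate` with one string yields a one-element outer list.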

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val biobert_embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
    .setInputCols(Array("document","token"))
    .setOutputCol("bert_embeddings")

val sentence_embeddings = new SentenceEmbeddings()
    .setInputCols(Array("document", "bert_embeddings"))
    .setOutputCol("sentence_bert_embeddings")
    .setPoolingStrategy("AVERAGE") 

val genderClassifier = ClassifierDLModel.pretrained("classifierdl_gender_biobert", "en", "clinical/models")
    .setInputCols(Array("document", "sentence_bert_embeddings"))
    .setOutputCol("gender")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, biobert_embeddings, sentence_embeddings, genderClassifier))

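// Needed for .toDS() on a Seq; assumes an active SparkSession named `spark`.
import spark.implicits._

val data = Seq("""social history: shows that  does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""").toDS().toDF("text")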

val result = pipeline.fit(data).transform(data)

import nlu
nlu.load("en.classify.gender.biobert").predict("""social history: shows that  does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""")
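
The `SentenceEmbeddings` stage in the pipelines above is configured with the `AVERAGE` pooling strategy, which collapses the per-token vectors of a document into a single vector by taking their element-wise mean. The sketch below illustrates that idea in plain Python; `average_pool` is a hypothetical helper and the numbers are made up, so this is a conceptual illustration rather than Spark NLP's own implementation.

```python
from typing import List


def average_pool(token_vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


# Toy 3-dimensional vectors for three tokens (made-up numbers, not BioBERT output).
tokens = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
print(average_pool(tokens))  # [2.0, 2.0, 2.0]
```

The pooled vector has the same dimension as each token vector, which is why the classifier can take it as a fixed-size input regardless of document length.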

Results

Female

Model Information

Model Name: classifierdl_gender_biobert
Type: ClassifierDLModel
Compatibility: Healthcare NLP 2.6.5+
Edition: Official
License: Licensed
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: [en]
Case sensitive: True

Data Source

This model is trained on more than four thousand clinical documents (radiology reports, pathology reports, clinical visits, etc.), annotated internally.

Benchmarking

label          precision    recall    f1-score   support

Female          0.9224      0.8954    0.9087       239
Male            0.7895      0.8468    0.8171       124
Unknown         0.8077      0.7778    0.7925        54

accuracy                              0.8657       417
macro-avg       0.8399      0.8400    0.8394       417
weighted-avg    0.8680      0.8657    0.8664       417
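
The summary rows follow from the per-class rows, and the short sketch below recomputes them as a sanity check. It assumes the usual definitions: macro-avg is the unweighted mean over the three classes, weighted-avg weights each class by its support, and f1-score is the harmonic mean of precision and recall. The helper names `macro`, `weighted`, and `f1` are introduced here for illustration only.

```python
# Per-class rows copied from the benchmarking table: (precision, recall, f1, support).
rows = {
    "Female":  (0.9224, 0.8954, 0.9087, 239),
    "Male":    (0.7895, 0.8468, 0.8171, 124),
    "Unknown": (0.8077, 0.7778, 0.7925, 54),
}

total = sum(r[3] for r in rows.values())  # total support across the three classes


def macro(i):
    """Unweighted mean of column i across the classes."""
    return sum(r[i] for r in rows.values()) / len(rows)


def weighted(i):
    """Mean of column i weighted by each class's support."""
    return sum(r[i] * r[3] for r in rows.values()) / total


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


for name, idx in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:<10} macro={macro(idx):.4f} weighted={weighted(idx):.4f}")

for label, (p, r, f, _) in rows.items():
    print(f"{label:<8} f1 from P and R = {f1(p, r):.4f} (table: {f})")
```

The recomputed values agree with the table to within rounding of the four-decimal inputs. The support-weighted recall also matches the reported accuracy, as expected, since weighting per-class recall by support counts the correctly classified documents over the total.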