Classifier for Genders - BIOBERT


This model classifies the gender of the patient in the clinical document using context.

Predicted Entities

Female, Male, Unknown


How to use

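# Python. Assumes an active `spark` session with access to the licensed clinical models.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer, BertEmbeddings, SentenceEmbeddings, ClassifierDLModel

# Turn the raw "text" column into document annotations.
document_assembler = DocumentAssembler() \
  .setInputCol("text") \
  .setOutputCol("document")

tokenizer = Tokenizer() \
  .setInputCols(["document"]) \
  .setOutputCol("token")

# Token-level BioBERT embeddings (the dependency listed under Model Information).
biobert_embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") \
  .setInputCols(["document", "token"]) \
  .setOutputCol("bert_embeddings")

# Pool the token embeddings into one vector per document.
sentence_embeddings = SentenceEmbeddings() \
  .setInputCols(["document", "bert_embeddings"]) \
  .setOutputCol("sentence_bert_embeddings")

genderClassifier = ClassifierDLModel.pretrained("classifierdl_gender_biobert", "en", "clinical/models") \
  .setInputCols(["document", "sentence_bert_embeddings"]) \
  .setOutputCol("class")

nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, biobert_embeddings, sentence_embeddings, genderClassifier])

# Every stage is pretrained, so fitting on an empty DataFrame only wires the columns together.
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

annotations = light_pipeline.fullAnnotate("""social history: shows that  does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""")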
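
// Scala. Assumes an active `spark` session with access to the licensed clinical models.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val biobert_embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
  .setInputCols(Array("document", "token"))
  .setOutputCol("bert_embeddings")

val sentence_embeddings = new SentenceEmbeddings()
  .setInputCols(Array("document", "bert_embeddings"))
  .setOutputCol("sentence_bert_embeddings")

val genderClassifier = ClassifierDLModel.pretrained("classifierdl_gender_biobert", "en", "clinical/models")
  .setInputCols(Array("document", "sentence_bert_embeddings"))
  .setOutputCol("class")

val nlp_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, biobert_embeddings, sentence_embeddings, genderClassifier))

val data = Seq("""social history: shows that  does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""").toDS.toDF("text")

val result = nlp_pipeline.fit(data).transform(data)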

# NLU (Python)
import nlu
nlu.load("en.classify.gender.biobert").predict("""social history: shows that  does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.""")
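
Reading the prediction back out of the Python `annotations` value can be sketched as follows. This is a minimal sketch, assuming `fullAnnotate` returns a list with one dict per input text, keyed by output column, whose values are annotation objects exposing `result` (the label) and `metadata`. It also assumes the classifier writes to the `class` column named under Output Labels in Model Information, and that `metadata` carries per-label scores as strings; check both against your Spark NLP version. `FakeAnnotation`, `extract_predictions`, and the sample scores are hypothetical stand-ins for illustration, not output from this model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FakeAnnotation:
    """Hypothetical stand-in exposing only the two fields read below."""
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def extract_predictions(annotations: List[dict],
                        output_col: str = "class") -> List[Tuple[str, Optional[float]]]:
    """Return (label, score) pairs; score is None when metadata has no entry for the label."""
    predictions = []
    for doc in annotations:
        for ann in doc.get(output_col, []):
            raw = ann.metadata.get(ann.result)  # assumption: score stored under the label name
            score = float(raw) if raw is not None else None
            predictions.append((ann.result, score))
    return predictions


# Made-up placeholder values shaped like the assumed fullAnnotate output.
sample = [{"class": [FakeAnnotation("Female", {"Female": "0.97", "Male": "0.02", "Unknown": "0.01"})]}]
print(extract_predictions(sample))
```

With a real pipeline, the call would be `extract_predictions(annotations)`; pass a different `output_col` if the classifier stage was given another output column name.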



Model Information

Model Name: classifierdl_gender_biobert
Compatibility: Spark NLP 2.7.1+
License: Licensed
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: en
Dependencies: biobert_pubmed_base_cased

Data Source

This model was trained on more than four thousand clinical documents (radiology reports, pathology reports, clinical visits, etc.), annotated internally.


Benchmarking

label            precision    recall  f1-score   support
Female              0.9020    0.9364    0.9189       236
Male                0.8761    0.7857    0.8285       126
Unknown             0.7091    0.7647    0.7358        51
accuracy                 -         -    0.8692       413
macro-avg           0.8291    0.8290    0.8277       413
weighted-avg        0.8703    0.8692    0.8687       413
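
The summary rows follow from the per-class rows, which the short check below makes explicit. It is a sketch that uses only the numbers in the table: the macro average is the unweighted mean over the three classes, the weighted average weights each class by its support, and accuracy is recovered by rounding `recall * support` to a count of correctly classified documents per class. The variable and function names are illustrative. Because the inputs are already rounded to four decimals, recomputed values can differ from the table in the last digit.

```python
# Per-class rows copied from the table: (precision, recall, f1, support).
rows = {
    "Female": (0.9020, 0.9364, 0.9189, 236),
    "Male": (0.8761, 0.7857, 0.8285, 126),
    "Unknown": (0.7091, 0.7647, 0.7358, 51),
}

total = sum(r[3] for r in rows.values())


def macro(i: int) -> float:
    """Unweighted mean of column i across classes."""
    return sum(r[i] for r in rows.values()) / len(rows)


def weighted(i: int) -> float:
    """Mean of column i weighted by each class's support."""
    return sum(r[i] * r[3] for r in rows.values()) / total


# recall * support approximates the number of correctly classified documents per class.
correct = {label: round(r[1] * r[3]) for label, r in rows.items()}
accuracy = sum(correct.values()) / total

macro_avg = tuple(macro(i) for i in range(3))
weighted_avg = tuple(weighted(i) for i in range(3))

print("support total:", total)
print("correct per class:", correct)
print("accuracy: %.4f" % accuracy)
print("macro-avg: %.4f %.4f %.4f" % macro_avg)
print("weighted-avg: %.4f %.4f %.4f" % weighted_avg)
```

Note that the weighted-average recall equals the accuracy, as expected when every document belongs to exactly one class.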