BioBERT Embeddings (Pubmed Large)

Description

This model contains the pre-trained weights of BioBERT, a language representation model for the biomedical domain, designed especially for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc. The details are described in the paper "BioBERT: a pre-trained biomedical language representation model for biomedical text mining".

Download

How to use

...
embeddings = BertEmbeddings.pretrained("biobert_pubmed_large_cased", "en") \
      .setInputCols("sentence", "token") \
      .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
result = pipeline_model.transform(spark.createDataFrame(pd.DataFrame({"text": ["I hate cancer"]})))
...
val embeddings = BertEmbeddings.pretrained("biobert_pubmed_large_cased", "en")
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
val data = Seq("I hate cancer").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu

text = ["I hate cancer"]
embeddings_df = nlu.load('en.embed.biobert.pubmed_large_cased').predict(text, output_level='token')
embeddings_df

Results

token     en_embed_biobert_pubmed_large_cased_embeddings
I         [-0.041047871112823486, 0.24242812395095825, 0...
hate      [-0.6859451532363892, -0.45743268728256226, -0...
cancer    [-0.12403186410665512, 0.6688604354858398, -0....
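Each value in the embeddings column is a 1024-dimensional vector (the numbers shown above are truncated). A common way to compare such token vectors is cosine similarity. The sketch below uses plain Python and short hypothetical stand-in vectors in place of the real 1024-dimensional BioBERT output, purely to illustrate the computation:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical truncated stand-ins for two token embeddings;
# real vectors from this model have 1024 components.
vec_hate = [-0.69, -0.46, -0.31]
vec_cancer = [-0.12, 0.67, -0.25]

print(cosine_similarity(vec_hate, vec_cancer))
```

With the real model output, the same function applies unchanged to the 1024-dimensional vectors extracted from the result DataFrame.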

Model Information

Model Name: biobert_pubmed_large_cased
Type: embeddings
Compatibility: Spark NLP 2.6.2
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [word_embeddings]
Language: [en]
Dimension: 1024
Case sensitive: true

Data Source

The model is imported from https://github.com/dmis-lab/biobert