MedEmbed large: Specialized Embedding Model for Medical and Clinical Information Retrieval (OpenVINO)

Description

MedEmbed is a family of embedding models fine-tuned specifically for medical and clinical data, designed to enhance performance in healthcare-related natural language processing (NLP) tasks, particularly information retrieval.

GitHub Repo: https://github.com/abhinand5/MedEmbed
Technical Blog Post: https://huggingface.co/blog/abhinand/medembed-finetuned-embedding-models-for-medical-ir

This model is intended for use in medical and clinical contexts to improve information retrieval, question answering, and semantic search tasks. It can be integrated into healthcare systems, research tools, and medical literature databases to enhance search capabilities and information access.
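As a rough illustration of how sentence embeddings support semantic search, the sketch below ranks documents by cosine similarity to a query vector. The helper names (`cosine_similarity`, `rank_documents`) and the 3-dimensional toy vectors are hypothetical and chosen only for readability; they are not outputs of this model, whose embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (assumed values, not model output).
query = [1.0, 0.0, 1.0]
docs = {
    "doc_a": [1.0, 0.0, 0.9],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.5, 0.5, 0.5],
}

ranking = rank_documents(query, docs)
print([doc_id for doc_id, _ in ranking])  # ['doc_a', 'doc_c', 'doc_b']
```

In practice the query and document vectors would come from the `embeddings` field produced by the pipeline shown under "How to use".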


How to use

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

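# load() reads a saved model from a path, so the model files are assumed
# to be available at this location.
bert_loaded = BertSentenceEmbeddings.load("bert_medembed_large_v0_1_openvino")\
    .setInputCols(["document"])\
    .setOutputCol("bert")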

pipeline = Pipeline(
    stages=[
        document_assembler,
        bert_loaded
    ])

# Assumes an active SparkSession named `spark`, e.g. one created with sparknlp.start().
data = spark.createDataFrame([
    ['William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist.']
]).toDF("text")

model = pipeline.fit(data)
result = model.transform(data)

result.selectExpr("explode(bert.embeddings) as embeddings").show()

import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.explode
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val bertEmbeddings = BertSentenceEmbeddings.load("bert_medembed_large_v0_1_openvino")
  .setInputCols("document")
  .setOutputCol("bert")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  bertEmbeddings
))

val data = Seq(
  "William Henry Gates III (born October 28, 1955) is an American business magnate, software developer, investor, and philanthropist."
).toDF("text")

val model = pipeline.fit(data)
val result = model.transform(data)

result.select(explode($"bert.embeddings").alias("embeddings")).show(false)

Results


+--------------------+
|          embeddings|
+--------------------+
|[0.461878, -0.240...|
+--------------------+

Model Information

Model Name: bert_medembed_large_v0_1_openvino
Compatibility: Spark NLP 6.0.0+
License: Licensed
Edition: Official
Input Labels: [document]
Output Labels: [bert]
Language: en
Size: 1.2 GB
Case sensitive: true