Description
This model was trained with the word2vec algorithm on clinical and biomedical datasets curated in-house. The dataset curation cut-off date is March 2023, so the model is expected to generalize better on recent content. The model is around 1 GB in size and produces 200-dimensional vectors. Our benchmark tests indicate that it can replace our legacy clinical embeddings (embeddings_clinical) when training a new model; existing models must continue to use the legacy embeddings they were trained with.
How to use
Python:
from sparknlp.annotator import WordEmbeddingsModel
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium","en","clinical/models")\
    .setInputCols(["document","token"])\
    .setOutputCol("word_embeddings")

Scala:
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical_medium","en","clinical/models")
    .setInputCols(Array("document","token"))
    .setOutputCol("word_embeddings")
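The embeddings stage expects `document` and `token` columns, so it is normally placed after a document assembler and a tokenizer. The following is a minimal sketch of such a pipeline, not an official recipe: the helper names `build_embeddings_pipeline` and `embed_texts`, the input column name `text`, and the choice of a plain `Tokenizer` are assumptions made for illustration. It also assumes a Spark session with a valid Healthcare NLP license is already available, since the model is served from `clinical/models`.

```python
# Model coordinates taken from this model card.
MODEL_NAME = "embeddings_clinical_medium"
LANGUAGE = "en"
REMOTE_LOC = "clinical/models"


def build_embeddings_pipeline():
    """Assemble text -> document -> token -> word_embeddings (hypothetical helper)."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, WordEmbeddingsModel

    # Wrap the raw "text" column (assumed name) into document annotations.
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    # Split each document into tokens.
    tokenizer = Tokenizer() \
        .setInputCols(["document"]) \
        .setOutputCol("token")

    # Look up a vector for every token, as in the snippet above.
    embeddings = WordEmbeddingsModel.pretrained(MODEL_NAME, LANGUAGE, REMOTE_LOC) \
        .setInputCols(["document", "token"]) \
        .setOutputCol("word_embeddings")

    return Pipeline(stages=[document_assembler, tokenizer, embeddings])


def embed_texts(spark, texts):
    """Run the pipeline on a list of strings; return one row per token (hypothetical helper)."""
    from pyspark.sql import functions as F

    data = spark.createDataFrame([[t] for t in texts]).toDF("text")
    model = build_embeddings_pipeline().fit(data)
    result = model.transform(data)

    # Each annotation carries the token text in "result" and its vector in "embeddings".
    return result \
        .select(F.explode("word_embeddings").alias("ann")) \
        .select(F.col("ann.result").alias("token"),
                F.col("ann.embeddings").alias("vector"))
```

Given a licensed session named `spark`, calling `embed_texts(spark, ["Patient denies chest pain."])` should return a DataFrame with one token per row and its vector alongside it; per the model information, each vector is expected to have 200 entries.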
Model Information
| Model Name: | embeddings_clinical_medium | 
| Type: | embeddings | 
| Compatibility: | Healthcare NLP 4.3.2+ | 
| License: | Licensed | 
| Edition: | Official | 
| Input Labels: | [document, token] | 
| Output Labels: | [word_embeddings] | 
| Language: | en | 
| Size: | 787.5 MB | 
| Case sensitive: | true | 
| Dimension: | 200 |