On-Premise Deployment of John Snow Labs Medical LLMs
Deploying Medical LLMs within your own infrastructure keeps your data fully under your control and helps you comply with local security policies. This guide walks you through deploying a Medical LLM with Docker on your server.
Prerequisites: Before you begin, make sure the following are in place:
- John Snow Labs License: You need a valid John Snow Labs license file (license.json). Contact your account manager if you don’t have one.
- Docker Installed: Ensure Docker is installed and running on your machine. You can verify by running:
docker --version
Note: For GPU acceleration, your system must have compatible NVIDIA GPUs with the NVIDIA Container Toolkit installed.
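If you prefer to script the checks, the short sketch below (illustrative only, not part of the product) verifies both prerequisites from the host; it assumes docker and, on GPU machines, nvidia-smi are on your PATH:
import shutil
import subprocess

# Docker must be installed and on PATH
assert shutil.which("docker"), "Docker is not installed or not on PATH"
print(subprocess.run(["docker", "--version"], capture_output=True, text=True).stdout)

# For GPU acceleration, the NVIDIA driver tooling should also be available on the host
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)
else:
    print("nvidia-smi not found: GPU acceleration will not be available")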
John Snow Labs provides a ready-to-use Docker image for deploying the Medical LLMs on your own infrastructure. The image is available on Docker Hub and can be pulled using the following command:
docker pull johnsnowlabs/jsl-llms:latest
Use the command below to start the container, replacing <path_to_jsl_license> with the absolute path to your license.json file on the host:
docker run -d \
--gpus all \
-v "<path_to_jsl_license>:/app/license.json" \
-p 8080:8080 \
--ipc=host \
johnsnowlabs/jsl-llms \
--model Medical-LLM-7B \
--port 8080
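Model weights can take a while to load after the container starts. The readiness check below is a minimal sketch, assuming the API is reachable on localhost:8080 (per the -p mapping above) and requires no authentication; it simply retries a tiny chat request until the server answers:
import time
import requests

# Minimal chat request used purely as a readiness probe
probe = {
    "model": "Medical-LLM-7B",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 1,
}

for _ in range(30):
    try:
        r = requests.post("http://localhost:8080/v1/chat/completions", json=probe, timeout=10)
        if r.status_code == 200:
            print("Model server is ready")
            break
    except requests.exceptions.RequestException:
        pass  # container still starting up
    time.sleep(10)
else:
    print("Server did not become ready; inspect the container with `docker logs`")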
The following models are currently available for on-premise deployments:
| Model Name | Parameters | Recommended GPU Memory | Max Sequence Length | Model Size | Max KV-Cache | Tensor Parallel Sizes |
|---|---|---|---|---|---|---|
| Medical-LLM-7B | 7B | ~25 GB | 16K | 14 GB | 11 GB | 1, 2, 4 |
| Medical-LLM-10B | 10B | ~35 GB | 32K | 19 GB | 15 GB | 1, 2, 4 |
| Medical-LLM-14B | 14B | ~40 GB | 16K | 27 GB | 13 GB | 1, 2 |
| Medical-LLM-24B | 24B | ~69 GB | 32K | 44 GB | 25 GB | 1, 2, 4, 8 |
| Medical-LLM-Small | 14B | ~58 GB | 32K | 28 GB | 30 GB | 1, 2, 4, 8 |
| Medical-LLM-Medium | 70B | ~452 GB | 128K | 131 GB | 320 GB | 4, 8 |
| Medical-Reasoning-LLM-14B | 14B | ~58 GB | 32K | 28 GB | 30 GB | 1, 2, 4, 8 |
| Medical-Reasoning-LLM-32B | 32B | ~222 GB | 128K | 61 GB | 160 GB | 2, 4, 8 |
Note: All memory calculations are based on half-precision (fp16/bf16) weights. Recommended GPU Memory considers the model size and the maximum key-value cache at the model’s maximum sequence length. These calculations follow the guidelines from DJL’s LMI Deployment Guide.
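In other words, the recommended figure is roughly the fp16 weight footprint plus the maximum KV-cache. The small helper below is illustrative only (not part of the product) and simply reproduces the table's numbers:
# Rule of thumb from the note above:
# recommended GPU memory ≈ fp16 model size + max KV-cache at the max sequence length
def recommended_gpu_memory(model_size_gb, max_kv_cache_gb):
    return model_size_gb + max_kv_cache_gb

print(recommended_gpu_memory(14, 11))    # Medical-LLM-7B: 14 GB + 11 GB = 25 GB (~25 GB in the table)
print(recommended_gpu_memory(131, 320))  # Medical-LLM-Medium: 131 GB + 320 GB = 451 GB (~452 GB in the table)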
🪄 Memory Optimization Tips
- Use smaller sequence lengths to reduce KV-cache memory
- Leverage tensor parallelism for large models
- Select an appropriate model based on your GPU resources (see the sketch after this list)
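As a quick way to act on the last tip, the sketch below (illustrative only, using the recommended GPU memory figures from the table above) lists which models fit within a given GPU memory budget:
# Recommended GPU memory (GB) per model, taken from the table above
RECOMMENDED_GPU_MEMORY_GB = {
    "Medical-LLM-7B": 25,
    "Medical-LLM-10B": 35,
    "Medical-LLM-14B": 40,
    "Medical-LLM-24B": 69,
    "Medical-LLM-Small": 58,
    "Medical-LLM-Medium": 452,
    "Medical-Reasoning-LLM-14B": 58,
    "Medical-Reasoning-LLM-32B": 222,
}

def models_that_fit(total_gpu_memory_gb):
    """Return the models whose recommended GPU memory fits within the given budget."""
    return [name for name, gb in RECOMMENDED_GPU_MEMORY_GB.items() if gb <= total_gpu_memory_gb]

print(models_that_fit(80))  # e.g. a single 80 GB GPU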
Model Interactions
Once deployed, the container exposes a RESTful API for model interactions.
Chat Completions
Use this endpoint for multi-turn conversational interactions (e.g., clinical assistants).
- Endpoint:
/v1/chat/completions
- Method: POST
- Example Request:
import requests

# Send the chat request to the locally deployed container (host/port per the -p mapping above)
url = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "Medical-LLM-7B",
    "messages": [
        {"role": "system", "content": "You are a professional medical assistant"},
        {"role": "user", "content": "Explain symptoms of chronic fatigue syndrome"}
    ],
    "temperature": 0.7,
    "max_tokens": 1024
}

response = requests.post(url, json=payload)
print(response.json())
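The path mirrors the OpenAI-style chat completions convention, so the assistant reply can usually be extracted as shown below. Treat the field layout as an assumption rather than a guarantee; the raw JSON printed above shows the actual schema returned by your deployment.
# Assumes an OpenAI-style response body (an assumption; verify against the printed JSON):
# {"choices": [{"message": {"role": "assistant", "content": "..."}}], ...}
reply = response.json()["choices"][0]["message"]["content"]
print(reply)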
Text Completions
Use this endpoint for single-turn prompts or generating long-form medical text.
- Endpoint:
/v1/completions
- Method: POST
- Example Request:
payload = {
    "model": "Medical-LLM-7B",
    "prompt": "Provide a detailed explanation of rheumatoid arthritis treatment",
    "temperature": 0.7,
    "max_tokens": 4096
}
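The request is sent the same way as in the chat example; a minimal sketch, assuming the container is reachable on localhost:8080 (per the -p mapping above) and requires no authentication:
import requests

# `payload` is the request body defined above
response = requests.post("http://localhost:8080/v1/completions", json=payload)
print(response.json())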