Spark NLP for Healthcare


Getting started

Spark NLP for Healthcare is a commercial extension of Spark NLP for clinical and biomedical text mining. If you don’t have a Spark NLP for Healthcare subscription yet, you can request a free trial by clicking the button below.

Try Free

Spark NLP for Healthcare provides healthcare-specific annotators, pipelines, models, and embeddings for:

  • Clinical entity recognition
  • Clinical entity linking
  • Entity normalization
  • Assertion status detection
  • De-identification
  • Relation extraction
  • Spell checking & correction

Note: If you are going to use any pretrained licensed NER model, you do not need to install the licensed library. As long as you have the AWS keys and license keys in your environment, you can use licensed NER models with the public Spark NLP library. For the other licensed pretrained models, such as AssertionDL, Deidentification, Entity Resolver, and Relation Extraction models, you also need to install Spark NLP Enterprise.
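For example, one common way to make the keys available is to export them as environment variables before starting Spark NLP. A minimal sketch; the placeholder values are hypothetical and should be replaced with the keys from your license:

```python
import os

# Hypothetical placeholder values -- replace them with the keys you
# received with your license. With these set, the public Spark NLP
# library can fetch licensed pretrained NER models from the private
# John Snow Labs repository.
os.environ["AWS_ACCESS_KEY_ID"] = "your-access-key-id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your-secret-access-key"
os.environ["SPARK_NLP_LICENSE"] = "your-license-key"
```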

The library offers access to several clinical and biomedical transformers: JSL-BERT-Clinical, BioBERT, ClinicalBERT, GloVe-Med, GloVe-ICD-O. It also includes over 50 pretrained healthcare models that can recognize the following entities (and many more):

  • Clinical: Signs, Symptoms, Treatments, Procedures, Tests, Labs, Sections
  • Drugs: Name, Dosage, Strength, Route, Duration, Frequency
  • Risk Factors: Smoking, Obesity, Diabetes, Hypertension, Substance Abuse
  • Anatomy: Organ, Subdivision, Cell, Structure, Organism, Tissue, Gene, Chemical
  • Demographics: Age, Gender, Height, Weight, Race, Ethnicity, Marital Status, Vital Signs
  • Sensitive Data: Patient Name, Address, Phone, Email, Dates, Providers, Identifiers

Install Spark NLP for Healthcare

You can install the Spark NLP for Healthcare package by using:

pip install -q spark-nlp-jsl==${version} --extra-index-url ${secret.code} --upgrade

{version} is the version part of the {secret.code}, i.e. {secret.code}.split('-')[0] (e.g. 2.6.0).
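In other words, everything before the first dash of the secret code is the version. A quick sketch (the secret code shown is a made-up example):

```python
# Hypothetical secret code of the form "<version>-<token>"
secret_code = "2.6.0-abc123"

# The version is the part of the secret code before the first dash
version = secret_code.split("-")[0]
print(version)  # 2.6.0
```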

The {secret.code} is a secret code that is only available to users with a valid or trial license. If you have not received it yet, please contact us.

Setup AWS-CLI Credentials for licensed pretrained models

Starting with Spark NLP for Healthcare version 2.4.2, you first need to set up your AWS credentials to be able to access the private repository for John Snow Labs pretrained models. You can do this setup via the Amazon AWS Command Line Interface (AWS CLI).

Instructions on how to install the AWS CLI are available at:

Installing the AWS CLI

Make sure you configure your credentials with aws configure following the instructions at:

Configuring the AWS CLI

Please substitute ACCESS_KEY and SECRET_KEY with the credentials you received from your Customer Owner (CO). If you need your credentials, please contact us.
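After running aws configure, the resulting ~/.aws/credentials file has the standard AWS CLI layout (placeholders shown; substitute your own keys):

```
[default]
aws_access_key_id = ACCESS_KEY
aws_secret_access_key = SECRET_KEY
```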

Start Spark NLP for Healthcare Session from Python

The following will initialize the Spark session in case you are running the Jupyter notebook directly. If you started the notebook using pyspark, this cell is simply ignored.

Initializing the Spark session takes some seconds (usually less than one minute), as the jar needs to be loaded from the server.

The {secret.code} is a secret string you should have received from your Customer Owner (CO). If you have not received it, please contact us.

You can either use our convenience function to start your Spark Session that will use standard configuration arguments:

import sparknlp_jsl
spark = sparknlp_jsl.start("{secret.code}")

Or use the SparkSession module for more flexibility:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP Enterprise") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "1000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.6") \
    .config("spark.jars", "${secret.code}/spark-nlp-jsl-${version}.jar") \
    .getOrCreate()

If you want to download the source files (jar and whl files) locally, you can follow the instructions here.

Spark NLP for Healthcare Cheat Sheet

# Install Spark NLP from PyPI
pip install spark-nlp==3.2.3

# Install Spark NLP for Healthcare
pip install spark-nlp-jsl==${version} --extra-index-url ${secret.code} --upgrade

# Load Spark NLP with Spark Shell
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3 --jars spark-nlp-jsl-${version}.jar

# Load Spark NLP with PySpark
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3  --jars spark-nlp-jsl-${version}.jar

# Load Spark NLP with Spark Submit
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3 --jars spark-nlp-jsl-${version}.jar

Install Spark NLP for Healthcare on Databricks

  1. Create a cluster if you don’t have one already
  2. On a new or existing cluster, you need to add the following to the Advanced Options -> Spark tab, in the Spark Config box:

     spark.kryoserializer.buffer.max 1000M
     spark.serializer org.apache.spark.serializer.KryoSerializer
     spark.driver.extraJavaOptions -Dspark.jsl.settings.pretrained.credentials.secret_access_key=xxx -Dspark.jsl.settings.pretrained.credentials.access_key_id=yyy
    • Please add the following to the Advanced Options -> Spark tab, in the Environment Variables box:
    • (OPTIONAL) If the environment variables used to set up the AWS access/secret keys conflict with the credential provider chain in Databricks, you may not be able to access other S3 buckets. To access both the JSL repositories (with the JSL AWS keys) and your own S3 bucket (with your own AWS keys), copy the following script to a DBFS folder, then go to the Databricks console (init scripts menu) and add it as an init script for your cluster:
     val script = """
     echo "******** Inject Spark NLP AWS Profile Credentials ******** "
     mkdir ~/.aws/
     cat << EOF > ~/.aws/credentials
     echo "******** End Inject Spark NLP AWS Profile Credentials  ******** "
  3. In Libraries tab inside your cluster you need to follow these steps:
    • Install New -> PyPI -> spark-nlp -> Install
    • Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp_2.12:${version} -> Install
    • Please add the following jars:
      • Install New -> Python Whl -> upload ${secret.code}/spark-nlp-jsl/spark_nlp_jsl-${version}-py3-none-any.whl
      • Install New -> Jar -> upload ${secret.code}/spark-nlp-jsl-${version}.jar
  4. Now you can attach your notebook to the cluster and use Spark NLP!

Install Spark NLP for Healthcare on GCP Dataproc

  1. Create a cluster, if you don’t have one already, as follows.

In the gcloud shell:

gcloud services enable \ \ \ \
gsutil mb -c standard -l ${REGION} gs://${BUCKET_NAME}

You can set image-version, master-machine-type, worker-machine-type, master-boot-disk-size, worker-boot-disk-size, and num-workers according to your needs. If you use an image-version earlier than 2.0, you should also add ANACONDA to optional-components. You should also enable the component gateway. As noted below, you should explicitly write JSL_SECRET and JSL_VERSION in the metadata parameter, inside the quotes. This starts the pip installation using the wheel file of licensed Spark NLP.

gcloud dataproc clusters create ${CLUSTER_NAME} \
  --region=${REGION} \
  --network=${NETWORK} \
  --zone=${ZONE} \
  --image-version=2.0 \
  --master-machine-type=n1-standard-4 \
  --worker-machine-type=n1-standard-2 \
  --master-boot-disk-size=128GB \
  --worker-boot-disk-size=128GB \
  --num-workers=2 \
  --bucket=${BUCKET_NAME} \
  --optional-components=JUPYTER \
  --enable-component-gateway \
  --metadata 'PIP_PACKAGES=google-cloud-bigquery google-cloud-storage spark-nlp-display' \
  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/python/
  2. On an existing cluster, you need to install the spark-nlp and spark-nlp-display packages from PyPI.

  3. Now, you can attach your notebook to the cluster and use Spark NLP by following the instructions below. The key part is how to start Spark NLP sessions using the Apache Hadoop YARN cluster manager.

3.1. Read the license file from the notebook using GCS.

3.2. Set the correct Java home path.

3.3. Use a start function to start the licensed Spark NLP session, such as the following:

from pyspark.sql import SparkSession

def start(secret):
    # version and jsl_version hold the public and licensed Spark NLP
    # versions, and must be defined before calling this function
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:" + version) \
        .config("spark.jars", secret + "/spark-nlp-jsl-" + jsl_version + ".jar")

    return builder.getOrCreate()

spark = start(SECRET)

As you can see, we did not set .master('local[*]') explicitly, letting YARN manage the cluster. Alternatively, you can set .master('yarn').

Google Colab Notebook

Google Colab is perhaps the easiest way to get started with spark-nlp. It requires no installation or setup other than having a Google account.

Run the following code in Google Colab notebook and start using spark-nlp right away.

The first thing you need to do is create a JSON file with the credentials and configuration in your local system.

{
  "PUBLIC_VERSION": "3.2.3",
  "JSL_VERSION": "{version}",
  "SECRET": "{version}-{secret.code}",
  "SPARK_NLP_LICENSE": "xxxxx",
  "AWS_ACCESS_KEY_ID": "yyyy",
  "AWS_SECRET_ACCESS_KEY": "zzzz"
}

Then run the following code to load the credentials you created before.

import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)
# This is only to setup PySpark and Spark NLP on Colab
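After loading, a common pattern is to export the loaded keys as environment variables so that both the setup script and the Spark session can read them. A sketch, assuming license_keys is the dict loaded above (example values shown for illustration):

```python
import os

# Example contents; in practice license_keys is the dict loaded from
# the uploaded JSON file above.
license_keys = {
    "SPARK_NLP_LICENSE": "xxxxx",
    "AWS_ACCESS_KEY_ID": "yyyy",
}

# Export every key/value pair as an environment variable
for key, value in license_keys.items():
    os.environ[key] = value
```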

This script comes with options to define the pyspark, spark-nlp, and spark-nlp-jsl versions:

# -p is for pyspark
# -s is for spark-nlp
# by default they are set to the latest

Spark NLP quick start on Google Colab is a live demo on Google Colab that performs named entity recognition for healthcare.
