Utilities for Databricks

 

John Snow Labs NLP Models on Databricks Marketplace

This section demonstrates how to obtain an AI model from John Snow Labs on the Databricks Marketplace and deploy it for inference. Let’s assume a data scientist at a healthcare company wants to build an intelligent application that helps detect cancer genetics in patient records and intends to use an NLP model, exposed through an API service, to analyze the available clinical notes at scale.

Step 1: Navigate to the Databricks Marketplace

Access the Marketplace: Open the Databricks Marketplace homepage.

Search for Models: Use the search functionality to find specific AI models. For our purpose, enter a search term such as “cancer” in the search box and filter the results by product type using the first filter dropdown.

DatabricksMarketplaceCancerSearch.png

Step 2: Select the AI Model to Deploy

Identify the Listing: Locate the “Detect Cancer Genetics” listing on the Marketplace and click on the model card. This opens the model page on the Databricks Marketplace, which provides extended information about the model as well as usage instructions.

Understand the Model: This AI model identifies and extracts biological entities such as genes, anatomical systems, cellular structures, and chemicals from clinical texts, research articles, and pathology reports. It captures intricate biological and oncological concepts, facilitating targeted therapeutic research, early diagnosis, and personalized treatment plans.

License: Note that this is a paid model and requires a license from John Snow Labs.

Step 3: Obtain Access to the Model

Instant Access: On the model page, click the ‘Get Access’ button to integrate the model’s metadata into the Unity Catalog.

Open the Listing: Once access is granted, click on ‘Open in Listing’.

Locate in Unity Catalog: The models should now be available in the Unity Catalog and ready for deployment to a model serving endpoint.

Step 4: Obtain PAYG License for John Snow Labs Models

Login to your account: Access my.JohnSnowLabs.com and log in to your account. If you don’t have an account, create one.

Get a PAYG License: Go to the Get License page. Switch to the PAYG Subscription tab and provide your credit card details. Carefully review the End User License Agreement and the Terms and Conditions documents. If you agree, click on the Create Subscription button.

PAYGSubscription.png

Copy License Key: Once the process is complete, your Pay-As-You-Go license will be listed on the My Subscriptions page. Visit that page and copy the PAYG license key by clicking the corresponding copy icon in the License Key column.

PAYGLicenseDownload.png

Go to your Databricks notebook and paste the license key into the JSL-License widget at the top of the notebook (see Step 5).
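For reference, the widget mechanism itself is plain Databricks dbutils. Below is a minimal, hypothetical sketch of how such a license widget can be defined and read inside a notebook; the embedded notebook already defines the widget for you and its internals may differ.

# Hypothetical sketch only; the provided notebook already creates this widget.
dbutils.widgets.text("JSL-License", "")            # renders the text box at the top of the notebook
jsl_license = dbutils.widgets.get("JSL-License")   # reads the pasted PAYG license key as a string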

Step 5: Use the Embedded Notebook to Deploy the Model

Import the Notebook: Import the provided embedded notebook from the listing.

CancerGeneticsNotebook.png

Configure the Notebook: Fill in the widgets with workspace-specific values and the PAYG license that you can copy from my.JohnSnowLabs.com (see Step 4).

CancerGeneticsRunNotebook.png

Run the Notebook: Execute the notebook to initiate the process.

Step 6: Model Endpoint Creation and Querying

Endpoint Creation: The notebook will create a serving endpoint, visible in the Databricks serving UI.

Sample Text Query: Use a sample text to query the newly created endpoint (a minimal query sketch follows below).

Observing Results: The results will showcase the identified and classified biological entities.
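As a rough sketch, assuming the embedded notebook created a standard Databricks model serving endpoint, you can query it over REST. The host, token, endpoint name, and input column name below are placeholders and may differ from what the notebook actually creates.

import requests

# Placeholders; substitute your workspace URL, access token, and the endpoint name from the notebook.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
DATABRICKS_TOKEN = "<personal-access-token>"
ENDPOINT_NAME = "<endpoint-name-created-by-the-notebook>"

payload = {
    "dataframe_records": [
        {"text": "The patient carries a BRCA1 mutation associated with hereditary breast cancer."}
    ]
}

response = requests.post(
    f"{DATABRICKS_HOST}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=payload,
)
print(response.json())  # identified and classified biological entities per input row

Alternatively, nlp.query_endpoint (see Manual Endpoint Creation below) wraps this request for you.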

Manual Endpoint Creation

You can deploy and query John Snow Labs models as Databricks Model Serving endpoints with one line of code each.
Data is passed to the endpoint’s predict() function and predictions are shaped accordingly.
You must create endpoints from a Databricks cluster created by nlp.install().

See Cluster Creation Notebook and Databricks Endpoint Tutorial Notebook

# You need `mlflow_by_johnsnowlabs` installed until the next mlflow release
! pip install mlflow_by_johnsnowlabs

from johnsnowlabs import nlp

# Register 'bert' in MLflow and deploy it as a serving endpoint named bert_ENDPOINT
nlp.deploy_endpoint('bert')

# Query the endpoint; the result comes back as a pandas DataFrame
nlp.query_endpoint('bert_ENDPOINT', 'My String to embed')

nlp.deploy_endpoint will register an MLflow model in your registry and deploy an endpoint with a JSL license. It has the following parameters (a usage sketch follows the table):

Parameter Description
model The model to be deployed as an endpoint; it is converted into an NLUPipeline. Supported types are: a string reference to an NLU pipeline name like ‘bert’, NLUPipeline, List[Annotator], Pipeline, LightPipeline, PretrainedPipeline, PipelineModel. For an NLU reference, the endpoint name is auto-generated as <nlu_ref>_ENDPOINT, e.g. bert_ENDPOINT; ‘.’ is replaced with ‘_’ in the NLU reference to form the endpoint name
endpoint_name Name for the deployed endpoint. Optional if using NLU model reference but mandatory for custom pipelines.
re_create_endpoint if False, endpoint creation is skipped if one already exists. If True, an existing endpoint is deleted before a new one is created
re_create_model if False, model creation is skipped if one already exists. If True, the model is re-logged, bumping the current version by 2
workload_size one of Small, Medium, Large.
gpu True/False to load GPU-optimized or CPU-optimized jars in the container. Must use a GPU-based workload_type if gpu=True
new_run if True, mlflow will start a new run before logging the model
block_until_deployed if True, this function will block until the endpoint is created
workload_type CPU by default, use GPU_SMALL to spawn a GPU based endpoint instead. Check Databricks docs for alternative values
db_host the Databricks host URL. If not specified, the DATABRICKS_HOST environment variable is used
db_token the Databricks access token. If not specified, the DATABRICKS_TOKEN environment variable is used
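As an illustrative sketch of combining these parameters; the host, token, and sizing values below are placeholders, not recommendations:

from johnsnowlabs import nlp

# Deploy the 'bert' NLU reference; the endpoint name is auto-generated as bert_ENDPOINT.
nlp.deploy_endpoint(
    'bert',
    re_create_endpoint=False,     # keep an existing endpoint instead of re-creating it
    re_create_model=False,        # keep the already-logged model version
    workload_size='Small',        # one of Small, Medium, Large
    gpu=False,                    # CPU-optimized jars; set True together with a GPU workload_type
    block_until_deployed=True,    # return only once the endpoint is up
    db_host='https://<your-workspace>.cloud.databricks.com',
    db_token='<databricks-access-token>',
)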

nlp.query_endpoint translates your query to JSON, sends it to the endpoint and returns the result as a pandas DataFrame. It has the following parameters, which are forwarded to the model.predict() call inside the endpoint (a usage sketch follows the table):

Parameter Description
endpoint_name Name of the endpoint to query
query str, list of strings, or raw JSON string. If raw JSON, is_json_query must be True
is_json_query if True, query is treated as a raw JSON string
output_level One of token, chunk, sentence, relation, document to shape outputs
positions Set True/False to include or exclude character index position of predictions
metadata Set True/False to include additional metadata
drop_irrelevant_cols Set True/False to drop irrelevant columns
get_embeddings Set True/False to include embedding or not
keep_stranger_features Set True/False to return columns not named “text”, “image” or “file_type” from your input data
multithread Set True/False to use multithreading for inference. Auto-inferred if not set
db_host the Databricks host URL. If not specified, the DATABRICKS_HOST environment variable is used
db_token the Databricks access token. If not specified, the DATABRICKS_TOKEN environment variable is used
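A corresponding query sketch using some of these parameters; the endpoint name and credentials are placeholders:

from johnsnowlabs import nlp

# Query an existing endpoint; the result is returned as a pandas DataFrame.
df = nlp.query_endpoint(
    'bert_ENDPOINT',
    ['First string to embed', 'Second string to embed'],  # str or list of strings
    output_level='document',   # one of token, chunk, sentence, relation, document
    positions=False,           # drop character index positions
    metadata=False,            # drop extra metadata columns
    get_embeddings=True,       # keep embedding columns
    db_host='https://<your-workspace>.cloud.databricks.com',
    db_token='<databricks-access-token>',
)
print(df)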

nlp.query_endpoint and nlp.deploy_endpoint check the following mandatory environment variables to resolve wheels for endpoints (a sketch of setting them manually follows the table):

Env Var Name Description
HEALTHCARE_SECRET Automatically set on your cluster if you run nlp.install()
VISUAL_SECRET Automatically set on your cluster if you run nlp.install(..., visual=True). You can only spawn visual endpoints from a cluster created by nlp.install(..., visual=True)
JOHNSNOWLABS_LICENSE_JSON JSON content of the John Snow Labs license to use for endpoints. This should be an airgap license
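On a cluster created by nlp.install() these variables are already present. As a sketch, if you need to set them manually in another environment, it could look like this; all values and paths are placeholders:

import os

# Placeholders only; use the secrets and license issued to you by John Snow Labs.
os.environ['HEALTHCARE_SECRET'] = '<healthcare-secret>'
# os.environ['VISUAL_SECRET'] = '<visual-secret>'   # only needed for visual endpoints
with open('/path/to/airgap_license.json') as f:
    os.environ['JOHNSNOWLABS_LICENSE_JSON'] = f.read()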

Submit a Task with nlp.run_in_databricks

Easily run Python code in a Databricks cluster using the John Snow Labs library. The fastest way to test this out is to create a cluster with nlp.install() and then use nlp.run_in_databricks to start a task. You can parameterize your jobs according to the Databricks job task API via the parameters argument (see Parameterizing your Databricks Jobs below).


# Execute a raw Python string as a script on Databricks
from johnsnowlabs import nlp

script = """
import nlu
print(nlu.load('sentiment').predict('That was easy!'))"""

# Create a licensed cluster and capture its id (my_license, my_host, my_token are placeholders)
cluster_id = nlp.install(json_license_path=my_license, databricks_host=my_host, databricks_token=my_token)
nlp.run_in_databricks(script,
                      databricks_cluster_id=cluster_id,
                      databricks_host=my_host,
                      databricks_token=my_token,
                      run_name='Python Code String Example')

This will start a job run, which you can view in the Workflows tab.

databricks_cluster_submit_raw.png

And after a while you can see the results

databricks_cluster_submit_raw.png

Run a local Python Notebook in Databricks

Provide the path to a notebook on your local machine; it will be copied to HDFS and executed by the Databricks cluster. You also need to provide a destination path in your Workspace where the notebook will be copied to and to which you have write access. A common pattern that should work is /Users/<your_databricks_email@somewhere.com>/test.ipynb.

local_nb_path = "path/to/my/notebook.ipynb"
remote_dst_path = "/Users/christian@johnsnowlabs.com/test.ipynb"

# notebook.ipynb will run on databricks, url will be printed
nlp.run_in_databricks(
  local_nb_path,
  databricks_host=host,
  databricks_token=token,
  run_name="Notebook Test",
  dst_path=remote_dst_path,
)

This could be your input notebook

databricks_cluster_submit_notebook.png

A URL where you can monitor the run will be printed, which will look like this

databricks_cluster_submit_notebook_result.png

Run a Python Function in Databricks

Define a function; it will be written to a local file, copied to HDFS, and executed by the Databricks cluster.

def my_function():
    import nlu
    medical_text = """A 28-year-old female with a history of gestational 
    diabetes presented with a one-week history of polyuria ,
     polydipsia , poor appetite , and vomiting ."""
    df = nlu.load('en.med_ner.diseases').predict(medical_text)
    for c in df.columns: print(df[c])

# my_function will run on databricks
nlp.run_in_databricks(my_function,
                      databricks_cluster_id=cluster_id,
                      databricks_host=my_host,
                      databricks_token=my_token,
                      run_name='Function test')

This example will print all columns of the resulting dataframe which contains medical NER predictions.

databricks_cluster_submit_raw.png

Run a Raw Python Code String in Databricks

Provide a string that must be valid Python syntax.
It will be written to a local file, copied to HDFS, and executed by the Databricks cluster.

script = """
import nlu
print(nlu.load('sentiment').predict('That was easy!'))"""

nlp.run_in_databricks(script,
                      databricks_cluster_id=cluster_id,
                      databricks_host=my_host,
                      databricks_token=my_token,
                      run_name='Python Code String Example')

Run a Python Script in Databricks

Provide the path to a script on your machine. It will be copied to the Databricks HDFS and executed as a task.

nlp.run_in_databricks('path/to/my/script.py',
                      databricks_cluster_id=cluster_id,
                      databricks_host=my_host,
                      databricks_token=my_token,
                      run_name='Script test ')

Run a Python Module in Databricks

Provide a module accessible to the John Snow Labs library. Its contents will be written to a local file, copied to HDFS, and executed by the Databricks cluster.

import johnsnowlabs.auto_install.health_checks.nlp_test as nlp_test
nlp.run_in_databricks(nlp_test,
                      databricks_cluster_id=cluster_id,
                      databricks_host=my_host,
                      databricks_token=my_token,
                      run_name='nlp_test')

Parameterizing your Databricks Jobs

You can parameterize run_in_databricks with parameters formatted according to the Databricks job task API.

Parameterizing a Notebook run in Databricks

In your notebook you can use dbutils.widgets.text("my_parameter", "my_default_value") and other widget types to parameterize your notebook. You just pass a dictionary to run_in_databricks where key=parameter_name and value=parameter_value. See this example parameterized DB-Notebook for a full example, which you can use as the nb_path parameter. A notebook-side sketch follows the example below.

nb_path = "parameterized_nb_example.ipynb"
dst_path = "/Users/christian@johnsnowlabs.com/test.ipynb"
nlp.run_in_databricks(
    nb_path,
    databricks_cluster_id=cluster_id,
    databricks_host=host,
    databricks_token=token,
    run_name="Parameterized Notebook",
    dst_path=dst_path,
    parameters={"input_text": "I love peanut butter", "model_name": "sentiment"},
)
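On the notebook side, here is a sketch of how these parameters can be read back with dbutils.widgets, assuming the widget names match the dictionary keys passed above:

# Notebook-side sketch: widget names must match the keys of the parameters dict.
dbutils.widgets.text("input_text", "That was easy!")
dbutils.widgets.text("model_name", "sentiment")

import nlu
model_name = dbutils.widgets.get("model_name")
input_text = dbutils.widgets.get("input_text")
print(nlu.load(model_name).predict(input_text))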

Parameterizing a Python Script run in Databricks

Simply pass a list of parameters to the run_in_databricks function. They will be passed to the script as sys.argv arguments.



script = """
import sys
print(f"First argument: {sys.argv[1]}, Second argument: {sys.argv[2]}")
"""
arg1 = "My first arg"
arg2 = "My second arg"
nlp.run_in_databricks(
    script,
    databricks_cluster_id=cluster_id,
    databricks_host=host,
    databricks_token=token,
    run_name="Parameterized Script",
    parameters=[arg1, arg2],
)

