Spark OCR is built on top of Apache Spark. Currently, it supports the 3.0.*, 3.1.*, 2.4.*, and 2.3.* versions of Spark.

It is recommended to have basic knowledge of the framework and a working environment before using Spark OCR. Refer to the Spark documentation to get started with Spark.

Spark OCR requires:

  • Scala 2.11 or 2.12, depending on the Spark version
  • Python 3.x (if using PySpark)

Before you start, make sure that you have:

  • the Spark OCR jar file (or a secret for downloading it)
  • the Spark OCR Python wheel file
  • a license key

If you don’t have a valid subscription yet and want to try out the Spark OCR library, press the button below:

Try Free

Spark OCR from Scala

You can start a Spark REPL with Scala by running spark-shell in your terminal, including the com.johnsnowlabs.nlp:spark-ocr_2.11:1.0.0 package:

spark-shell --jars ####

The #### is a secret URL available only to licensed users. If you have purchased a license but did not receive it, please contact us.

Start Spark OCR Session

The following code initializes the Spark session if you have run the Jupyter notebook directly. If you started the notebook using pyspark, this cell can simply be skipped.

Initializing the Spark session takes a few seconds (usually less than a minute), because the jar needs to be downloaded from the server.

The #### in .config("spark.jars", "####") is a secret URL; if you have not received it, please contact us.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
    .builder()
    .appName("Spark OCR")
    .config("spark.driver.memory", "4G")
    .config("spark.driver.maxResultSize", "2G")
    .config("spark.jars", "####")
    .getOrCreate()

Spark OCR from Python

Install Python package

Install the Python package using pip:

pip install spark-ocr==1.8.0.spark24 --extra-index-url #### --ignore-installed

The #### is a secret URL available only to licensed users. If you have purchased a license but did not receive it, please contact us.

Start Spark OCR Session


from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark OCR") \
    .master("local[*]") \
    .config("spark.driver.memory", "4G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars", "####") \
    .getOrCreate()

Using the start function

Another way to initialize a SparkSession with Spark OCR is to use the start function in Python.

The start function has the following parameters:

| Param name   | Type           | Default   | Description                                                                        |
| ------------ | -------------- | --------- | ---------------------------------------------------------------------------------- |
| secret       | string         | None      | Secret for downloading the Spark OCR jar file                                      |
| jar_path     | string         | None      | Path to the jar file, in case you need to run the Spark session offline            |
| extra_conf   | SparkConf      | None      | Extra Spark configuration                                                          |
| master_url   | string         | local[*]  | Spark master URL                                                                   |
| nlp_version  | string         | None      | Spark NLP version, for adding its jar to the session                               |
| nlp_internal | boolean/string | None      | Run the Spark session with Spark NLP Internal if set to True, or specify a version |
| nlp_secret   | string         | None      | Secret for getting the Spark NLP Internal jar                                      |
| keys_file    | string         | keys.json | Name of the JSON file with the license, secret, and AWS keys                       |
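Since start reads credentials from keys.json by default, that file can be generated with a few lines of Python. This is only a sketch: the field names below are illustrative placeholders, not the exact schema — use the key names supplied with your license.

```python
import json

# Illustrative field names -- replace with the exact keys from your license email
keys = {
    "license": "####",
    "secret": "####",
    "aws_access_key_id": "####",
    "aws_secret_access_key": "####",
}

# start() looks for this file by default (keys_file="keys.json")
with open("keys.json", "w") as f:
    json.dump(keys, f, indent=2)
```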

To start a Spark session with Spark NLP, specify its version in the nlp_version parameter:


from sparkocr import start
spark = start(secret=secret, nlp_version="2.4.4")
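If the jar has already been downloaded, the jar_path and extra_conf parameters allow the session to start offline with additional Spark settings. A sketch, assuming a locally stored jar (the path below is a placeholder):

```python
from pyspark import SparkConf
from sparkocr import start

# Extra Spark settings passed through to the session via extra_conf
conf = SparkConf() \
    .set("spark.driver.memory", "4G") \
    .set("spark.driver.maxResultSize", "2G")

# jar_path points at a locally downloaded Spark OCR jar (placeholder path),
# so no download secret is needed and the session starts without network access
spark = start(jar_path="/path/to/spark-ocr.jar", extra_conf=conf)
```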


The installation process on Databricks includes the following steps:

  • Installing the Spark OCR library on Databricks and attaching it to the cluster
  • Doing the same for the Spark OCR Python wheel file
  • Adding the license key
  • Adding a cluster init script to install dependencies
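The init script itself is a plain bash file stored where the cluster can read it (for example, on DBFS). A minimal sketch — the package names below are hypothetical examples of native dependencies, not a verified list; install whatever your pipeline actually needs:

```bash
#!/bin/bash
# Hypothetical init script: install native libraries used by the OCR pipeline.
# The packages below are placeholders -- substitute the real dependency list.
sudo apt-get update
sudo apt-get install -y libsm6 libxext6 libxrender1
```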

See the Databricks Python helpers to simplify installing the init script.

Example notebooks:
