Spark OCR is built on top of Apache Spark. Currently, it supports the 3.0.*, 2.4.* and 2.3.* versions of Spark.
It is recommended to have basic knowledge of the framework and a working environment before using Spark OCR. Refer to the Spark documentation to get started with Spark.
Spark OCR requires:
- Scala 2.11 or 2.12, depending on the Spark version
- Python 3.7+ (if using PySpark)
Before you start, make sure that you have:
- Spark OCR jar file (or a secret URL to download it)
- Spark OCR python wheel file
- License key
If you don't have a valid subscription yet and want to test out the Spark OCR library, press the button below:
Spark OCR from Scala
You can start a Spark REPL with Scala by running spark-shell in your terminal, including the com.johnsnowlabs.nlp:spark-ocr_2.11:1.0.0 package:
spark-shell --jars ####
The #### is a secret URL available only to licensed users. If you have purchased a license but did not receive it, please contact us at info@johnsnowlabs.com.
Start Spark OCR Session
The following code initializes the Spark session if you run the Jupyter notebook directly. If you started the notebook using pyspark, this cell is simply ignored.
Initializing the Spark session takes a few seconds (usually less than 1 minute), as the jar needs to be loaded from the server.
The #### in .config("spark.jars", "####") is a secret URL; if you have not received it, please contact us at info@johnsnowlabs.com.
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark OCR")
  .master("local[*]")
  .config("spark.driver.memory", "4G")
  .config("spark.driver.maxResultSize", "2G")
  .config("spark.jars", "####")
  .getOrCreate()
Spark OCR from Python
Install Python package
Install the python package using pip:
pip install spark-ocr==1.8.0.spark24 --extra-index-url #### --ignore-installed
The #### is a secret URL available only to licensed users. If you have purchased a license but did not receive it, please contact us at info@johnsnowlabs.com.
Start Spark OCR Session
Manually
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark OCR") \
    .master("local[*]") \
    .config("spark.driver.memory", "4G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars", "https://pypi.johnsnowlabs.com/####") \
    .getOrCreate()
Using Start function
Another way to initialize a SparkSession with Spark OCR is to use the start
function in Python.
The start function has the following params:
Param name | Type | Default | Description |
---|---|---|---|
secret | string | None | Secret for downloading the Spark OCR jar file |
jar_path | string | None | Path to the jar file if you need to run the Spark session offline |
extra_conf | SparkConf | None | Extra Spark configuration |
master_url | string | local[*] | Spark master URL |
nlp_version | string | None | Spark NLP version whose jar to add to the session |
nlp_internal | boolean/string | None | Run the Spark session with Spark NLP Internal if set to 'True', or specify a version |
nlp_secret | string | None | Secret for getting the Spark NLP Internal jar |
keys_file | string | keys.json | Name of the json file with license, secret and AWS keys |
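The keys_file param points to a JSON file holding the license, secret, and AWS keys. A minimal sketch of creating such a file with the standard library; the field names below are placeholders, not the official schema, so use the names from your John Snow Labs license email:

```python
import json

# Hypothetical keys.json contents; replace the placeholder field names and
# values with the ones provided with your license.
keys = {
    "license": "YOUR_LICENSE_KEY",
    "secret": "YOUR_SECRET",
    "aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
}

# Write the file next to the notebook so start() can pick it up by name.
with open("keys.json", "w") as f:
    json.dump(keys, f, indent=2)
```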
To start a Spark session with Spark NLP, specify its version in the nlp_version
param.
Example:
from sparkocr import start
spark = start(secret=secret, nlp_version="2.4.4")
Databricks
The installation process for Databricks includes the following steps:
- Installing the Spark OCR library to Databricks and attaching it to the cluster
- Doing the same for the Spark OCR python wheel file
- Adding the license key
- Adding a cluster init script to install dependencies
Please see the Databricks python helpers to simplify installing the init script.
Example notebooks: