Spark OCR is built on top of Apache Spark. Currently, it supports the 3.0.*, 2.4.* and 2.3.* versions of Spark.
It is recommended to have basic knowledge of the framework and a working environment before using Spark OCR. Refer to the Spark documentation to get started with Spark.
Spark OCR requires:
- Scala 2.11 or 2.12, depending on the Spark version
- Python 3.7+ (if using PySpark)
Before you start, make sure that you have:
- Spark OCR jar file (or a secret URL to download it)
- Spark OCR python wheel file
- License key
If you don't have a valid subscription yet and want to test out the Spark OCR library, press the button below:
Spark OCR from Scala
You can start a Spark REPL with Scala by running spark-shell in your terminal, including the com.johnsnowlabs.nlp:spark-ocr_2.11:1.0.0 package:
spark-shell --jars ####
The #### is a secret URL available only to licensed users. If you have purchased a license but did not receive it, please contact us at info@johnsnowlabs.com.
Start Spark OCR Session
The following code initializes the Spark session if you run the Jupyter notebook directly. If you started the notebook using pyspark, this cell is simply ignored.
Initializing the Spark session takes a few seconds (usually less than 1 minute), as the jar needs to be loaded from the server.
The #### in .config("spark.jars", "####") is a secret URL; if you have not received it, please contact us at info@johnsnowlabs.com.
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark OCR")
  .master("local[*]")
  .config("spark.driver.memory", "4G")
  .config("spark.driver.maxResultSize", "2G")
  .config("spark.jars", "####")
  .getOrCreate()
Spark OCR from Python
Install Python package
Install the python package using pip:
pip install spark-ocr==1.8.0.spark24 --extra-index-url #### --ignore-installed
The #### is a secret URL available only to licensed users. If you have purchased a license but did not receive it, please contact us at info@johnsnowlabs.com.
Start Spark OCR Session
Manually
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark OCR") \
    .master("local[*]") \
    .config("spark.driver.memory", "4G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars", "https://pypi.johnsnowlabs.com/####") \
    .getOrCreate()
Using Start function
Another way to initialize a SparkSession with Spark OCR is to use the start
function in Python.
The start function has the following params:
Param name | Type | Default | Description |
---|---|---|---|
secret | string | None | Secret for downloading the Spark OCR jar file |
jar_path | string | None | Path to the jar file if you need to run the Spark session offline |
extra_conf | SparkConf | None | Extra Spark configuration |
master_url | string | local[*] | Spark master URL |
nlp_version | string | None | Spark NLP version whose jar to add to the session |
nlp_internal | boolean/string | None | Run the Spark session with Spark NLP Internal if set to 'True', or specify a version |
nlp_secret | string | None | Secret for getting the Spark NLP Internal jar |
keys_file | string | keys.json | Name of the json file with license, secret and AWS keys |
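The keys_file param points to a JSON file holding the license, secret, and AWS keys. A minimal sketch of creating such a file with the standard library; the field names below are placeholders, not the official schema, so use the names from your John Snow Labs license email:

```python
import json

# Hypothetical keys.json contents; replace the placeholder field names and
# values with the ones provided with your license.
keys = {
    "license": "YOUR_LICENSE_KEY",
    "secret": "YOUR_SECRET",
    "aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
    "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
}

# Write the file next to the notebook so start() can pick it up by name.
with open("keys.json", "w") as f:
    json.dump(keys, f, indent=2)
```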
To start a Spark session with Spark NLP, specify its version in the nlp_version
param.
Example:
from sparkocr import start
spark = start(secret=secret, nlp_version="2.4.4")
Databricks
The installation process for Databricks includes the following steps:
- Installing the Spark OCR library to Databricks and attaching it to the cluster
- Doing the same for the Spark OCR python wheel file
- Adding the license key
- Adding a cluster init script to install dependencies
Please see the Databricks python helpers to simplify installing the init script.
Example notebooks: