Spark OCR


Spark NLP comes with an OCR module that can read both PDF files and scanned images (image recognition requires Tesseract 4.x+).


Spark Packages

To include the OCR submodule in Spark NLP, you will need to add the following to your startup command:

--packages JohnSnowLabs:spark-nlp:2.2.1,com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.2.1
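For example, you could wrap this in a small launcher script so both package coordinates stay on the same version (a sketch; the same --packages value also works for pyspark and spark-submit):

```shell
# Sketch: build the --packages value once and reuse it (version 2.2.1 assumed).
SPARK_NLP_VERSION="2.2.1"
PACKAGES="JohnSnowLabs:spark-nlp:${SPARK_NLP_VERSION},com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:${SPARK_NLP_VERSION}"

# Print the full command; run it directly once Spark is on your PATH.
echo "spark-shell --packages ${PACKAGES}"
```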

Spark Session

This way you will download the extra dependencies needed by our OCR submodule. The equivalent, when creating a SparkSession manually (shown here in Scala), is:

val spark = SparkSession.builder()
    .appName("Spark NLP with OCR")
    .master("local[*]") // adjust the master URL for your cluster
    .config("spark.driver.memory", "6g")
    .config("spark.executor.memory", "6g")
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.2.1,com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.2.1")
    .getOrCreate()

Compiled JARs

However, you can also compile the JAR yourself by cloning the spark-nlp repository and running one of these commands:

  • Assembling a fat JAR with all dependencies
sbt ocr/assembly
  • Packaging the project
sbt ocr/package

Installing Tesseract

As mentioned above, if you are dealing with scanned images instead of text-selectable PDF files, you need to install Tesseract 4.x+ on all the nodes in your cluster. Here is how you can install it on Ubuntu/Debian:

apt-get install tesseract-ocr
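You can check which major version was actually installed before deciding whether to build from source (a sketch; assumes tesseract is on the PATH once installed):

```shell
# Report the installed Tesseract version, if any (sketch).
if command -v tesseract >/dev/null 2>&1; then
  ver=$(tesseract --version 2>&1 | head -n 1 | awk '{print $2}')
  major=${ver%%.*}
  echo "Tesseract ${ver} installed (major version ${major})"
else
  echo "Tesseract is not installed"
fi
```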

In Databricks, this command may install Tesseract 3.x instead of version 4.x.

You can run the following init script to build and install Tesseract 4.x from source on your Databricks cluster:

#!/bin/bash
# Build tools and image libraries required to compile Leptonica and Tesseract
sudo apt-get update
sudo apt-get install -y g++ # or clang++
sudo apt-get install -y autoconf automake libtool
sudo apt-get install -y pkg-config
sudo apt-get install -y libpng-dev
sudo apt-get install -y libjpeg8-dev
sudo apt-get install -y libtiff5-dev
sudo apt-get install -y zlib1g-dev

# Build and install Leptonica (required by Tesseract)
wget http://www.leptonica.org/source/leptonica-1.74.4.tar.gz
tar xvf leptonica-1.74.4.tar.gz
cd leptonica-1.74.4
./configure
make
sudo make install
cd ..

# Build and install Tesseract 4.1 from source
git clone --single-branch --branch 4.1 https://github.com/tesseract-ocr/tesseract.git
cd tesseract
./autogen.sh
./configure
make
sudo make install
sudo ldconfig

# Verify the installed version
tesseract --version

Quick start

Let’s read a PDF file:

import com.johnsnowlabs.nlp.util.io.OcrHelper

val ocrHelper = new OcrHelper()

// If you run this locally you can use file:///, or hdfs:/// if the files are hosted in Hadoop
val dataset = ocrHelper.createDataset(spark, "/tmp/sample_article.pdf")
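The resulting Dataset behaves like any Spark DataFrame and can be fed into a Spark NLP pipeline. A minimal sketch, assuming the OCR output exposes a "text" column as in the Spark NLP examples:

```scala
import com.johnsnowlabs.nlp.DocumentAssembler

// Inspect the extracted text (the column name "text" is an assumption)
dataset.select("text").show(truncate = false)

// Feed the OCR output into a standard Spark NLP pipeline stage
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val documents = documentAssembler.transform(dataset)
```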

If you are trying to extract text from scanned images embedded in PDF files, keep in mind to use these configs:
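The setters below are a sketch of such a configuration based on the OcrHelper API of this Spark NLP version; treat the method names and values as assumptions and check the API reference for your release:

```scala
// Prefer image-based OCR over the PDF text layer (method name assumed)
ocrHelper.setPreferredMethod("image")

// Fall back to the other method when the preferred one yields too little text
ocrHelper.setFallbackMethod(true)

// Minimum amount of extracted text before triggering the fallback (assumed)
ocrHelper.setMinSizeBeforeFallback(10)
```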
