Spark OCR (Licensed)

 

Spark OCR provides a set of Spark ML transformers/estimators that help users create and use OCR pipelines. It is built on top of Apache Spark.

OCR Pipelines

Using Spark OCR it is possible to build pipelines for text recognition from:

  • scanned image(s) (png, tiff, jpeg …)
  • selectable PDFs (that contain a text layout)
  • non-selectable PDFs (that contain scanned text as an image)

It contains the following tools:

  • PDF processing transformers, which extract text and images from PDF files
  • Image pre-processing transformers (scaling, binarization, skew correction, etc.)
  • Analyzers and transformers for splitting images into regions
  • Character recognition using the ImageToText estimator

More details on the transformers/estimators can be found in the OCR Pipeline Components section.

Spark OCR Workshop

If you prefer learning by example, check this repository:

Spark OCR Workshop

It is full of fresh examples.

Below, you can follow a more theoretical and thorough quick start guide.

Quickstart Examples

Images

The following code example creates an OCR pipeline for processing image(s). The image file(s) can contain complex layouts such as columns, tables, and embedded images.


import org.apache.spark.ml.Pipeline

import com.johnsnowlabs.ocr.transformers._

val imagePath = "path to image files"

// Read image files as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)

// Transform binary content to image
val binaryToImage = new BinaryToImage()
  .setInputCol("content")
  .setOutputCol("image")

// OCR
val ocr = new ImageToText()
  .setInputCol("image")
  .setOutputCol("text")

// Define Pipeline
val pipeline = new Pipeline()
pipeline.setStages(Array(
  binaryToImage,
  ocr
))

val modelPipeline = pipeline.fit(spark.emptyDataFrame)

val data = modelPipeline.transform(df)

data.show()

from pyspark.ml import PipelineModel

from sparkocr.transformers import *

imagePath = "path to image files"

# Read image files as binary file
df = spark.read \
  .format("binaryFile") \
  .load(imagePath)

# Transform binary content to image
binaryToImage = BinaryToImage() \
  .setInputCol("content") \
  .setOutputCol("image")

# OCR
ocr = ImageToText() \
  .setInputCol("image") \
  .setOutputCol("text")

# Define Pipeline
pipeline = PipelineModel(stages=[
  binaryToImage,
  ocr
])

data = pipeline.transform(df)

data.show()



Scanned PDF files

The next example shows an OCR pipeline for processing PDF files that contain image data. In this case, the PdfToImage transformer is used to convert each PDF file to a set of images.


import org.apache.spark.ml.Pipeline

import com.johnsnowlabs.ocr.transformers._

val imagePath = "path to pdf files"

// Read pdf files as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)

// Transform PDF file to the image
val pdfToImage = new PdfToImage()
  .setInputCol("content")
  .setOutputCol("image")

// OCR
val ocr = new ImageToText()
  .setInputCol("image")
  .setOutputCol("text")

// Define pipeline
val pipeline = new Pipeline()
pipeline.setStages(Array(
  pdfToImage,
  ocr
))

val modelPipeline = pipeline.fit(spark.emptyDataFrame)

val data = modelPipeline.transform(df)

data.show()

from pyspark.ml import PipelineModel

from sparkocr.transformers import *

imagePath = "path to pdf files"

# Read pdf files as binary file
df = spark.read \
  .format("binaryFile") \
  .load(imagePath)

# Transform PDF file to the image
pdfToImage = PdfToImage() \
  .setInputCol("content") \
  .setOutputCol("image")

# OCR
ocr = ImageToText() \
  .setInputCol("image") \
  .setOutputCol("text")

# Define pipeline
pipeline = PipelineModel(stages=[
  pdfToImage,
  ocr
])

data = pipeline.transform(df)

data.show()



PDF files (scanned or text)

The following code example creates an OCR pipeline for processing PDF files that contain text or image data.

For each PDF file, this pipeline will:

  • extract the text from the document and save it to the text column
  • if the text contains fewer than 10 characters (so the document is not a PDF with a text layout), process the PDF file as a scanned document:
    • convert the PDF file to an image
    • detect and split the image into regions
    • run OCR and save the output to the text column

import org.apache.spark.ml.Pipeline

import com.johnsnowlabs.ocr.transformers._

val imagePath = "path to PDF files"

// Read PDF files as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)

// Extract text from PDF text layout
val pdfToText = new PdfToText()
  .setInputCol("content")
  .setOutputCol("text")
  .setSplitPage(false)

// If the `text` column contains fewer than 10 characters,
// the pipeline runs PdfToImage as a fallback method
val pdfToImage = new PdfToImage()
  .setInputCol("content")
  .setOutputCol("image")
  .setFallBackCol("text")
  .setMinSizeBeforeFallback(10)

// OCR
val ocr = new ImageToText()
  .setInputCol("image")
  .setOutputCol("text")

// Define pipeline
val pipeline = new Pipeline()
pipeline.setStages(Array(
  pdfToText,
  pdfToImage,
  ocr
))

val modelPipeline = pipeline.fit(spark.emptyDataFrame)

val data = modelPipeline.transform(df)

data.show()

from pyspark.ml import PipelineModel

from sparkocr.transformers import *


imagePath = "path to PDF files"

# Read PDF files as binary file
df = spark.read \
  .format("binaryFile") \
  .load(imagePath)

# Extract text from PDF text layout
pdfToText = PdfToText() \
  .setInputCol("content") \
  .setOutputCol("text") \
  .setSplitPage(False)

# If the `text` column contains fewer than 10 characters,
# the pipeline runs PdfToImage as a fallback method
pdfToImage = PdfToImage() \
  .setInputCol("content") \
  .setOutputCol("image") \
  .setFallBackCol("text") \
  .setMinSizeBeforeFallback(10)

# OCR
ocr = ImageToText() \
  .setInputCol("image") \
  .setOutputCol("text")

# Define pipeline
pipeline = PipelineModel(stages=[
  pdfToText,
  pdfToImage,
  ocr,
])

data = pipeline.transform(df)

data.show()
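
The fallback rule used by this pipeline can be sketched in plain Python. This is only an illustration of the decision, not Spark OCR's implementation: the function name extract_text_with_fallback and the run_ocr callback are hypothetical, and the sketch assumes the check is a simple character count compared with the value passed to setMinSizeBeforeFallback.

```python
from typing import Callable


def extract_text_with_fallback(
    embedded_text: str,
    run_ocr: Callable[[], str],
    min_size_before_fallback: int = 10,
) -> str:
    """Return the embedded text, or the OCR result when the text is too short.

    `run_ocr` is a hypothetical callback that stands in for the
    PdfToImage + ImageToText stages; it is called only when needed.
    """
    # Assumption: "too short" means fewer characters than the threshold.
    if len(embedded_text) < min_size_before_fallback:
        return run_ocr()
    return embedded_text


# A PDF with a text layout keeps its extracted text.
print(extract_text_with_fallback("Text taken from the PDF", lambda: "text from OCR"))
# A scanned PDF (no embedded text) falls back to OCR.
print(extract_text_with_fallback("", lambda: "text from OCR"))
```

Passing OCR as a callback means the expensive step runs only for documents that need it, which mirrors the intent of the fallback column.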


Images (streaming mode)

The next code segments show an example of a streaming OCR pipeline. It processes images and stores the results in a memory table.


val imagePath = "path folder with images"

val batchDataFrame = spark.read.format("binaryFile").load(imagePath).limit(1)
    
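// binaryToImage, binarizer and ocr are transformer stages defined beforehand
// (binaryToImage and ocr as in the Images example above)
val pipeline = new Pipeline()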
pipeline.setStages(Array(
  binaryToImage,
  binarizer,
  ocr
))

val modelPipeline = pipeline.fit(batchDataFrame)

// Read files in streaming mode
val dataFrame = spark.readStream
  .format("binaryFile")
  .schema(batchDataFrame.schema)
  .load(imagePath)

// Call pipeline and store results to 'results' memory table
// ("path" is kept so that it can be selected from the table later)
val query = modelPipeline.transform(dataFrame)
  .select("path", "text", "exception")
  .writeStream
  .format("memory")
  .queryName("results")
  .start()

from pyspark.ml import Pipeline

imagePath = "path folder with images"

batchDataFrame = spark.read.format("binaryFile").load(imagePath).limit(1)

# binaryToImage, binarizer and ocr are transformer stages defined beforehand
pipeline = Pipeline(stages=[
  binaryToImage,
  binarizer,
  ocr
])

modelPipeline = pipeline.fit(batchDataFrame)

# Read files in streaming mode
dataFrame = spark.readStream \
  .format("binaryFile") \
  .schema(batchDataFrame.schema) \
  .load(imagePath)

# Call pipeline and store results to 'results' memory table
# ("path" is kept so that it can be selected from the table later)
query = modelPipeline.transform(dataFrame) \
  .select("path", "text", "exception") \
  .writeStream \
  .format("memory") \
  .queryName("results") \
  .start()

To get the results from the memory table, the following code can be used:


spark.table("results").select("path", "text").show()

More details about Spark Structured Streaming can be found in the Spark documentation.

Advanced Topics

Error Handling

Pipeline execution is not interrupted when a runtime exception occurs while processing a record.

Instead, the OCR transformers fill the exception column with the transformer name and the exception message.

NOTE: Storing runtime errors in the exception column makes it possible to process a whole batch of files even if some of them fail.
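
This behaviour can be illustrated with a small plain-Python sketch. It does not use Spark OCR: apply_stage and fake_binary_to_image are hypothetical names, the stage name BinaryToImage_demo is made up, and the PNG signature check merely stands in for real image decoding. The point is only that each record is processed inside its own try/except, so one bad file does not stop the batch.

```python
from typing import Any, Callable, Dict, List


def apply_stage(
    stage_name: str,
    func: Callable[[bytes], str],
    records: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Run `func` on every record, recording failures instead of raising."""
    results = []
    for record in records:
        try:
            text = func(record["content"])
            results.append({"path": record["path"], "text": text, "exception": None})
        except Exception as err:
            # Keep the record, leave the text empty, and store "<stage>: <message>".
            results.append(
                {"path": record["path"], "text": "", "exception": f"{stage_name}: {err}"}
            )
    return results


def fake_binary_to_image(content: bytes) -> str:
    """Hypothetical stand-in for image decoding plus OCR."""
    if not content.startswith(b"\x89PNG"):
        raise ValueError("Can't open file as image.")
    return "recognized text"


records = [
    {"path": "file:jquery-1.12.3.js", "content": b"(function(){})"},
    {"path": "file:image.png", "content": b"\x89PNG..."},
]
results = apply_stage("BinaryToImage_demo", fake_binary_to_image, records)

# Records that failed can be found afterwards by checking the exception field.
failed = [r["path"] for r in results if r["exception"] is not None]
print(failed)
```

Keeping failed records in the output, rather than dropping them, lets the failed files be listed and reprocessed after the batch finishes.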

Output

Here is the output, including an exception, produced when the OCR pipeline tries to process a js file:


result.select("path", "text", "exception").show(2, false) // Scala
result.select("path", "text", "exception").show(2, False) # Python

+----------------------+-------------------------------------------+-----------------------------------------------------+
|path                  |text                                       |exception                                            |
+----------------------+-------------------------------------------+-----------------------------------------------------+
|file:jquery-1.12.3.js |                                           |BinaryToImage_c0311dc62161: Can't open file as image.|
|file:image.png        |I prefer the morning flight through Denver |null                                                 |
+----------------------+-------------------------------------------+-----------------------------------------------------+

Performance

When a dataset contains a large number of text PDFs, repartition it manually to avoid skewed partitions and to use resources effectively. For example, the records can be distributed across partitions at random.
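
The effect of randomization on skew can be shown with a small plain-Python simulation. It is a sketch under stated assumptions, not a Spark OCR API: the helper names are hypothetical, the "sizes" are made-up processing costs (for example, scanned PDFs that need OCR cost far more than text PDFs), and partitions are modelled as simple lists.

```python
import random
from typing import List


def contiguous_partitions(sizes: List[int], num_partitions: int) -> List[List[int]]:
    """Split the records into consecutive chunks, keeping their original order."""
    chunk = -(-len(sizes) // num_partitions)  # ceiling division
    return [sizes[i:i + chunk] for i in range(0, len(sizes), chunk)]


def randomized_partitions(
    sizes: List[int], num_partitions: int, seed: int = 0
) -> List[List[int]]:
    """Shuffle the records first, then deal them out round-robin."""
    shuffled = sizes[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_partitions] for i in range(num_partitions)]


def max_load(partitions: List[List[int]]) -> int:
    """Cost of the slowest partition, which bounds the total run time."""
    return max(sum(p) for p in partitions)


# Hypothetical dataset: 10 expensive documents stored next to each other,
# followed by 90 cheap ones.
costs = [100] * 10 + [1] * 90

skewed = contiguous_partitions(costs, 10)
balanced = randomized_partitions(costs, 10)

print("max load without shuffling:", max_load(skewed))
print("max load with shuffling:   ", max_load(balanced))
```

In Spark, a comparable redistribution is usually obtained by repartitioning the DataFrame before the pipeline runs, for example with df.repartition(n), which spreads rows evenly across n partitions.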
