Structures and helpers

Schemas

Image Schema

Images are loaded as a DataFrame with a single column called “image.”

It is a struct-type column, that contains all information about image:

image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = false)
 |    |-- width: integer (nullable = false)
 |    |-- nChannels: integer (nullable = false)
 |    |-- mode: integer (nullable = false)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)

Fields

Field name	Type	Description
origin	string	source URI
height	integer	image height in pixels
width	integer	image width in pixels
nChannels	integer	number of color channels
mode	ImageType	the data type and channel order the data is stored in
resolution	integer	resolution of image in dpi
data	binary	image data in a binary format

NOTE: Image data stored in a binary format. Image data is represented as a 3-dimensional array with the dimension shape (height, width, nChannels) and array values of type t specified by the mode field.

Coordinate Schema

element: struct (containsNull = true)
 |    |    |-- index: integer (nullable = false)
 |    |    |-- page: integer (nullable = false)
 |    |    |-- x: float (nullable = false)
 |    |    |-- y: float (nullable = false)
 |    |    |-- width: float (nullable = false)
 |    |    |-- height: float (nullable = false)

Field name	Type	Description
index	integer	Chunk index
page	integer	Page number
x	float	The lower left x coordinate
y	float	The lower left y coordinate
width	float	The width of the rectangle
height	float	The height of the rectangle
score	float	The score of the object
label	string	The label of the object

PageMatrix Schema

element: struct (containsNull = true)
 |    |    |-- mappings: array[struct] (nullable = false)

Field name	Type	Description
mappings	Array[Mapping]	Array of mappings

Mapping Schema

element: struct (containsNull = true)
 |    |    |-- c: string (nullable = false)
 |    |    |-- p: integer (nullable = false)
 |    |    |-- x: float (nullable = false)
 |    |    |-- y: float (nullable = false)
 |    |    |-- width: float (nullable = false)
 |    |    |-- height: float (nullable = false)
 |    |    |-- fontSize: integer (nullable = false)

Field name	Type	Description
c	string	Character
p	integer	Page number
x	float	The lower left x coordinate
y	float	The lower left y coordinate
width	float	The width of the rectangle
height	float	The height of the rectangle
fontSize	integer	Font size in points

Enums

PageSegmentationMode

OSD_ONLY: Orientation and script detection only.
AUTO_OSD: Automatic page segmentation with orientation and script detection.
AUTO_ONLY: Automatic page segmentation, but no OSD, or OCR.
AUTO: Fully automatic page segmentation, but no OSD.
SINGLE_COLUMN: Assume a single column of text of variable sizes.
SINGLE_BLOCK_VERT_TEXT: Assume a single uniform block of vertically aligned text.
SINGLE_BLOCK: Assume a single uniform block of text.
SINGLE_LINE: Treat the image as a single text line.
SINGLE_WORD: Treat the image as a single word.
CIRCLE_WORD: Treat the image as a single word in a circle.
SINGLE_CHAR: Treat the image as a single character.
SPARSE_TEXT: Find as much text as possible in no particular order.
SPARSE_TEXT_OSD: Sparse text with orientation and script detection.

EngineMode

TESSERACT_ONLY: Legacy engine only.
OEM_LSTM_ONLY: Neural nets LSTM engine only.
TESSERACT_LSTM_COMBINED: Legacy + LSTM engines.
DEFAULT: Default, based on what is available.

PageIteratorLevel

BLOCK: Block of text/image/separator line.
PARAGRAPH: Paragraph within a block.
TEXTLINE: Line within a paragraph.
WORD: Word within a text line.
SYMBOL: Symbol/character within a word.

Language

ENG: English
FRA: French
SPA: Spanish
RUS: Russian
DEU: German
VIE: Vietnamese
ARA: Arabic

ModelType

BASE: Block of text/image/separator line.
BEST: Paragraph within a block.
FAST: Line within a paragraph.

ImageType

TYPE_BYTE_GRAY
TYPE_BYTE_BINARY
TYPE_3BYTE_BGR
TYPE_4BYTE_ABGR

NoiseMethod

VARIANCE
RATIO

KernelShape

SQUARE
DIAMOND
DISK
OCTAHEDRON
OCTAGON
STAR

MorphologyOperationType

OPENING
CLOSING
EROSION
DILATION

CropSquareType

TOP_LEFT
TOP_CENTER
TOP_RIGHT
CENTER_LEFT
CENTER
CENTER_RIGHT
BOTTOM_LEFT
BOTTOM_CENTER
BOTTOM_RIGHT

SplittingStrategy

FIXED_NUMBER_OF_PARTITIONS
FIXED_SIZE_OF_PARTITION

AdaptiveThresholdingMethod

GAUSSIAN
MEAN
MEDIAN
WOLF
SINGH

TresholdingMethod

GAUSSIAN
OTSU
SAUVOLA
WOLF

CellDetectionAlgos

CONTOURS - Detect cells in bordered tables
MORPHOPS - Detected calls in: bordered, borderless and combined tables

TableOutputFormat

TABLE - Table struct format
CSV - Comma separated CSV

OCR implicits

asImage

asImage transforms binary content to Image schema.

Parameters

Param name	Type	Default	Description
outputCol	string	image	output column name
contentCol	string	content	input column name with binary content
pathCol	string	path	input column name with path to original file

Example:

import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")

df.show()

storeImage

storeImage stores the image(s) to tmp location and return Dataset with path(s) to stored image files.

Parameters

Param name	Type	Default	Description
inputColumn	string		input column name with image struct
formatName	string	png	image format name
prefix	string	sparknlp_ocr_	prefix for output file

Example:

import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")

df.storeImage("image")

showImages

Show images on Databrics notebook.

Parameters

Param name	Type	Default	Description
field	string	image	input column name with image struct
limit	integer	5	count of rows for display
width	string	“800”	width of image
show_meta	boolean	true	enable/disable displaying metadata of image

Jupyter Python helpers

display_image

Show single image with metadata in Jupyter notebook.

Parameters

Param name	Type	Default	Description
width	string	“600”	width of image
show_meta	boolean	true	enable/disable displaying metadata of image

Example:

from sparkocr.utils import display_image
from sparkocr.transformers import BinaryToImage

images_path = "/tmp/ocr/images/*.tif"
images_example_df = spark.read.format("binaryFile").load(images_path).cache()

display_image(BinaryToImage().transform(images_example_df).collect()[0].image)

display_images

Show images from dataframe.

Parameters

Param name	Type	Default	Description
field	string	image	input column name with image struct
limit	integer	5	count of rows for display
width	string	“600”	width of image
show_meta	boolean	true	enable/disable displaying metadata of image

Example:

from sparkocr.utils import display_images
from sparkocr.transformers import BinaryToImage

images_path = "/tmp/ocr/images/*.tif"
images_example_df = spark.read.format("binaryFile").load(images_path).cache()

display_images(BinaryToImage().transform(images_example_df), limit=3)

Structures and helpers

display_images_horizontal

Show one or more images per row from dataframe.

Parameters

Param name	Type	Default	Description
fields	string	image	comma separated input column names with image struct
limit	integer	5	count of rows for display
width	string	“600”	width of image
show_meta	boolean	true	enable/disable displaying metadata of image

Example:

from sparkocr.utils import display_images_horizontal

display_images_horizontal(df_with_few_image_fields, fields="images, image_with_regions", limit=10)

Structures and helpers

display_pdf

Show pdf from dataframe.

Parameters

Param name	Type	Default	Description
field	string	content	input column with binary representation of pdf
limit	integer	5	count of rows for display
width	string	“600”	width of image
show_meta	boolean	true	enable/disable displaying metadata of image

Example:

from sparkocr.utils import display_pdf

pdf_df = spark.read.format("binaryFile").load(pdf_path)
display_pdf(pdf_df)

Structures and helpers

display_pdf_file

Show pdf file using embedded pdf viewer.

Parameters

Param name	Type	Default	Description
pdf	string		Path to the file name
size	integer	size=(600, 500)	count of rows for display

Example:

from sparkocr.utils import display_pdf_file

display_pdf_file("path to the pdf file")

Example output:

Structures and helpers

display_table

Display table from the dataframe.

display_tables

Display tables from the dataframe. It is useful for display results of table recognition from the multipage documents/few tables per page.

Example output:

Structures and helpers

Databricks Python helpers

display_images

Show images from dataframe.

Parameters

Param name	Type	Default	Description
field	string	image	input column name with image struct
limit	integer	5	count of rows for display
width	string	“800”	width of image
show_meta	boolean	true	enable/disable displaying metadata of image

Example:

from sparkocr.databricks import display_images
from sparkocr.transformers import BinaryToImage

images_path = "/tmp/ocr/images/*.tif"
images_example_df = spark.read.format("binaryFile").load(images_path).cache()

display_images(BinaryToImage().transform(images_example_df), limit=3)

Databricks Python helpers

PREVIOUSObject detection

NEXTSpeed benchmarks