Structures and helpers

 

Schemas

Image Schema

Images are loaded as a DataFrame with a single column called “image.”

It is a struct-type column, that contains all information about image:

image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = false)
 |    |-- width: integer (nullable = false)
 |    |-- nChannels: integer (nullable = false)
 |    |-- mode: integer (nullable = false)
 |    |-- resolution: integer (nullable = true)
 |    |-- data: binary (nullable = true)

Fields

Field name Type Description
origin string source URI
height integer image height in pixels
width integer image width in pixels
nChannels integer number of color channels
mode ImageType the data type and channel order the data is stored in
resolution integer resolution of image in dpi
data binary image data in a binary format

NOTE: Image data stored in a binary format. Image data is represented as a 3-dimensional array with the dimension shape (height, width, nChannels) and array values of type t specified by the mode field.

Coordinate Schema

element: struct (containsNull = true)
 |    |    |-- index: integer (nullable = false)
 |    |    |-- page: integer (nullable = false)
 |    |    |-- x: float (nullable = false)
 |    |    |-- y: float (nullable = false)
 |    |    |-- width: float (nullable = false)
 |    |    |-- height: float (nullable = false)
Field name Type Description
index integer Chunk index
page integer Page number
x float The lower left x coordinate
y float The lower left y coordinate
width float The width of the rectangle
height float The height of the rectangle
score float The score of the object
label string The label of the object

PageMatrix Schema

element: struct (containsNull = true)
 |    |    |-- mappings: array[struct] (nullable = false)
Field name Type Description
mappings Array[Mapping] Array of mappings

Mapping Schema

element: struct (containsNull = true)
 |    |    |-- c: string (nullable = false)
 |    |    |-- p: integer (nullable = false)
 |    |    |-- x: float (nullable = false)
 |    |    |-- y: float (nullable = false)
 |    |    |-- width: float (nullable = false)
 |    |    |-- height: float (nullable = false)
 |    |    |-- fontSize: integer (nullable = false)
Field name Type Description
c string Character
p integer Page number
x float The lower left x coordinate
y float The lower left y coordinate
width float The width of the rectangle
height float The height of the rectangle
fontSize integer Font size in points

Enums

PageSegmentationMode

  • OSD_ONLY: Orientation and script detection only.
  • AUTO_OSD: Automatic page segmentation with orientation and script detection.
  • AUTO_ONLY: Automatic page segmentation, but no OSD, or OCR.
  • AUTO: Fully automatic page segmentation, but no OSD.
  • SINGLE_COLUMN: Assume a single column of text of variable sizes.
  • SINGLE_BLOCK_VERT_TEXT: Assume a single uniform block of vertically aligned text.
  • SINGLE_BLOCK: Assume a single uniform block of text.
  • SINGLE_LINE: Treat the image as a single text line.
  • SINGLE_WORD: Treat the image as a single word.
  • CIRCLE_WORD: Treat the image as a single word in a circle.
  • SINGLE_CHAR: Treat the image as a single character.
  • SPARSE_TEXT: Find as much text as possible in no particular order.
  • SPARSE_TEXT_OSD: Sparse text with orientation and script detection.

EngineMode

  • TESSERACT_ONLY: Legacy engine only.
  • OEM_LSTM_ONLY: Neural nets LSTM engine only.
  • TESSERACT_LSTM_COMBINED: Legacy + LSTM engines.
  • DEFAULT: Default, based on what is available.

PageIteratorLevel

  • BLOCK: Block of text/image/separator line.
  • PARAGRAPH: Paragraph within a block.
  • TEXTLINE: Line within a paragraph.
  • WORD: Word within a text line.
  • SYMBOL: Symbol/character within a word.

Language

  • ENG: English
  • FRA: French
  • SPA: Spanish
  • RUS: Russian
  • DEU: German
  • VIE: Vietnamese
  • ARA: Arabic

ModelType

  • BASE: Block of text/image/separator line.
  • BEST: Paragraph within a block.
  • FAST: Line within a paragraph.

ImageType

  • TYPE_BYTE_GRAY
  • TYPE_BYTE_BINARY
  • TYPE_3BYTE_BGR
  • TYPE_4BYTE_ABGR

NoiseMethod

  • VARIANCE
  • RATIO

KernelShape

  • SQUARE
  • DIAMOND
  • DISK
  • OCTAHEDRON
  • OCTAGON
  • STAR

MorphologyOperationType

  • OPENING
  • CLOSING
  • EROSION
  • DILATION

CropSquareType

  • TOP_LEFT
  • TOP_CENTER
  • TOP_RIGHT
  • CENTER_LEFT
  • CENTER
  • CENTER_RIGHT
  • BOTTOM_LEFT
  • BOTTOM_CENTER
  • BOTTOM_RIGHT

SplittingStrategy

  • FIXED_NUMBER_OF_PARTITIONS
  • FIXED_SIZE_OF_PARTITION

AdaptiveThresholdingMethod

  • GAUSSIAN
  • MEAN
  • MEDIAN
  • WOLF
  • SINGH

TresholdingMethod

  • GAUSSIAN
  • OTSU
  • SAUVOLA
  • WOLF

CellDetectionAlgos

  • CONTOURS - Detect cells in bordered tables
  • MORPHOPS - Detected calls in: bordered, borderless and combined tables

TableOutputFormat

  • TABLE - Table struct format
  • CSV - Comma separated CSV

OCR implicits

asImage

asImage transforms binary content to Image schema.

Parameters

Param name Type Default Description
outputCol string image output column name
contentCol string content input column name with binary content
pathCol string path input column name with path to original file

Example:

import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")

df.show()

storeImage

storeImage stores the image(s) to tmp location and return Dataset with path(s) to stored image files.

Parameters

Param name Type Default Description
inputColumn string   input column name with image struct
formatName string png image format name
prefix string sparknlp_ocr_ prefix for output file

Example:

import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")

df.storeImage("image")

showImages

Show images on Databrics notebook.

Parameters

Param name Type Default Description
field string image input column name with image struct
limit integer 5 count of rows for display
width string “800” width of image
show_meta boolean true enable/disable displaying metadata of image

Jupyter Python helpers

display_image

Show single image with metadata in Jupyter notebook.

Parameters

Param name Type Default Description
width string “600” width of image
show_meta boolean true enable/disable displaying metadata of image

Example:

from sparkocr.utils import display_image
from sparkocr.transformers import BinaryToImage

images_path = "/tmp/ocr/images/*.tif"
images_example_df = spark.read.format("binaryFile").load(images_path).cache()

display_image(BinaryToImage().transform(images_example_df).collect()[0].image)

display_images

Show images from dataframe.

Parameters

Param name Type Default Description
field string image input column name with image struct
limit integer 5 count of rows for display
width string “600” width of image
show_meta boolean true enable/disable displaying metadata of image

Example:

from sparkocr.utils import display_images
from sparkocr.transformers import BinaryToImage

images_path = "/tmp/ocr/images/*.tif"
images_example_df = spark.read.format("binaryFile").load(images_path).cache()

display_images(BinaryToImage().transform(images_example_df), limit=3)

image

display_images_horizontal

Show one or more images per row from dataframe.

Parameters

Param name Type Default Description
fields string image comma separated input column names with image struct
limit integer 5 count of rows for display
width string “600” width of image
show_meta boolean true enable/disable displaying metadata of image

Example:

from sparkocr.utils import display_images_horizontal

display_images_horizontal(df_with_few_image_fields, fields="images, image_with_regions", limit=10)

image

display_pdf

Show pdf from dataframe.

Parameters

Param name Type Default Description
field string content input column with binary representation of pdf
limit integer 5 count of rows for display
width string “600” width of image
show_meta boolean true enable/disable displaying metadata of image

Example:

from sparkocr.utils import display_pdf

pdf_df = spark.read.format("binaryFile").load(pdf_path)
display_pdf(pdf_df)

image

display_pdf_file

Show pdf file using embedded pdf viewer.

Parameters

Param name Type Default Description
pdf string   Path to the file name
size integer size=(600, 500) count of rows for display

Example:

from sparkocr.utils import display_pdf_file

display_pdf_file("path to the pdf file")

Example output:

image

display_table

Display table from the dataframe.

display_tables

Display tables from the dataframe. It is useful for display results of table recognition from the multipage documents/few tables per page.

Example output:

image

Databricks Python helpers

display_images

Show images from dataframe.

Parameters

Param name Type Default Description
field string image input column name with image struct
limit integer 5 count of rows for display
width string “800” width of image
show_meta boolean true enable/disable displaying metadata of image

Example:

from sparkocr.databricks import display_images
from sparkocr.transformers import BinaryToImage

images_path = "/tmp/ocr/images/*.tif"
images_example_df = spark.read.format("binaryFile").load(images_path).cache()

display_images(BinaryToImage().transform(images_example_df), limit=3)

image

Last updated