img2table

1.4.1 (last stable release, 2 months ago)

License

MIT

img2table is a simple, easy-to-use Python library for table identification and extraction, based on OpenCV image processing, that supports most common image file formats as well as PDF files.

Thanks to its design, it provides a practical and lighter alternative to neural network-based solutions, especially for usage on CPU.

- Installation
- Features
- Supported file formats
- Usage
  - Documents
    - Images
    - PDF
  - Supported OCRs
  - Table extraction
  - Excel export
- Examples
- Caveats / FYI

Installation

The library can be installed via pip:

pip install img2table: Standard installation, supporting Tesseract
pip install img2table[paddle]: For usage with Paddle OCR
pip install img2table[easyocr]: For usage with EasyOCR
pip install img2table[surya]: For usage with Surya OCR
pip install img2table[gcp]: For usage with Google Vision OCR
pip install img2table[aws]: For usage with AWS Textract OCR
pip install img2table[azure]: For usage with Azure Cognitive Services OCR
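
To see which optional OCR backends are importable in the current environment, a minimal sketch along these lines may help. The mapping from extras to module names is an assumption based on the usual import names of the underlying packages (it is not taken from the img2table documentation), and `available_extras` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping: img2table extra -> module expected to be importable.
# These module names are assumptions, not taken from the img2table docs.
EXTRA_MODULES = {
    "paddle": "paddleocr",
    "easyocr": "easyocr",
    "surya": "surya",
    "gcp": "google.cloud.vision",
    "aws": "boto3",
    "azure": "azure.cognitiveservices.vision.computervision",
}


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located (parents of dotted names get imported)."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package of a dotted name is missing or malformed.
        return False


def available_extras() -> dict:
    """Map each extra name to whether its backing module is importable."""
    return {extra: is_importable(module) for extra, module in EXTRA_MODULES.items()}


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"img2table[{extra}]: {'available' if ok else 'missing'}")
```

A missing entry only means the assumed module was not found; it does not prove the extra was never installed under a different module name.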

Features

Table identification for images and PDF files, including bounding boxes at the table cell level
Handling of complex table structures such as merged cells
Handling of implicit content - see example
Table content extraction by providing support for OCR services / tools
Extracted tables are returned as a simple object, including a Pandas DataFrame representation
Export extracted tables to an Excel file, preserving their original structure

Supported file formats

Images

Images are loaded using the opencv-python library; supported formats are listed below.

Supported image formats

Windows bitmaps - *.bmp, *.dib
JPEG files - *.jpeg, *.jpg, *.jpe
JPEG 2000 files - *.jp2
Portable Network Graphics - *.png
WebP - *.webp
Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm
PFM files - *.pfm
Sun rasters - *.sr, *.ras
TIFF files - *.tiff, *.tif
OpenEXR Image files - *.exr
Radiance HDR - *.hdr, *.pic
Raster and Vector geospatial data supported by GDAL

Source: OpenCV, "Image file reading and writing"

Multi-page images are not supported.
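
Before handing a file to the library, it can be convenient to decide from its extension whether it should be treated as an image or as a PDF. The sketch below is only an illustration: `document_kind` is a hypothetical helper, the extension set is copied from the list above, and GDAL-backed formats are deliberately left out.

```python
from pathlib import Path

# Extensions copied from the list above (GDAL-backed formats are not covered).
SUPPORTED_IMAGE_EXTENSIONS = {
    ".bmp", ".dib", ".jpeg", ".jpg", ".jpe", ".jp2", ".png", ".webp",
    ".pbm", ".pgm", ".ppm", ".pxm", ".pnm", ".pfm", ".sr", ".ras",
    ".tiff", ".tif", ".exr", ".hdr", ".pic",
}


def document_kind(path):
    """Classify a path as 'pdf', 'image' or 'unsupported' from its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix in SUPPORTED_IMAGE_EXTENSIONS:
        return "image"
    return "unsupported"


for name in ("scan.PNG", "report.pdf", "notes.docx"):
    print(name, "->", document_kind(name))
```

The check looks at the file name only, so a multi-page TIFF would still be classified as an image even though multi-page images are not supported.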

PDF

Both native and scanned PDF files are supported.

Usage

Documents

Images

Images are instantiated as follows:

from img2table.document import Image

image = Image(src, 
              detect_rotation=False)

Parameters

src : str, pathlib.Path, bytes or io.BytesIO, required
    Image source
detect_rotation : bool, optional, default False
    Detect and correct skew/rotation of the image

The implemented method to handle skewed/rotated images supports skew angles up to 45° and is based on the publication by Huang, 2020.
When the detect_rotation parameter is set to True, image coordinates and bounding boxes returned by other methods might not correspond to the original image.

PDF

PDF files are instantiated as follows:

from img2table.document import PDF

pdf = PDF(src, 
          pages=[0, 2],
          detect_rotation=False,
          pdf_text_extraction=True)

Parameters

src : str, pathlib.Path, bytes or io.BytesIO, required
    PDF source
pages : list, optional, default None
    List of PDF page indexes to be processed. If None, all pages are processed
detect_rotation : bool, optional, default False
    Detect and correct skew/rotation of extracted images from the PDF
pdf_text_extraction : bool, optional, default True
    Extract text from the PDF file for native PDFs

PDF pages are converted to images at 200 DPI for table identification.
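
Because pages are rendered at 200 DPI, a position on the page image can be related to PDF points (1/72 inch) with simple arithmetic. The sketch below assumes that bounding boxes returned for PDF pages are expressed in pixels of that 200 DPI image and that detect_rotation is left at False; `pixels_to_points` and `bbox_to_points` are hypothetical helpers, not part of img2table.

```python
PDF_POINTS_PER_INCH = 72  # standard PDF unit: 1 point = 1/72 inch
RENDER_DPI = 200          # rendering resolution stated above


def pixels_to_points(value_px, dpi=RENDER_DPI):
    """Convert a length in pixels at the given DPI to PDF points."""
    return value_px * PDF_POINTS_PER_INCH / dpi


def bbox_to_points(x1, y1, x2, y2, dpi=RENDER_DPI):
    """Convert a pixel bounding box to points, keeping the same origin."""
    return tuple(pixels_to_points(v, dpi) for v in (x1, y1, x2, y2))


# A 200 px wide cell is one inch wide, i.e. 72 points.
print(pixels_to_points(200))            # 72.0
print(bbox_to_points(0, 100, 200, 400)) # (0.0, 36.0, 72.0, 144.0)
```

The conversion only rescales lengths; if the target coordinate system has its origin at the bottom-left of the page, the y values additionally need to be flipped against the page height.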

OCR

img2table provides an interface for several OCR services and tools in order to parse table content.
If possible (i.e. for native PDFs), text is extracted directly from the PDF file and the OCR service/tool is not called.

Tesseract

from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1, 
                   lang="eng", 
                   psm=11,
                   tessdata_dir="...")

Parameters

n_threads : int, optional, default 1
    Number of concurrent threads used to call Tesseract
lang : str, optional, default "eng"
    Lang parameter used in Tesseract for text extraction
psm : int, optional, default 11
    PSM parameter used in Tesseract, run tesseract --help-psm for details
tessdata_dir : str, optional, default None
    Directory containing Tesseract traineddata files. If None, the TESSDATA_PREFIX env variable is used.

Using Tesseract-OCR requires prior installation. Check the Tesseract documentation for instructions.
Windows users getting environment variable errors can check this tutorial.
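
When Tesseract does not seem to be picked up, a quick look at the environment can narrow the problem down. This is a minimal sketch that assumes the tesseract executable is expected on the PATH; `tesseract_environment` is a hypothetical helper and does not call img2table.

```python
import os
import shutil


def tesseract_environment():
    """Report whether the tesseract binary and TESSDATA_PREFIX are visible."""
    binary = shutil.which("tesseract")  # None if the executable is not on PATH
    return {
        "tesseract_path": binary,
        "tesseract_on_path": binary is not None,
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),  # None if unset
    }


if __name__ == "__main__":
    for key, value in tesseract_environment().items():
        print(f"{key}: {value}")
```

If tessdata_prefix is None, passing tessdata_dir explicitly to TesseractOCR is the alternative described in the parameters above.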

PaddleOCR

PaddleOCR is an open-source OCR based on Deep Learning models.
At first use, relevant language models will be downloaded.

from img2table.ocr import PaddleOCR

ocr = PaddleOCR(lang="en",
                kw={"kwarg": kw_value, ...})

Parameters

lang : str, optional, default "en"
    Lang parameter used in Paddle for text extraction, check documentation for available languages
kw : dict, optional, default None
    Dictionary containing additional keyword arguments passed to the PaddleOCR constructor.

NB: For usage of PaddleOCR with GPU, the CUDA specific version of paddlepaddle-gpu has to be installed by the user manually as stated in this issue.

# Example of installation with CUDA 11.8
pip install paddlepaddle-gpu==2.5.0rc1.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
pip install paddleocr img2table

If you get an error trying to run PaddleOCR on Ubuntu, please check this issue for a working solution.

EasyOCR

EasyOCR is an open-source OCR based on Deep Learning models.
At first use, relevant language models will be downloaded.

from img2table.ocr import EasyOCR

ocr = EasyOCR(lang=["en"],
              kw={"kwarg": kw_value, ...})

Parameters

lang : list, optional, default ["en"]
    Lang parameter used in EasyOCR for text extraction, check documentation for available languages
kw : dict, optional, default None
    Dictionary containing additional keyword arguments passed to the EasyOCR Reader constructor.

docTR

docTR is an open-source OCR based on Deep Learning models.
docTR must be installed by the user beforehand. Installation procedures are detailed in the package documentation.

from img2table.ocr import DocTR

ocr = DocTR(detect_language=False,
            kw={"kwarg": kw_value, ...})

Parameters

detect_language : bool, optional, default False
    Parameter indicating if language prediction is run on the document
kw : dict, optional, default None
    Dictionary containing additional keyword arguments passed to the docTR ocr_predictor method.

Surya OCR

Only available for Python >= 3.10.
Surya is an open-source OCR based on Deep Learning models.
At first use, relevant models will be downloaded.

from img2table.ocr import SuryaOCR

ocr = SuryaOCR(langs=["en"])

Parameters

langs : list, optional, default ["en"]
    Lang parameter used in Surya OCR for text extraction

Google Vision

Authentication to GCP can be done by setting the standard GOOGLE_APPLICATION_CREDENTIALS environment variable.
If this variable is missing, an API key should be provided via the api_key parameter.

from img2table.ocr import VisionOCR

ocr = VisionOCR(api_key="api_key", timeout=15)

Parameters

api_key : str, optional, default None
    Google Vision API key
timeout : int, optional, default 15
    API requests timeout, in seconds

AWS Textract

When using AWS Textract, the DetectDocumentText API is exclusively called.

Authentication to AWS can be done by passing credentials to the TextractOCR class.
If credentials are not provided, authentication is done using environment variables or configuration files. Check boto3 documentation for more details.

from img2table.ocr import TextractOCR

ocr = TextractOCR(aws_access_key_id="***",
                  aws_secret_access_key="***",
                  aws_session_token="***",
                  region="eu-west-1")

Parameters

aws_access_key_id : str, optional, default None
    AWS access key id
aws_secret_access_key : str, optional, default None
    AWS secret access key
aws_session_token : str, optional, default None
    AWS temporary session token
region : str, optional, default None
    AWS server region

Azure Cognitive Services

from img2table.ocr import AzureOCR

ocr = AzureOCR(endpoint="abc.azure.com",
               subscription_key="***")

Parameters

endpoint : str, optional, default None
    Azure Cognitive Services endpoint. If None, inferred from the COMPUTER_VISION_ENDPOINT environment variable.
subscription_key : str, optional, default None
    Azure Cognitive Services subscription key. If None, inferred from the COMPUTER_VISION_SUBSCRIPTION_KEY environment variable.
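
The fallback described for endpoint and subscription_key can be checked up front, so that a missing setting is reported before any request is made. The sketch below only mirrors the documented behaviour and is not the library's own code: `resolve_azure_settings` is a hypothetical helper, and raising ValueError on a missing value is a choice made for this example.

```python
import os


def resolve_azure_settings(endpoint=None, subscription_key=None, env=None):
    """Return (endpoint, subscription_key), falling back to environment variables."""
    env = os.environ if env is None else env
    endpoint = endpoint or env.get("COMPUTER_VISION_ENDPOINT")
    subscription_key = subscription_key or env.get("COMPUTER_VISION_SUBSCRIPTION_KEY")
    # Collect the names of settings that are still empty after the fallback.
    missing = [
        name
        for name, value in (("endpoint", endpoint), ("subscription_key", subscription_key))
        if not value
    ]
    if missing:
        raise ValueError(f"Missing Azure settings: {', '.join(missing)}")
    return endpoint, subscription_key
```

The returned pair can then be passed explicitly, for example `AzureOCR(endpoint=endpoint, subscription_key=subscription_key)`.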

Table extraction

Multiple tables can be extracted at once from a PDF page or an image using the extract_tables method of a document.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=False,
                                      implicit_columns=False,
                                      borderless_tables=False,
                                      min_confidence=50)

Parameters

ocr : OCRInstance, optional, default None
    OCR instance used to parse document text. If None, cells content will not be extracted
implicit_rows : bool, optional, default False
    Boolean indicating if implicit rows should be identified - check related example
implicit_columns : bool, optional, default False
    Boolean indicating if implicit columns should be identified - check related example
borderless_tables : bool, optional, default False
    Boolean indicating if borderless tables are extracted on top of bordered tables.
min_confidence : int, optional, default 50
    Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)

NB: Borderless table extraction can, by design, only extract tables with 3 or more columns.

Method return

The ExtractedTable class is used to model extracted tables from documents.

Attributes

bbox : BBox
    Table bounding box
title : str
    Extracted title of the table
content : OrderedDict
    Dict with row indexes as keys and list of TableCell objects as values
df : pd.DataFrame
    Pandas DataFrame representation of the table
html : str
    HTML representation of the table

In order to access bounding boxes at the cell level, you can use the following code snippet:

# table is a single ExtractedTable, e.g. one element returned by extract_tables
for id_row, row in enumerate(table.content.values()):
    for id_col, cell in enumerate(row):
        x1 = cell.bbox.x1
        y1 = cell.bbox.y1
        x2 = cell.bbox.x2
        y2 = cell.bbox.y2
        value = cell.value
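
The same traversal can collect the cells into plain records, which are easier to log or serialise. This is a sketch under stated assumptions: `cells_to_records` and `fake_cell` are hypothetical names, and the demo uses stand-in objects that only mimic the documented bbox and value attributes rather than real img2table objects.

```python
from types import SimpleNamespace


def cells_to_records(content):
    """Flatten a row-index -> list-of-cells mapping into plain dictionaries."""
    records = []
    for id_row, row in enumerate(content.values()):
        for id_col, cell in enumerate(row):
            records.append({
                "row": id_row,
                "col": id_col,
                "bbox": (cell.bbox.x1, cell.bbox.y1, cell.bbox.x2, cell.bbox.y2),
                "value": cell.value,
            })
    return records


def fake_cell(x1, y1, x2, y2, value):
    """Build a stand-in cell exposing the documented attributes (illustration only)."""
    return SimpleNamespace(bbox=SimpleNamespace(x1=x1, y1=y1, x2=x2, y2=y2), value=value)


demo_content = {
    0: [fake_cell(0, 0, 50, 20, "Name"), fake_cell(50, 0, 100, 20, "Qty")],
    1: [fake_cell(0, 20, 50, 40, "Apple"), fake_cell(50, 20, 100, 40, "3")],
}
records = cells_to_records(demo_content)
print(len(records), records[0]["value"])
```

With a real extraction result, the call would be `cells_to_records(table.content)`.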

Images

The extract_tables method of the Image class returns a list of ExtractedTable objects.

output = [ExtractedTable(...), ExtractedTable(...), ...]

PDF

The extract_tables method of the PDF class returns an OrderedDict object with page indexes as keys and lists of ExtractedTable objects as values.

output = {
    0: [ExtractedTable(...), ...],
    1: [],
    ...
    last_page: [ExtractedTable(...), ...]
}
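
Since the PDF result is keyed by page index and some pages may hold an empty list, a small generator keeps the iteration tidy. The sketch below assumes only the structure shown above; `iter_page_tables` and `pages_without_tables` are hypothetical helpers, and strings stand in for ExtractedTable objects in the demo.

```python
from collections import OrderedDict


def iter_page_tables(pdf_output):
    """Yield (page_index, table_index, table) for every table in a PDF result."""
    for page_index, tables in pdf_output.items():
        for table_index, table in enumerate(tables):
            yield page_index, table_index, table


def pages_without_tables(pdf_output):
    """Return the page indexes for which no table was extracted."""
    return [page for page, tables in pdf_output.items() if not tables]


# Stand-in result: strings replace ExtractedTable objects for illustration.
demo_output = OrderedDict([(0, ["t0a", "t0b"]), (1, []), (2, ["t2a"])])
print(list(iter_page_tables(demo_output)))
print(pages_without_tables(demo_output))
```

With real results, each yielded table exposes the attributes listed above, so `table.df` or `table.html` can be used inside the loop.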

Excel export

Tables extracted from a document can be exported to an xlsx file. The resulting file is composed of one worksheet per extracted table.
Method arguments are mostly shared with the extract_tables method.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Extraction of tables and creation of a xlsx file containing tables
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=False,
            implicit_columns=False,
            borderless_tables=False,
            min_confidence=50)

Parameters

dest : str, pathlib.Path or io.BytesIO, required
    Destination for xlsx file
ocr : OCRInstance, optional, default None
    OCR instance used to parse document text. If None, cells content will not be extracted
implicit_rows : bool, optional, default False
    Boolean indicating if implicit rows should be identified - check related example
implicit_columns : bool, optional, default False
    Boolean indicating if implicit columns should be identified - check related example
borderless_tables : bool, optional, default False
    Boolean indicating if borderless tables are extracted. An OCR instance must be provided to the method for this to be performed - feature in alpha version
min_confidence : int, optional, default 50
    Minimum confidence level from OCR in order to process text, from 0 (worst) to 99 (best)

Returns

If an io.BytesIO buffer is passed as the dest argument, it is returned containing the xlsx data.
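
When the export goes to an in-memory buffer, the returned bytes can be checked and persisted later. The sketch below is illustrative only: `save_xlsx_buffer` is a hypothetical helper, and the check relies on the fact that xlsx files are ZIP archives and therefore start with the bytes PK.

```python
import io
from pathlib import Path

XLSX_MAGIC = b"PK"  # xlsx files are ZIP archives, which start with these bytes


def save_xlsx_buffer(buffer, path):
    """Write an in-memory xlsx buffer to disk and return the number of bytes written."""
    data = buffer.getvalue()
    if not data.startswith(XLSX_MAGIC):
        raise ValueError("Buffer does not look like an xlsx (ZIP) file")
    return Path(path).write_bytes(data)


# Intended use (requires img2table; doc and ocr as in the example above):
#   buffer = doc.to_xlsx(dest=io.BytesIO(), ocr=ocr)
#   save_xlsx_buffer(buffer, "tables.xlsx")
```

Keeping the data in a buffer is mainly useful when the file is to be returned from a web handler or uploaded elsewhere without touching the local disk.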

Examples

Several Jupyter notebooks with examples are available:

Basic usage: generic library usage, including examples with images, PDF and OCRs
Borderless tables: specific examples dedicated to the extraction of borderless tables
Implicit content: illustrated effect of the parameter implicit_rows/implicit_columns of the extract_tables method

Caveats / FYI

For table extraction, results are highly dependent on OCR quality. By design, tables where no OCR data can be found are not returned.
The library is tailored for usage on documents with a white/light background. Effectiveness cannot be guaranteed on other types of documents.
Table detection using only OpenCV processing can have some limitations. If the library fails to detect tables, you may want to check CNN-based solutions.
