
How to use OCR Table to generate CSV with Python


Quickly and easily extract tables from documents and transform them into CSV in just a few simple steps!

Why convert to CSV?

CSV is a simple, human-readable, and widely supported format for tabular data. It is the go-to choice for data manipulation, analysis, and integration with existing systems, especially for businesses that rely on spreadsheets, databases, and data warehouses for decision-making.

That is precisely why we are offering a Python-based solution for converting JSON responses from the Eden AI OCR Table API into CSV format. By following these simple steps, you'll learn how to streamline data processing and integration and get the most out of your digitized content.

How to convert a table into CSV

1. Get the response from the OCR Table API

NOTE: For this tutorial, we will concentrate on simple tables that are easy to represent in .csv format. Tables with many row and column spans are an entirely different challenge to represent in such a simple format.

First things first: we need to parse our document into JSON with the Eden AI API.

The API is asynchronous: the first request only launches a job and returns right away, and the result is fetched later with a second request. This means we can launch several jobs without waiting for the previous one to finish, which is useful when parsing a document that spans many pages and takes a long time to process.

For this example, however, we will just send a very simple table image (its address is the file_url in the code that follows).

Here is a code snippet to show you how to launch the job:


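import requests

# Replace YOUR_API_KEY with your own Eden AI API key
headers = {"Authorization": "Bearer YOUR_API_KEY"}
url = "https://api.edenai.run/v2/ocr/ocr_tables_async"

# Image of the table to parse and the provider that will process it
file_url = "https://developer.mozilla.org/en-US/docs/Learn/HTML/Tables/Basics/numbers-table.png"
provider = "amazon"

payload = {
    "providers": provider,
    "file_url": file_url,
    "language": "en",
}

# Launch the asynchronous job
response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # stop here on an HTTP error
result = response.json()

# Identifier used later to fetch the result of the job
job_id = result['public_id']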

The API returns a public_id that we can now use to get the result of the job. Since we don’t know when it will finish, we will poll the job and check its status every 5 seconds.


import time

def poll_ocr_table_job(job_id: str, max_poll_count: int = 10, poll_interval_sec: int = 5) -> dict:
    """
    Poll the asynchronous job every `poll_interval_sec` seconds.
    Raises TimeoutError if the job is still not finished after `max_poll_count` polls.
    """
    for _ in range(max_poll_count):
        time.sleep(poll_interval_sec)
        response = requests.get(f"{url}/{job_id}", headers=headers)
        response.raise_for_status()  # stop here on an HTTP error
        data = response.json()
        if data['status'] == 'finished':
            return data
    raise TimeoutError("Call took too long.")


poll_response = poll_ocr_table_job(job_id)
# We know there is only one page and one table.
# In practice, you can iterate over pages and create one CSV file per table found.
json_table = poll_response['results'][provider]['pages'][0]['tables'][0]
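
The snippet above keeps only the first table of the first page. When a document has several pages or tables, the same structure can be walked in a loop. The following is a minimal sketch, assuming every page carries a `tables` list and every table has `rows` of `cells` with a `text` field, as in the response used above. The function name `write_tables_to_csv` and the file naming scheme are hypothetical choices made for illustration.

```python
import csv
from pathlib import Path


def write_tables_to_csv(provider_result: dict, out_dir: str = ".") -> list:
    """Write one CSV file per table found and return the created paths.

    `provider_result` is assumed to be poll_response['results'][provider].
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for page_idx, page in enumerate(provider_result.get("pages", []), start=1):
        for table_idx, table in enumerate(page.get("tables", []), start=1):
            # One list of strings per table row
            rows = [
                [cell["text"] for cell in row["cells"]]
                for row in table["rows"]
            ]
            # Hypothetical naming scheme: page_<n>_table_<m>.csv
            target = out_path / f"page_{page_idx}_table_{table_idx}.csv"
            with open(target, "w", newline="", encoding="utf-8") as csvfile:
                csv.writer(csvfile).writerows(rows)
            written.append(target)
    return written
```

With the response from this tutorial, it could be called as `write_tables_to_csv(poll_response['results'][provider], "tables")`.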


2. Use the Python csv library to generate the CSV

Now that we have the table, we need to format it into multiple lists of strings, each list representing a row.

Example:


[['header1', 'header2'], ['data1', 'data2']]

Here is how to do it:


csv_table = []
for row in json_table['rows']:
    # Collect the text of every cell in this row
    csv_row = []

    for cell in row['cells']:
        csv_row.append(cell['text'])

    csv_table.append(csv_row)
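
OCR output is not always perfectly regular: a cell may come back without text, or rows may end up with different numbers of cells. The following is a minimal sketch of a more defensive version of the loop above. It assumes the same `rows` / `cells` / `text` layout; the function name `table_to_rows`, the trimming, and the padding behaviour are illustrative choices, not part of the Eden AI API.

```python
def table_to_rows(json_table: dict) -> list:
    """Convert an OCR table dict into a list of rows (lists of strings)."""
    rows = []
    for row in json_table.get("rows", []):
        # A missing or null `text` becomes an empty string;
        # surrounding whitespace is trimmed.
        rows.append(
            [(cell.get("text") or "").strip() for cell in row.get("cells", [])]
        )

    # Pad shorter rows so every CSV line has the same number of columns
    width = max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]
```

It could replace the loop with `csv_table = table_to_rows(json_table)`.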


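Finally, we just need to create a CSV file and write the data into it:


import csv

# newline='' lets the csv module handle line endings itself
# (otherwise blank lines can appear between rows on Windows)
with open("table.csv", 'w', newline='') as csvfile:
    tablewriter = csv.writer(csvfile)
    tablewriter.writerows(csv_table)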
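
To check the result, the file can be read back with `csv.reader`. The following is a small self-contained sketch; the sample rows and the temporary file are placeholders standing in for `csv_table` and `table.csv`.

```python
import csv
import os
import tempfile

# Placeholder rows standing in for `csv_table`
sample_rows = [["header1", "header2"], ["data1", "data, with comma"]]

# Write to a temporary file so the sketch does not touch `table.csv`
path = os.path.join(tempfile.mkdtemp(), "check.csv")
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    csv.writer(csvfile).writerows(sample_rows)

# Read the file back: quoting added by the writer is undone by the reader
with open(path, newline="", encoding="utf-8") as csvfile:
    read_back = list(csv.reader(csvfile))

print(read_back == sample_rows)
```

The second sample row contains a comma inside a cell, which shows that the csv module takes care of quoting and the data survives the round trip unchanged.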


Here is the resulting CSV file:

Chris,38
Dennis,45
Sarah,29
Karen,47

Conclusion

There you have it! We have parsed a table from a document and transformed it into a CSV file. It takes only a few lines of Python, and the same approach should be straightforward to implement in other languages.
