Table Extractor Flask App – with source code – 2024

Hey guys, in today’s article we will build a Table Extractor Flask App using Python. We will use the pdfplumber library to extract tables from PDF files. It will be a fun article, so without further ado, let’s do it…

Introducing the Table Extractor Flask App

The Table Extractor Flask App is a web-based application designed to simplify the extraction of tabular data from PDF documents. Leveraging the powerful capabilities of the pdfplumber library, the app offers a user-friendly interface for uploading PDF files and extracting structured data from tables within those documents.

Snapshots

Extracted Tables on the Front End

Table Extractor Flask App

Extracted Tables in the Excel File

Table Extractor Flask App

How our Table Extractor Flask App Works

  • Run the Flask app with the ‘flask run’ command.
  • An interface opens in the browser; upload the PDF file there.
  • Click the ‘Extract Table’ button to start the extraction in the background.
  • The pdfplumber library searches each page of the PDF file for a table.
  • Any tables found are converted to HTML and passed to the front end, where they are displayed.
  • An Excel file containing all the extracted tables is also saved.
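
The conversion step in the list above can be tried without a PDF at hand. pdfplumber's extract_table() returns a table as a list of rows, each row being a list of cell strings (a cell may be None when it is empty). The sketch below is a simplified, standard-library stand-in for the pandas to_html call used in app.py, not the app's actual code; the function name rows_to_html and the sample data are hypothetical.

```python
from html import escape


def rows_to_html(rows):
    """Render a table given as a list of rows into an HTML <table> string.

    The first row is treated as the header, as app.py does.
    None cells become empty strings, and cell text is HTML-escaped.
    """
    def cells(row, tag):
        return "".join(f"<{tag}>{escape(cell or '')}</{tag}>" for cell in row)

    header, *body = rows
    lines = ["<table>", f"  <tr>{cells(header, 'th')}</tr>"]
    lines += [f"  <tr>{cells(row, 'td')}</tr>" for row in body]
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical sample in the shape returned by page.extract_table()
sample = [["Item", "Qty"], ["Pens", "12"], ["Ink <black>", None]]
print(rows_to_html(sample))
```

Escaping the cell text matters here because the template marks the table HTML as safe, so whatever text the PDF contains ends up in the page unfiltered.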

Source Code for Table Extractor Flask App

app.py

import os
from werkzeug.utils import secure_filename
from flask import Flask, request, render_template
import pdfplumber
import pandas as pd


UPLOAD_FOLDER = './static/uploads'
ALLOWED_EXTENSIONS = {'pdf'}

app = Flask(__name__)
app.config['SEND_FILE_MAX_AGE_DEFAULT'] = 0
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
app.secret_key = "secret key"

# Make sure the upload folder exists before any file is saved into it
os.makedirs(UPLOAD_FOLDER, exist_ok=True)

def allowed_file(filename):
    # Accept only file names that end with an allowed extension
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

def extract(pdf_path):
    # Collect one DataFrame for every page that contains a table
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_table() returns the largest table on the page, or None
            table = page.extract_table()
            if table:
                # Use the first row as the header and the rest as data
                table = pd.DataFrame(table[1:], columns=table[0])
                tables.append(table)
    return tables

@app.route('/')
def home():
    return render_template('home.html')

@app.route('/extract_table', methods=['POST'])
def extract_table():
    file = request.files.get('file')
    # Show the upload page again if no PDF file was submitted
    if not file or not allowed_file(file.filename):
        return render_template('home.html')

    filename = secure_filename(file.filename)
    pdf_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
    file.save(pdf_path)
    tables = extract(pdf_path)

    # Write one sheet per table; skip the file when no table was found,
    # because a workbook needs at least one sheet
    if tables:
        with pd.ExcelWriter('extracted_tables.xlsx') as writer:
            for i, df in enumerate(tables):
                df.to_excel(writer, sheet_name=str(i+1), index=False)

    tables = [x.to_html(index=False) for x in tables]

    return render_template('home.html', org_img_name=filename, tables=tables, ntables=len(tables))


if __name__ == '__main__':
    app.run(debug=True)
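
The extract function in app.py calls extract_table(), which gives back at most one table per page. If a page holds several tables, pdfplumber's extract_tables() returns all of them. The following is a hedged sketch of how that variant might look, not part of the original app; collect_tables and tables_from_pdf are hypothetical helper names.

```python
def collect_tables(pages):
    """Return every non-empty table from every page, in page order.

    `pages` is any iterable of objects that offer extract_tables(),
    such as the `pdf.pages` list of an open pdfplumber PDF.
    """
    found = []
    for page in pages:
        for rows in page.extract_tables():
            if rows:  # skip empty results
                found.append(rows)
    return found


def tables_from_pdf(pdf_path):
    # Imported here so collect_tables() can be tried without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return collect_tables(pdf.pages)
```

Each returned table is still a list of rows, so it can be turned into a DataFrame in the same way as in extract.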

home.html

<!doctype html>
<html lang="en">

<head>
    <!-- Required meta tags -->
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <style type='text/css'>
        body {
            background-image: url('https://cdn.pixabay.com/photo/2018/12/18/22/29/background-3883181_1280.jpg');
            background-repeat: no-repeat;
            background-attachment: fixed;
            background-size: cover;
            font-family: sans-serif;
            margin-top: 40px;
        }

        .regform {
            width: 800px;
            background-color: rgba(0, 0, 0, 0.8);
            margin: auto;
            color: #FFFFFF;
            padding: 10px 0px 10px 0px;
            text-align: center;
            border-radius: 15px 15px 0px 0px;
        }

        .main-form {
            width: 800px;
            margin: auto;
            background-color: rgba(0, 0, 0, 0.7);
            padding-left: 50px;
            padding-right: 50px;
            padding-bottom: 20px;
            color: #FFFFFF;
        }
    </style>

    <!-- Bootstrap CSS -->
    <link href="https://cdn.jsdelivr.net/npm/bootstrap@5.0.0-beta3/dist/css/bootstrap.min.css" rel="stylesheet"
        integrity="sha384-eOJMYsd53ii+scO/bJGFsiCZc+5NDVN2yr8+0RDqr0Ql0h+rP48ckxlpbzKgwra6" crossorigin="anonymous">

    <title>PDF Tables Extraction</title>
</head>

<body>

    <div class='regform mt-3'>
        <h1>PDF Tables Extraction</h1>
    </div>

    <form action='/extract_table' class='main-form' method="POST" enctype="multipart/form-data">

        <div class='text-center'>
            <input type="file" id="file" name='file' style="margin-top:10px;margin-bottom:10px;">
            <button type='submit' class='btn btn-outline-success'> Extract Table
            </button>
        </div>

    </form>

    {% if ntables %}
    <div class="main-form" style="margin:auto;">
        <div class="text-center">
            {% for i in range(ntables) %}
            <h2>Table {{i+1}}</h2>
            <div class="row" style="text-align: right; border: ridge;margin-bottom: 40px;">
                {{ tables[i] | safe}}
            </div>
            {% endfor %}

        </div>
    </div>
    {% endif %}

</body>

</html>

Conclusion

The Table Extractor Flask App, powered by the pdfplumber Python library, offers a convenient and efficient solution for extracting tabular data from PDF documents. By automating the extraction process and providing a user-friendly interface, the app streamlines data-handling tasks and enhances productivity.

So, this is how you can build your own Table Extractor Flask App using Python. If you have any questions about it, you can contact me by email.

FAQs

What is the Table Extractor Flask App?

The Table Extractor Flask App is a web-based application designed to extract tabular data from PDF documents. It utilizes the pdfplumber Python library to automatically detect tables within PDF files and extract structured data from them.

How does the app work?

Users upload a PDF document containing tables to the app. The app then goes through the document page by page and extracts the tabular data with pdfplumber’s table extraction. The extracted tables are displayed on the page and saved to an Excel file, extracted_tables.xlsx, with one sheet per table.
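
The app.py listing above writes an Excel file only. If CSV output is wanted as well, one possible approach is sketched below using only the standard library. It assumes a table in the list-of-rows shape that pdfplumber returns; table_to_csv and the sample data are hypothetical names, not part of the app.

```python
import csv
import io


def table_to_csv(rows):
    """Serialise one table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Empty cells may arrive as None; write them as empty fields
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical sample table
sample = [["Item", "Qty"], ["Pens, blue", "12"], ["Ink", None]]
print(table_to_csv(sample))
```

Inside app.py, where each table is already a pandas DataFrame, calling its to_csv method would be the more direct route.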

What types of documents does the app support?

The app supports PDF documents containing tables. It can handle various types of tables, including those found in financial reports, research papers, invoices, and legal documents.
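
Note that app.py decides whether an upload is a PDF from the file name alone. As a hedged sketch of a slightly stricter check, the first bytes of the upload can be compared with the "%PDF-" marker that PDF files normally begin with. The helper name looks_like_pdf is hypothetical.

```python
import io

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(stream):
    """Return True when the binary stream starts with the PDF header marker.

    The stream position is restored afterwards, so the file can still be
    saved or parsed from the same position.
    """
    start = stream.tell()
    header = stream.read(len(PDF_MAGIC))
    stream.seek(start)
    return header == PDF_MAGIC


print(looks_like_pdf(io.BytesIO(b"%PDF-1.7\n...")))  # True
print(looks_like_pdf(io.BytesIO(b"PK\x03\x04...")))  # False
```

In the Flask route, the uploaded file's stream attribute could be passed to such a helper before the file is saved.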

Is the extracted data accurate?

Not always. pdfplumber finds tables from the lines and text positions on each page, so the accuracy of the extracted data depends on the quality and formatting of the input document. It works best on machine-generated PDFs rather than scanned ones. Users should review and validate the extracted data before relying on it.
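
Part of that review can be automated. The sketch below, which assumes a table in the list-of-rows shape returned by pdfplumber, reports two simple warning signs: rows of differing width and a high share of empty cells. The function name table_report and what counts as a warning sign are assumptions for illustration, not features of the app.

```python
def table_report(rows):
    """Summarise simple quality signals for one extracted table."""
    widths = {len(row) for row in rows}
    cells = [cell for row in rows for cell in row]
    empty = sum(1 for cell in cells if cell is None or not str(cell).strip())
    return {
        "rows": len(rows),
        "consistent_width": len(widths) <= 1,
        "empty_ratio": empty / len(cells) if cells else 0.0,
    }


# Hypothetical sample table with two empty cells out of eight
sample = [["Item", "Qty"], ["Pens", "12"], ["Ink", None], ["Paper", ""]]
print(table_report(sample))
```

A table with inconsistent widths or mostly empty cells is a good candidate for a manual look at the source PDF.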

Read my last article – Words Counter and Paragraphs Counter Flask App using Python

Check out my other machine learning projects, deep learning projects, computer vision projects, NLP projects, and Flask projects at machinelearningprojects.net
