So guys, in today’s blog we will see how to extract tables from PDF files and save them as CSV files using just 3-4 lines of code.
This use case is very handy when you need to extract any number of tables from a PDF file. So without further ado, let’s do it…
Snapshot of our Final CSV…
Step 1 – Install Camelot
- To install the Camelot library, run the following command in your terminal.
pip install "camelot-py[cv]"
Step 2 – Importing required libraries
- For today’s use case, we just need to import the Camelot library.
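
A minimal sketch of the import step. The try/except wrapper is my own addition (not required by Camelot) so that you get a readable hint if Step 1 was skipped:

```python
# Import Camelot, with a friendly hint if it has not been installed yet
try:
    import camelot
except ImportError:
    camelot = None
    print('Camelot not found. Install it with: pip install "camelot-py[cv]"')
```

If the install from Step 1 went fine, a plain `import camelot` is all you need.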
Step 3 – Reading the PDF file.
- Download the pdf file.
- Here we simply use the camelot.read_pdf function to read our PDF file and extract the tables from it automatically.
- If our PDF has more than one page, we can also specify the page numbers from which the tables should be read (by default only the first page is read).
- Also, if our PDF file is password protected, we can pass the password as a parameter to the read_pdf function.
tables = camelot.read_pdf('table.pdf')
# tables = camelot.read_pdf('table.pdf', pages='1,2,3,5-7,8')
# tables = camelot.read_pdf('table.pdf', password='*******')
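
If you build the page list in code, a small helper can save some string fiddling. This is just a sketch: build_pages_arg and read_tables are hypothetical names of mine, not part of Camelot. Only camelot.read_pdf and its pages and password parameters come from the library.

```python
def build_pages_arg(pages):
    """Turn [1, 2, (5, 7)] into the '1,2,5-7' string that read_pdf expects."""
    parts = []
    for p in pages:
        if isinstance(p, tuple):
            # A (start, end) tuple becomes a page range such as '5-7'
            parts.append(f"{p[0]}-{p[1]}")
        else:
            parts.append(str(p))
    return ",".join(parts)


def read_tables(path, pages=None, password=None):
    """Read tables from a PDF, passing pages/password only when given."""
    import camelot  # imported here so the helper above works without Camelot

    kwargs = {}
    if pages:
        kwargs["pages"] = build_pages_arg(pages)
    if password:
        kwargs["password"] = password
    return camelot.read_pdf(path, **kwargs)


print(build_pages_arg([1, 2, 3, (5, 7), 8]))  # 1,2,3,5-7,8
```

With this in place, a call like read_tables('table.pdf', pages=[1, (5, 7)]) would read page 1 and pages 5 to 7.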
Step 4 – Let’s extract tables from PDF files
- Since we already know that our PDF file has just one table, we simply use tables[0].df, which gives us the 0th element (table) in our tables as a pandas DataFrame.
- When you are working with multiple tables simply run a for-loop.
# Access the 0th table as a pandas DataFrame
tables[0].df
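
For PDFs with several tables, the for-loop mentioned above could look roughly like this. collect_dataframes is a hypothetical helper name; it only relies on the object returned by read_pdf supporting len() and indexing, and on each table having a .df attribute.

```python
def collect_dataframes(tables):
    """Return the DataFrame of every extracted table, in order."""
    frames = []
    for i in range(len(tables)):
        # tables[i] is a single table; .df is its pandas DataFrame
        frames.append(tables[i].df)
    return frames
```

You could then do something like `for df in collect_dataframes(tables): print(df.head())` to peek at each table.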
Step 5 – Save the table in CSV format
- Simply use the tables.export method to save the tables in CSV format.
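
A small sketch of the export step. save_tables and the default file name are my own placeholders; tables.export with f="csv" is the Camelot call doing the real work.

```python
def save_tables(tables, out_path="table.csv"):
    """Write every extracted table to CSV and return how many were saved."""
    # Camelot writes one file per table and, as far as I know, adds a
    # page/table suffix to the name (e.g. table-page-1-table-1.csv)
    tables.export(out_path, f="csv")
    return len(tables)
```

If you only care about a single table, tables[0].to_csv('table.csv') writes just that one to exactly the file name you give it.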
Step 6 – Visualizing the conversion metrics
- Use tables[0].parsing_report to view the conversion metrics (accuracy, whitespace, order and page) for a table.
- Read more about the advanced usage of the Camelot library here.
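
When there are many tables, you may want to check the report for each one and flag the doubtful ones. Here is a rough sketch: low_accuracy_tables and the threshold of 90 are arbitrary choices of mine, and the only thing taken from Camelot is the parsing_report dictionary with its "accuracy" key.

```python
def low_accuracy_tables(tables, threshold=90.0):
    """Return (index, report) pairs for tables whose accuracy is below threshold."""
    flagged = []
    for i in range(len(tables)):
        report = tables[i].parsing_report  # dict of conversion metrics
        if report["accuracy"] < threshold:
            flagged.append((i, report))
    return flagged
```

Anything this returns is worth opening by hand to see whether the extraction went wrong.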
And this is how you Extract Tables from PDF files…
So that’s all for this blog, folks. Thanks for reading, and I hope you are taking something away with you. Till next time…
Read my previous post: How to Deploy a Flask app online using Pythonanywhere