Improve Article
Save Article
Like Article
Improve Article
Save Article
Like Article
In this article, we will see how to convert a PDF to Excel or CSV File Using Python. It can be done with various methods, here are we are going to use some methods.
Method 1: Using pdftables_api
Here will use the pdftables_api Module for converting the PDF file into any other format. It’s a simple web-based API, so can be called from any programming language.
Installation:
pip install git+https://github.com/pdftables/python-pdftables-api.git
After Installation, you need an API KEY. Go to PDFTables.com and signup, then visit the API Page to see your API KEY.
For Converting PDF File Into excel File we will use xml() method.
Syntax:
xml(pdf_path, xml_path)
Below is the Implementation:
PDF File Used:
PDF FILE
Python3
import
pdftables_api
conversion
=
pdftables_api.Client(
'API KEY'
)
conversion.xlsx(
"pdf_file_path"
,
"output_file_path"
)
Output:
EXCEL FILE
Method 2: Using tabula-py
Here will use the tabula-py Module for converting the PDF file into any other format.
Installation:
pip install tabula-py
Before we start, first we need to install java and add a java installation folder to the PATH variable.
- Install java click here
- Add java installation folder (C:Program Files (x86)Javajre1.8.0_251bin) to the environment path variable
Approach:
- Read PDF file using read_pdf() method.
- Then we will convert the PDF files into an Excel file using the to_excel() method.
Syntax:
read_pdf(PDF File Path, pages = Number of pages, **agrs)
Below is the Implementation:
PDF File Used:
PDF FILE
Python3
import
tabula
df
=
tabula.read_pdf(
"PDF File Path"
, pages
=
1
)[
0
]
df.to_excel(
'Excel File Path'
)
Output:
EXCEL FILE
Like Article
Save Article
Updated February 2019
You can convert your PDF to Excel, CSV, XML or HTML with Python using the PDFTables API. Our API will enable you to convert PDFs without uploading each one manually.
In this tutorial, I’ll be showing you how to get the library set up on your local machine and then use it to convert PDF to Excel, with Python.
Here’s an example of a PDF that I’ve converted with the library. In order to properly test the library, make sure you have a PDF handy!
Step 1
If you haven’t already, install Anaconda on your machine from Anaconda website. You can use either Python 3.6.x or 2.7.x, as the PDFTables API works with both. Downloading Anaconda means that pip will also be installed. Pip gives a simple way to install the PDFTables API Python package.
For this tutorial, I’ll be using the Windows Python IDLE Shell, but the instructions are almost identical for Linux and Mac.
Step 2
In your terminal/command line, install the PDFTables Python library with:
pip install git+https://github.com/pdftables/python-pdftables-api.git
If git is not recognised, download it here. Then, run the above command again.
Or if you’d prefer to install it manually, you can download it from python-pdftables-api then install it with:
python setup.py install
Step 3
Create a new Python script then add the following code:
import pdftables_api c = pdftables_api.Client('my-api-key') c.xlsx('input.pdf', 'output') #replace c.xlsx with c.csv to convert to CSV #replace c.xlsx with c.xml to convert to XML #replace c.xlsx with c.html to convert to HTML
Now, you’ll need to make the following changes to the script:
- Replace
my-api-key
with your PDFTables API key, which you can get here. - Replace
input.pdf
with the PDF you would like to convert. - Replace
output
with the name you’d like to give the converted document.
Now, save your finished script as convert-pdf.py
in the same directory as the PDF document you’d like to convert.
If you don’t understand the script above, see the script overview section.
Step 4
Open your command line/terminal and change your directory (e.g. cd C:/Users/Bob
) to the folder you saved your convert-pdf.py
script and PDF in, then run the following command:
python convert-pdf.py
To find your converted spreadsheet, navigate to the folder in your file explorer and hey presto, you’ve converted a PDF to Excel or CSV with Python!
Script overview
The first line is simply importing the PDFTables API toolset, so that Python knows what to do when certain actions are called. The second
line is calling the PDFTables API with your unique API key. This means here at PDFTables we know which account is using the API and how many
PDF pages are available. Finally, the third line is telling Python to convert the file with name input.pdf
to xlsx and also what
you would like it to be called upon output: output
. To convert to CSV, XML or HTML simply change c.xlsx
to be c.csv
,
c.xml
or c.html
respectively.
Looking to convert multiple PDF files at once?
Check out our blog post here.
Love PDFTables? Leave us a review on our Trustpilot page!
I want to convert a pdf file into excel and save it in local via python.
I have converted the pdf to excel format but how should I save it local?
my code:
df = ("./Downloads/folder/myfile.pdf")
tabula.convert_into(df, "test.csv", output_format="csv", stream=True)
asked Nov 4, 2019 at 9:28
You can specify your whole output path instead of only output.csv
df = ("./Downloads/folder/myfile.pdf")
output = "./Downloads/folder/test.csv"
tabula.convert_into(df, output, output_format="csv", stream=True)
Hope this answers your question!!!
answered Nov 4, 2019 at 9:41
skaul05skaul05
2,1143 gold badges17 silver badges25 bronze badges
0
In my case, the script below worked:
import tabula
df = tabula.read_pdf(r'C:UsersuserDownloadsfolder3.pdf', pages='all')
tabula.convert_into(r'C:UsersuserDownloadsfolder3.pdf', r'C:UsersuserDownloadsfoldertest.csv' , output_format="csv",pages='all', stream=True)
David Buck
3,69335 gold badges33 silver badges35 bronze badges
answered Aug 8, 2020 at 12:48
i use google collab
install the packege needed
!pip install tabula-py
!pip install pandas
Import the required Module
import tabula
import pandas as pd
Read a PDF File
data = tabula.read_pdf("example.pdf", pages='1')[0] # "all" untuk semua data, pages diisi nomor halaman
convert PDF into CSV
tabula.convert_into("example.pdf", "example.csv", output_format="csv", pages='1') #"all" untuk semua data, pages diisi no halaman
print(data)
to convert to excell file
data1 = pd.read_csv("example.csv")
data1.dtypes
now save to xlsx
data.to_excel('example.xlsx')
answered Feb 23 at 2:02
1
Documentation says that:
Output file will be saved into output_path
output_path is your second parameter, «test.csv». I guess it works fine, but you are loking it in the wrong folder. It will be located near to your script (to be strict — in current working directory) since you didn’t specify full path.
answered Nov 4, 2019 at 9:43
QtRoSQtRoS
1,1391 gold badge17 silver badges23 bronze badges
0
PDF to .xlsx file:
for item in df:
list1.append(item)
df = pd.DataFrame(list1)
df.to_excel('outputfile.xlsx', sheet_name='Sheet1', index=True)
answered Apr 8, 2021 at 10:03
HithHith
495 bronze badges
you can also use camelot
in combination with pandas
import camelot
import pandas
tables = camelot.read_pdf(path_to_pdf, flavor='stream',pages='all')
df = pandas.concat([table.df for table in tables])
df.to_csv(path_to_csv)
answered Dec 7, 2022 at 11:31
smoquetsmoquet
2913 silver badges10 bronze badges
Last Updated on July 14, 2022 by
In this tutorial, we’ll take a look at how to convert PDF to Excel with Python. If you work with data, the chances are that you have had, or will have to deal with data stored in a .pdf file. It’s difficult to copy a table from PDF and paste it directly into Excel. In most cases, what we copy from the PDF file is text, instead of formatted Excel tables. Therefore, when pasting the data into Excel, we see a chunk of text squeezed into one cell.
Of course, we don’t want to copy and paste individual values one by one into Excel. There are several commercial software that allows PDF to Excel conversion, but they charge a hefty fee. If you are willing to learn a little bit of Python, it takes less than 10 lines of code to achieve a reasonably good result.
We’ll extract the COVID-19 cases by country from the WHO’s website. I’m attaching it here in case the source file gets removed later.
Step 1. Install Python library and Java
tabula-py
is a Python wrapper of tabula-java, which can read tables in PDF file. It means that we need to install Java first. The installation takes about 1 minute, and you can follow this link to find the Java installation file for your operating system: https://java.com/en/download/help/download_options.xml.
Once you have Java, install tabula-py with pip
:
pip install tabula-py
We are going to extract the table on page 3 of the PDF file. tabula.read_pdf()
returns a list of dataframes. For some reason, tabula detected 8 tables on this page, looking through them, we see that the second table is what we want to extract. Thus we specify that we want to get the second element of that list using [1]
.
import tabula
df = tabula.read_pdf('data.pdf', pages = 3, lattice = True)[1]
If this is your first time installing Java and tabula-py
, you might get the following error message when running the above 2 lines of code:
tabula.errors.JavaNotFoundError: `java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`
Which is due to Java folder is not in the PATH system variable. Simply add your Java installation folder to the PATH variable. I used the default installation, so the Java folder is C:Program Files (x86)Javajre1.8.0_251bin
on my laptop.
Now the script should run.
By default, tabula-py
will extract tables from PDF file into a pandas
dataframe
. Let’s take a look at the data by inspecting the first 10 rows with .head(10)
:
We immediately see two problems with this unprocessed table: the header row contains weird letters “r”, and there are many NaN values. We’ll have to do a little bit further clean up to make the data useful.
Step 2. Clean up the header row
Let’s first clean up the header row. df.columns
returns the dataframe header names.
We can replace the “r” in the header by doing the following:
df.columns = df.columns.str.replace('r', ' ')
.str
returns all of the string values of the header, then we can perform the .replace()
function to replace “r” with a space. Then, we assign the clean string values back to the dataframe’s header (columns)
Step 3. Remove NaN values
Next, we’ll clean those NaN values, which were created by the function tabula.read_pdf()
, for whenever a particular cell is blank. These values cause troubles for us when doing data analysis, so most of the time we’ll remove them. Glancing through the table, it appears we can remove the rows that contain NaN values without losing any data points. Lucky for us, pandas
provide a convenient way to remove rows with NaN values.
data = df.dropna()
data.to_excel('data.xlsx')
Putting it all together
import tabula
df = tabula.read_pdf('data.pdf', pages = 3, lattice = True)[1]
df.columns = df.columns.str.replace('r', ' ')
data = df.dropna()
data.to_excel('data.xlsx')
Now you see, it takes only 5 lines of code to convert PDF to Excel with Python. It’s simple and powerful. The best part? You control what you want to extract, keep, and change!
Home > How-tos > How to Convert PDF to Excel / CSV Using Python: A Step-By-Step Tutorial
-
By
Jack Buger -
April 6, 2023
Many are the times when you need to extract data from a PDF and export it in a different format to avoid the need to retype all the content for reuse. While most of us have become accustomed to fully-fledged PDF converter software, it is also possible to achieve the exact task at hand using a popular programming language like Python. Unfortunately, Python has not seen a boatload of packages that can accomplish this reliably but at the same time, there are a few that are still able to kick the ball out of the park for you.
As usual, we have gone the extra mile to let you in on the tools you can use and at the same time guide you on how to get started with them. Here, we will look at two Python tools that will come in handy to convert PDF to Excel offline. The good thing is that these tools are free to obtain and the code used can be found on Github. You will be able to transform PDF to Excel without the need to set up any software on your computer. Let’s now learn how to extract data from PD files.
This Tutorial Covers
Method 1). Using PyPDF2 and PDFTables
PDFTables provides you with an API that you can use in combination with Python to convert PDF to Excel. Actually, it will help you convert any PDF file to either Excel, CSV. XML or HTML depending on which one works best for you. For the sake of this article, we are going to put our focus in the PDF to CSV or Excel conversion process. This will happen by just running a very simplified Python script that will eventually save you a great deal of time and effort. In order to help you make the most out of this tool, below is a comprehensive guide that will sail you to the end of a successful task to convert PDF to Excel using Python. Let’s get started.
- 1. From the Anaconda website, install Anaconda. This will install Python on your computer and at the same time have pip set up to help install packages.
- 2. Set up the PDFTables Python library. In the directory containing the PDF file to be converted, start a command interface, input the code below, and hit “Enter”.
pip install git+https://github.com/pdftables/python-pdftables-api.git
If you get a git related error, install it from here.
- 3. Create a Python script containing the code below. This is the code that will be necessary to make the conversion successful.
import pdftables_api
c = pdftables_api.Client(‘my-api-key’)
c.xlsx(‘input.pdf’, ‘output’)
#replace c.xlsx with c.csv to convert to CSV
#replace c.xlsx with c.xml to convert to XML
#replace c.xlsx with c.html to convert to HTML
A few things to do here though include;
-
-
-
-
-
- Grab your API key from the PDFTables website and replace my-api-key
- With the target PDF filename at hand, replace the pdf appropriately
- Replace output with your preferred name for the converted file.
-
-
-
-
Once you have done that, save the PyPDF2 script as convert-pdf.py in the same folder as the source PDF file.
- 4. Open a command line in the source folder and run the saved script. Click on the Address bar, type the word cmd, and hit “Enter”. Next, type in this code py convert-pdf.py and hit “Enter”.
The PDF to Excel process will begin and within moments, you should be able to notice a new file with the XLSX file format in addition to the original PDF file. This will be tacit proof that you have successfully managed to convert PDF to Excel using Python.
Method 2): Using PDFMiner for Extracting Data from PDFs
PDFMiner has been crafted as a suitable tool when you need to parse and extract data from PDF files. It works on a Python base and that means you need to have Python set up on your computer before getting started. Its main focus lies in the extraction and analysis of textual data and is able to give the location of text with accompanying font and line information.
You are also able to convert encrypted PDF to CSV though you have to input the right password for the file. Better yet, this tool is open source and is available for free from Github. As much as possible, this tool to extract data from PDF to Excel using Python ensures that the original layout of the text is maintained. Here are the steps to use PDFMiner.
- Create a Folder and place the target PDF file inside. This is largely for convenience purposes though this tool can be run from any folder once installed.
- Install Python 3.6 or newer on your computer. This is necessary since PDFMiner is a Python tool.
- Open a command-line interface in the PDF directory. From the File Explorer “Address bar”, type in the word cmd and hit the “Enter” key.
- Install PDFMiner. Basically, type in the command “pip install pdfminer” and hit the “Enter” key on your keyboard. Wait for the process to complete.
- Extract data from PDF. To do this, type the command “pdf2txt.py -o sample.csv sample.pdf” and hit the “Enter” key. Remember to replace the word “sample” with your PDF filename. To break down the command, we are simply extracting data from the sample.pdf and outputting the data in the file sample.csv. Opening the output file will reveal the extracted data.
After going through the steps above, you should be having a file containing your extracted data ready for opening. Keep in mind that you have to input the right source PDF filename and the output filename that you prefer. You can also output to XLSX or XLS format if you don’t want to use the CSV format.
Editors’ Recommendations
-
-
- Convert PDF to Excel Using VBA: A Step-by-Step Tutorial
- Convert PDF to Excel Using VBA: A Step-by-Step Tutorial
-
In this article we will see how to quickly extract a table from a PDF to Excel.
For this tutorial you will need two Python libraries :
- tabula-py
- pandas
To install them, go to your terminal/shell and type these lines of code:
pip install tabula-py
pip install pandas
If you use Google Colab, you can install these libraries directly on it. You just have to add an exclamation mark “!” in front of it, like this:
!pip install tabula-py
!pip install pandas
[smartslider3 slider=”10″]
Photo by Aurelien Romain on Unsplash
PDF to Excel (one table only)
First we load the libraries into our text editor :
import tabula
import pandas as pd
Then, we will read the pdf with the read_pdf() function of the tabula library.
This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal for converting them into Excel files!
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
We can then check that the table has the expected shape.
df.head()
Then convert it to an Excel file !
df.to_excel('file_path/file.xlsx')
The entire code :
THE PANE METHOD FOR DEEP LEARNING!
Get your 7 DAYS FREE TRAINING to learn how to create your first ARTIFICIAL INTELLIGENCE!
For the next 7 days I will show you how to use Neural Networks.
You will learn what Deep Learning is with concrete examples that will stick in your head.
BEWARE, this email series is not for everyone. If you are the kind of person who likes theoretical and academic courses, you can skip it.
But if you want to learn the PANE method to do Deep Learning, click here :
import tabula import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
df.to_excel('file_path/file.xlsx')
PDF containing several tables
We load the libraries in our text editor :
import tabula
import pandas as pd
Then, we will read the pdf with the read_pdf() function of the tabula library.
This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal to convert them then in Excel file !
Here, the variable df will be in fact a list of DataFrame. The first element corresponds to the first table, the second to the second table, etc.
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
To save these tables separately, you will have to use a for loop that will save each table in an Excel file.
for i in range(len(df)):
df[i].to_excel('file_'+str(i)+'.xlsx')
The entire code :
import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
for i in range(len(df)):
df[i].to_excel('file_'+str(i)+'.xlsx')
sources:
- Medium
- Photo by Birger Strahl on Unsplash
THE PANE METHOD FOR DEEP LEARNING!
Get your 7 DAYS FREE TRAINING to learn how to create your first ARTIFICIAL INTELLIGENCE!
For the next 7 days I will show you how to use Neural Networks.
You will learn what Deep Learning is with concrete examples that will stick in your head.
BEWARE, this email series is not for everyone. If you are the kind of person who likes theoretical and academic courses, you can skip it.
But if you want to learn the PANE method to do Deep Learning, click here :
Converting PDF to Excel: There are several online tools and websites with the help of which we can easily convert PDF files to Excel. However, converting the PDF files to Excel using Python is much easier. This is because, unlike online tools, we don’t have to upload files to websites to convert them. To convert the data, all that is required is to extract the file into Python. Python uses the function PDF tables API for file conversations.
In this article, let us discuss how to convert PDF files to Excel files using the PDF tables API. Scroll down to find out more.
Extract Data From Multiple PDF Files to Excel Using Python
Given a PDF file, the task is to convert the given PDF file to Excel in Python.
If you work with data, you have probably had or will have to deal with data saved in a pdf file. It is tough to copy a table from a PDF and paste it immediately into Excel. In most cases, we copy text from a PDF file rather than structured Excel tables. As a result, when we paste the data into Excel, we see a portion of text compressed into one cell.
Of course, we don’t want to manually copy and paste individual values into Excel. There is commercial software that permits PDF to Excel conversion, but it is expensive. If you’re prepared to learn a little Python, you can accomplish a reasonably good outcome with fewer than 10 lines of code.
Prerequisites:
- What is Excel?
Given Pdf File:
- How to Convert PDF to Google Sheets: Free Online Conversion
- Convert a TSV file to Excel using Python
- How to Import an Excel File into Python using Pandas?
Below are the ways to convert the given pdf file to Excel File in Python:
- Using pdftables_api
- Using tabula-py
Method #1: Using pdftables_api
The pdftables API Module will be used here to convert the PDF file into any other format. Because it is a basic web-based API, it may be used by any programming language.
Installation:
pip install git+https://github.com/pdftables/python-pdftables-api.git
Collecting git+https://github.com/pdftables/python-pdftables-api.git Cloning https://github.com/pdftables/python-pdftables-api.git to /tmp/pip-req-build-qfdz6fq6 Running command git clone -q https://github.com/pdftables/python-pdftables-api.git /tmp/pip-req-build-qfdz6fq6 Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from pdftables-api==1.1.0) (2.23.0) Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->pdftables-api==1.1.0) (1.24.3) Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->pdftables-api==1.1.0) (2021.10.8) Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->pdftables-api==1.1.0) (3.0.4) Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->pdftables-api==1.1.0) (2.10) Building wheels for collected packages: pdftables-api Building wheel for pdftables-api (setup.py) ... done Created wheel for pdftables-api: filename=pdftables_api-1.1.0-py3-none-any.whl size=5879 sha256=ddeaa9d1b7e5e0fb16cd34564d1dfa50891be0cb33ec19b70afe5c90830842af Stored in directory: /tmp/pip-ephem-wheel-cache-o0v_cktl/wheels/80/d5/88/7c51378c0b76213ee939fcb303019731948c2271fc8aab2330 Successfully built pdftables-api Installing collected packages: pdftables-api Successfully installed pdftables-api-1.1.0
After installing pdftables we need an API Key to get access.
For getting an API key visit PDFTables.com and login /signup using the email.
Get the API key from https://pdftables.com/pdf-to-excel-api and save it which will be used in the code.
API key Sample:
1)Converting into excel using xlsx() function
Approach:
- Import pdftables_api module using the import Keyword.
- Verification of API_KEY.
- Pass the API_KEY to the Client function of the pdftables_api module and store it in a variable.
- Converting the given SamplePdf to excel by passing the given pdf and output excel file path as arguments to the xlsx() function and apply it to the above object.
- The Exit of the Program.
Below is the Implementation:
# import pdftables_api module using the import Keyword import pdftables_api # Verification of API_KEY #Pass the API_KEY to the Client function of the pdftables_api module and store it in a variable pdf_conversion = pdftables_api.Client('zufjqhsgxitu') # Converting the given SamplePdf to excel by passing the given pdf and output excel file # path as arguments to the xlsx() function and apply it to the above object pdf_conversion.xlsx("samplePdf.pdf", "resultExcel.xlsx")
Output:
Website | Name |
Sheets Tips | Vikram |
Sheets Tips | Akash |
Sheets Tips | Vishal |
Python-Programs | Pavan |
Python-Programs | Dhoni |
Python-Programs | Virat |
BTechGeeks | Devilliers |
BTechGeeks | Pant |
PythonArray | Smith |
PythonArray | Warner |
Output Image:
2)Converting into XML using xml() function
Approach:
- Import pdftables_api module using the import Keyword.
- Verification of API_KEY.
- Pass the API_KEY to the Client function of the pdftables_api module and store it in a variable.
- Converting the given SamplePdf to XML by passing the given pdf and output XML file path as arguments to the xml() function and apply it to the above object.
- The Exit of the Program.
Below is the Implementation:
# import pdftables_api module using the import Keyword import pdftables_api # Verification of API_KEY #Pass the API_KEY to the Client function of the pdftables_api module and store it in a variable pdf_conversion = pdftables_api.Client('zufjqhsgxitu') # Converting the given SamplePdf to XML by passing the given pdf and # output XML file path as arguments to the xml() function and apply it to the above object. pdf_conversion.xml("samplePdf.pdf", "result.xml")
Output:
<document page-count="1"> <page number="1"> <table data-filename="file.pdf" data-page="1" data-table="1"> <tr> <td>Website</td> <td>Name</td> </tr> <tr> <td>Sheets Tips</td> <td>Vikram</td> </tr> <tr> <td>Sheets Tips</td> <td>Akash</td> </tr> <tr> <td>Sheets Tips</td> <td>Vishal</td> </tr> <tr> <td>Python-Programs</td> <td>Pavan</td> </tr> <tr> <td>Python-Programs</td> <td>Dhoni</td> </tr> <tr> <td>Python-Programs</td> <td>Virat</td> </tr> <tr> <td>BTechGeeks</td> <td>Devilliers</td> </tr> <tr> <td>BTechGeeks</td> <td>Pant</td> </tr> <tr> <td>PythonArray</td> <td>Smith</td> </tr> <tr> <td>PythonArray</td> <td>Warner</td> </tr> </table> </page> </document>
Output Image:
Method #2: Using tabula-py
We will use the tabula-py to convert the given pdf to excel file.
Installation:
pip install tabula-py
Output:
Collecting tabula-py Downloading tabula_py-2.3.0-py3-none-any.whl (12.0 MB) |████████████████████████████████| 12.0 MB 5.4 MB/s Collecting distro Downloading distro-1.7.0-py3-none-any.whl (20 kB) Requirement already satisfied: pandas>=0.25.3 in /usr/local/lib/python3.7/dist-packages (from tabula-py) (1.3.5) Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from tabula-py) (1.21.6) Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25.3->tabula-py) (2.8.2) Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas>=0.25.3->tabula-py) (2022.1) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas>=0.25.3->tabula-py) (1.15.0) Installing collected packages: distro, tabula-py Successfully installed distro-1.7.0 tabula-py-2.3.0
Before we begin, we must first install Java and include a java installation path in the PATH variable.
- Click here to install Java.
- Set the environment path variable to the java installation folder (C: Program Files (x64)Javajre1.8.0 251bin).
1)Excel File Without Index
Approach:
- Import the tabula module using the import keyword.
- Pass the given pdf file path and number of pages as an argument to the read_pdf() function of the tabula module and store that dataframe to a variable.
- Convert the data frame to excel using the to_excel() function by passing the arguments output excel file path and boolean variable index.
- The Exit of the Program.
Below is the Implementation:
# import the tabula module using the import keyword import tabula # Pass the given pdf file path and number of pages as an argument to the read_pdf() function # of the tabula module and store that dataframe to a variable. dataframe = tabula.read_pdf("samplePdf.pdf", pages = 1)[0] #Convert the data frame to excel using the to_excel() function # by passing the arguments output excel file path and boolean variable index. dataframe.to_excel('resultExcel.xlsx',index=False)
Output:
2)Excel File with Index
Approach:
- Import the tabula module using the import keyword.
- Pass the given pdf file path and number of pages as an argument to the read_pdf() function of the tabula module and store that dataframe to a variable.
- Convert the data frame to excel using the to_excel() function by passing the arguments output excel file path and boolean variable index here by default the index value is True.
- The Exit of the Program.
Below is the Implementation:
# import the tabula module using the import keyword import tabula # Pass the given pdf file path and number of pages as an argument to the read_pdf() function # of the tabula module and store that dataframe to a variable. dataframe = tabula.read_pdf("samplePdf.pdf", pages = 1)[0] #Convert the data frame to excel using the to_excel() function by passing the arguments output excel file path # and boolean variable index here by default the index value is True. dataframe.to_excel('resultExcel.xlsx')
Output:
Now that you have been provided with the information on how to convert the PDF files to Excel files using Python, So, the next time you are in a situation where you want to convert PDF files to Excel, use the methods provided here to start converting your files without any difficulty.