Convert PDF to Excel with Python - Word and Excel - help with working with programs


In this article, we will see how to convert a PDF to an Excel or CSV file using Python. There are several ways to do this; here we will look at two of them.

Method 1: Using pdftables_api

Here we will use the pdftables_api module to convert the PDF file into another format. It’s a simple web-based API, so it can be called from any programming language.

Installation:

pip install git+https://github.com/pdftables/python-pdftables-api.git

After installation, you need an API key. Go to PDFTables.com and sign up, then visit the API page to see your API key.

To convert a PDF file into an Excel file we will use the xlsx() method.

Syntax:

xlsx(pdf_path, xlsx_path)

Below is the Implementation:

PDF File Used:

PDF FILE

Python3

import pdftables_api

conversion = pdftables_api.Client('API KEY')

conversion.xlsx("pdf_file_path", "output_file_path")

Output:

EXCEL FILE
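
As a minimal sketch of the same call, the API key can be read from the environment instead of being hard-coded. The variable name PDFTABLES_API_KEY and both helper functions are assumptions made for this example, not part of the pdftables_api package.

```python
import os
from pathlib import Path


def xlsx_path_for(pdf_path):
    """Return the path of an .xlsx file next to the given PDF."""
    return str(Path(pdf_path).with_suffix(".xlsx"))


def convert_to_xlsx(pdf_path, api_key=None):
    """Convert one PDF to Excel through the PDFTables API and return the output path."""
    import pdftables_api  # imported here so the path helper works without the package

    # PDFTABLES_API_KEY is a hypothetical variable name chosen for this sketch.
    api_key = api_key or os.environ["PDFTABLES_API_KEY"]
    client = pdftables_api.Client(api_key)
    output_path = xlsx_path_for(pdf_path)
    client.xlsx(pdf_path, output_path)
    return output_path


print(xlsx_path_for("report.pdf"))
```

Calling convert_to_xlsx("report.pdf") would then write report.xlsx next to the input, provided the key is set and the account has pages available.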

Method 2: Using tabula-py

Here we will use the tabula-py module to extract tables from the PDF file and save them in another format.

Installation:

pip install tabula-py

Before we start, we need to install Java and add the Java installation folder to the PATH variable.

Install Java.
Add the Java installation folder (C:\Program Files (x86)\Java\jre1.8.0_251\bin) to the PATH environment variable.
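
Whether Java is reachable can be checked from Python before tabula-py is used. The following is a small sketch using only the standard library; the helper names are made up for this example.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Run `java -version` and return the first line it prints (usually on stderr)."""
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else ""


java_path = find_java()
if java_path is None:
    print("java was not found on PATH; tabula-py will not be able to start")
else:
    print("Found java at", java_path)
    print(java_version_banner(java_path))
```

If the first message is printed, the PATH variable does not yet include the Java bin folder for the process that runs Python.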

Approach:

Read the PDF file using the read_pdf() method.
Then convert the extracted table into an Excel file using the to_excel() method.

Syntax:

read_pdf(input_path, pages=page_numbers, **kwargs)

Below is the Implementation:

PDF File Used:

PDF FILE

Python3

import tabula

df = tabula.read_pdf("PDF File Path", pages = 1)[0]

df.to_excel('Excel File Path')

Output:

EXCEL FILE

Source

PDF is not the most convenient format for exchanging data, but sometimes you need to extract tables (or text) from it. This Python script is especially useful if you regularly need to extract data from PDF files that share the same layout.

Let's start by importing the libraries we will use: pandas (for writing tables to CSV/Excel) and tabula, which does the actual work with PDF files. (Installation: run "pip install tabula-py" in cmd; Java must also be installed, see https://www.java.com/en/download/.)

import tabula
import pandas as pd

Set the path to the PDF file from which the tabular data should be extracted:

pdf_in = "D:/Folder/File.pdf"

Extract all tables from the file into the variable PDF. read_pdf returns them as a list of pandas DataFrames, one per detected table.

PDF = tabula.read_pdf(pdf_in, pages='all', multiple_tables=True)

pages='all' and multiple_tables=True are optional parameters.

Next, set the paths of the Excel/CSV files we want to produce:

pdf_out_xlsx = "D:/Temp/From_PDF.xlsx"
pdf_out_csv = "D:/Temp/From_PDF.csv"

To save to .xlsx, we combine the DataFrames in the list into a single pandas DataFrame with pandas.concat and write it with pandas.DataFrame.to_excel:

PDF = pd.concat(PDF, ignore_index=True)
PDF.to_excel(pdf_out_xlsx, index=False)
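
When the tables in a PDF have different columns, it may be preferable to keep them apart. The sketch below writes each table to its own sheet with pandas.ExcelWriter; the sheet names ("Table_1", ...) and both helper functions are assumptions for this example, and the in-memory DataFrames stand in for the list returned by read_pdf.

```python
import pandas as pd


def tables_to_sheets(tables):
    """Map a sheet name ("Table_1", "Table_2", ...) to each non-empty DataFrame."""
    sheets = {}
    for number, table in enumerate(tables, start=1):
        if not table.empty:
            sheets[f"Table_{number}"] = table
    return sheets


def write_workbook(sheets, xlsx_path):
    """Write each DataFrame to its own sheet (pandas uses openpyxl for .xlsx files)."""
    with pd.ExcelWriter(xlsx_path) as writer:
        for name, table in sheets.items():
            table.to_excel(writer, sheet_name=name, index=False)


# Stand-ins for the list that tabula.read_pdf() returns
tables = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame(), pd.DataFrame({"b": [3]})]
sheets = tables_to_sheets(tables)
print(list(sheets))
```

Calling write_workbook(sheets, pdf_out_xlsx) would then produce one workbook with a sheet per non-empty table.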

To save to CSV, we can use convert_into from tabula.

tabula.convert_into(pdf_in, pdf_out_csv, output_format="csv", pages='all')
print("Done")

The complete script:

# Script to export tables from PDF files
# Requirements:
# Pandas (cmd --> pip install pandas)
# Java   (https://www.java.com/en/download/)
# Tabula (cmd --> pip install tabula-py)
# openpyxl (cmd --> pip install openpyxl) to export to Excel from pandas dataframe

import tabula
import pandas as pd

# Path to input PDF file
pdf_in = "D:/Folder/File.pdf" #Path to PDF

# pages and multiple_tables are optional attributes
# outputs df as list
PDF = tabula.read_pdf(pdf_in, pages='all', multiple_tables=True)

# View result
print('\nTables from PDF file\n' + str(PDF))

# CSV and Excel save paths
pdf_out_xlsx = "D:/Temp/From_PDF.xlsx"
pdf_out_csv = "D:/Temp/From_PDF.csv"

# to Excel: combine the list of DataFrames into one sheet
PDF = pd.concat(PDF, ignore_index=True)
PDF.to_excel(pdf_out_xlsx, index=False)

# to CSV: convert_into reads the PDF again and writes the CSV directly
tabula.convert_into(pdf_in, pdf_out_csv, output_format="csv", pages='all')
print("Done")

Source

Last Updated on July 14, 2022 by

In this tutorial, we’ll take a look at how to convert PDF to Excel with Python. If you work with data, the chances are that you have had, or will have to deal with data stored in a .pdf file. It’s difficult to copy a table from PDF and paste it directly into Excel. In most cases, what we copy from the PDF file is text, instead of formatted Excel tables. Therefore, when pasting the data into Excel, we see a chunk of text squeezed into one cell.

Of course, we don’t want to copy and paste individual values one by one into Excel. Several commercial tools can convert PDF to Excel, but they charge a hefty fee. If you are willing to learn a little bit of Python, it takes fewer than 10 lines of code to achieve a reasonably good result.

We’ll extract the COVID-19 cases by country from the WHO’s website. I’m attaching it here in case the source file gets removed later.

Step 1. Install Python library and Java

tabula-py is a Python wrapper of tabula-java, which can read tables in PDF files. This means that we need to install Java first. The installation takes about 1 minute, and you can follow this link to find the Java installation file for your operating system: https://java.com/en/download/help/download_options.xml.

Once you have Java, install tabula-py with pip:

pip install tabula-py

We are going to extract the table on page 3 of the PDF file. tabula.read_pdf() returns a list of dataframes. For some reason, tabula detected 8 tables on this page. Looking through them, we see that the second table is the one we want to extract, so we take the second element of that list using [1].

import tabula
df = tabula.read_pdf('data.pdf', pages = 3, lattice = True)[1]
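
Since read_pdf() returns a list, one way to find the right index is to print the size of every detected table first. This is only a sketch: summarize_tables and print_page_summary are hypothetical helpers, and the demo list stands in for real tabula output.

```python
import pandas as pd


def summarize_tables(tables):
    """Return (index, rows, columns) for every DataFrame in the list."""
    return [(i, t.shape[0], t.shape[1]) for i, t in enumerate(tables)]


def print_page_summary(pdf_path, page):
    """Print the size of each table tabula detects on one page (needs Java)."""
    import tabula  # imported lazily; only needed when a real PDF is read

    tables = tabula.read_pdf(pdf_path, pages=page, lattice=True)
    for index, rows, cols in summarize_tables(tables):
        print(f"table {index}: {rows} rows x {cols} columns")


# Stand-in data so the helper can be tried without a PDF
demo = [pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [1, 2], "b": [3, 4]})]
print(summarize_tables(demo))
```

With the real file, print_page_summary('data.pdf', 3) would list all detected tables, and the largest one is usually the table of interest.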

If this is your first time installing Java and tabula-py, you might get the following error message when running the above 2 lines of code:

tabula.errors.JavaNotFoundError: `java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`

This happens because the Java folder is not in the PATH system variable. Simply add your Java installation folder to the PATH variable. I used the default installation, so the Java folder is C:\Program Files (x86)\Java\jre1.8.0_251\bin on my laptop.

Add Java to PATH

Now the script should run.

By default, tabula-py will extract tables from PDF file into a pandas dataframe. Let’s take a look at the data by inspecting the first 10 rows with .head(10):

Table extracted from PDF

We immediately see two problems with this unprocessed table: the header row contains stray “\r” (carriage return) characters, and there are many NaN values. We’ll have to do a little more cleanup to make the data useful.

Step 2. Clean up the header row

Let’s first clean up the header row. df.columns returns the dataframe header names.

Dataframe header

We can replace the “\r” in the header by doing the following:

df.columns = df.columns.str.replace('\r', ' ')

.str exposes string methods on the header values, so we can call .replace() to replace “\r” with a space. Then we assign the cleaned strings back to the dataframe’s header (columns).
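
The header cleanup can be tried on a small stand-in table, assuming the unwanted characters are carriage returns ("\r"). The column names below are invented for this sketch and are not taken from the WHO file.

```python
import pandas as pd

# A small stand-in for an extracted table whose headers contain carriage returns
df = pd.DataFrame([[1, 2]], columns=["Total\rcases", "New\rdeaths"])

# regex=False treats the pattern as a literal string rather than a regular expression
df.columns = df.columns.str.replace("\r", " ", regex=False).str.strip()
print(list(df.columns))
```

Passing regex=False makes the intent explicit and avoids surprises if the pattern ever contains characters that are special in regular expressions.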

Step 3. Remove NaN values

Next, we’ll clean up the NaN values, which tabula.read_pdf() creates whenever a cell is blank. These values cause trouble during data analysis, so most of the time we remove them. Glancing through the table, it appears we can remove the rows that contain NaN values without losing any data points. Luckily, pandas provides a convenient way to remove rows with NaN values.

data = df.dropna()
data.to_excel('data.xlsx')
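
By default, dropna() removes every row that has at least one NaN, which can discard real data in other tables. The sketch below contrasts the default with how="all" on invented data, so the difference can be checked before applying it to an extracted table.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["A", None, "B"],
    "cases": [10, None, None],
})

any_missing = df.dropna()           # drops rows with at least one NaN
all_missing = df.dropna(how="all")  # drops only rows where every value is NaN
print(len(df), len(any_missing), len(all_missing))
```

Here the row for country "B" survives how="all" but not the default, so the stricter default is only safe when partially filled rows carry no useful values.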

Clean dataframe

Putting it all together

import tabula
df = tabula.read_pdf('data.pdf', pages = 3, lattice = True)[1]

df.columns = df.columns.str.replace('\r', ' ')
data = df.dropna()
data.to_excel('data.xlsx')

Now you see, it takes only 5 lines of code to convert PDF to Excel with Python. It’s simple and powerful. The best part? You control what you want to extract, keep, and change!

Source

Updated February 2019

You can convert your PDF to Excel, CSV, XML or HTML with Python using the PDFTables API. Our API will enable you to convert PDFs without uploading each one manually.

In this tutorial, I’ll be showing you how to get the library set up on your local machine and then use it to convert PDF to Excel, with Python.

Here’s an example of a PDF that I’ve converted with the library. In order to properly test the library, make sure you have a PDF handy!

Step 1

If you haven’t already, install Anaconda on your machine from Anaconda website. You can use either Python 3.6.x or 2.7.x, as the PDFTables API works with both. Downloading Anaconda means that pip will also be installed. Pip gives a simple way to install the PDFTables API Python package.

For this tutorial, I’ll be using the Windows Python IDLE Shell, but the instructions are almost identical for Linux and Mac.

Step 2

In your terminal/command line, install the PDFTables Python library with:

pip install git+https://github.com/pdftables/python-pdftables-api.git

If git is not recognised, download it here. Then, run the above command again.

Or if you’d prefer to install it manually, you can download it from python-pdftables-api then install it with:

python setup.py install

Step 3

Create a new Python script then add the following code:

import pdftables_api

c = pdftables_api.Client('my-api-key')
c.xlsx('input.pdf', 'output') 
#replace c.xlsx with c.csv to convert to CSV
#replace c.xlsx with c.xml to convert to XML
#replace c.xlsx with c.html to convert to HTML

Now, you’ll need to make the following changes to the script:

Replace my-api-key with your PDFTables API key, which you can get here.
Replace input.pdf with the PDF you would like to convert.
Replace output with the name you’d like to give the converted document.

Now, save your finished script as convert-pdf.py in the same directory as the PDF document you’d like to convert.

If you don’t understand the script above, see the script overview section.

Step 4

Open your command line/terminal and change your directory (e.g. cd C:/Users/Bob) to the folder you saved your convert-pdf.py script and PDF in, then run the following command:

python convert-pdf.py

To find your converted spreadsheet, navigate to the folder in your file explorer and hey presto, you’ve converted a PDF to Excel or CSV with Python!

Script overview

The first line simply imports the PDFTables API toolset, so that Python knows what to do when certain actions are called. The second line calls the PDFTables API with your unique API key. This means that we at PDFTables know which account is using the API and how many PDF pages are available. Finally, the third line tells Python to convert the file named input.pdf to xlsx and what the output should be called: output. To convert to CSV, XML or HTML, simply change c.xlsx to c.csv, c.xml or c.html respectively.
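
Because the four conversion methods share the same call shape, the format can also be chosen at run time. The following is a sketch under that assumption: convert() and RecordingClient are hypothetical helpers, and RecordingClient only records calls so the example runs without an API key.

```python
from pathlib import Path

SUPPORTED_FORMATS = ("xlsx", "csv", "xml", "html")


def convert(client, pdf_path, output_format="xlsx"):
    """Call client.xlsx / client.csv / client.xml / client.html by format name."""
    if output_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {output_format}")
    output_path = str(Path(pdf_path).with_suffix("." + output_format))
    getattr(client, output_format)(pdf_path, output_path)
    return output_path


class RecordingClient:
    """Stand-in for pdftables_api.Client that records calls instead of using the API."""

    def __init__(self):
        self.calls = []

    def xlsx(self, pdf_path, output_path):
        self.calls.append(("xlsx", pdf_path, output_path))

    def csv(self, pdf_path, output_path):
        self.calls.append(("csv", pdf_path, output_path))


fake = RecordingClient()
print(convert(fake, "input.pdf", "csv"))
```

With the real library, the first argument would be pdftables_api.Client('my-api-key') instead of the stand-in.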

Source

In this article we will see how to quickly extract a table from a PDF to Excel.

For this tutorial you will need two Python libraries :

tabula-py
pandas

To install them, go to your terminal/shell and type these lines of code:

pip install tabula-py
pip install pandas

If you use Google Colab, you can install these libraries directly on it. You just have to add an exclamation mark “!” in front of it, like this:

!pip install tabula-py
!pip install pandas

PDF to Excel (one table only)

First we load the libraries into our text editor :

import tabula
import pandas as pd

Then, we will read the pdf with the read_pdf() function of the tabula library.

This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal for converting them into Excel files!

df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]

We can then check that the table has the expected shape.

df.head()

Then convert it to an Excel file !

df.to_excel('file_path/file.xlsx')
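
Indexing with [0] assumes that at least one table was detected; on a PDF without recognizable tables it fails with an IndexError. A small guard, sketched here with hypothetical helper names, gives a clearer message.

```python
def first_table(tables, source="the PDF"):
    """Return the first detected table or raise a descriptive error."""
    if not tables:
        raise ValueError(f"no tables were detected in {source}")
    return tables[0]


def read_first_table(pdf_path):
    """Read a PDF with tabula (needs Java) and return its first table."""
    import tabula  # imported lazily; only needed when a real PDF is read

    return first_table(tabula.read_pdf(pdf_path, pages="all"), source=pdf_path)


print(first_table(["table 0", "table 1"]))
```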

The entire code :

import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
df.to_excel('file_path/file.xlsx')

PDF containing several tables

We load the libraries in our text editor :

import tabula
import pandas as pd

Then, we will read the pdf with the read_pdf() function of the tabula library.

This function automatically detects the tables in a PDF and converts them into DataFrames, which makes them easy to save as Excel files.

Here, the variable df is in fact a list of DataFrames. The first element corresponds to the first table, the second to the second table, and so on.

df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

To save these tables separately, you will have to use a for loop that will save each table in an Excel file.

for i in range(len(df)):
    df[i].to_excel('file_'+str(i)+'.xlsx')

The entire code :

import tabula
import pandas as pd
df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

for i in range(len(df)):
    df[i].to_excel('file_'+str(i)+'.xlsx')

Source