Convert PDF to Excel with Python


    In this article, we will see how to convert a PDF to an Excel or CSV file using Python. There are several ways to do this; here we cover two of them.

    Method 1: Using pdftables_api 

    Here we will use the pdftables_api module to convert the PDF file into another format. It’s a simple web-based API, so it can be called from any programming language.

    Installation:

    pip install git+https://github.com/pdftables/python-pdftables-api.git

    After installation, you need an API key. Go to PDFTables.com and sign up, then visit the API page to see your API key.

    To convert a PDF file into an Excel file, we will use the xlsx() method.

    Syntax:

    xlsx(pdf_path, xlsx_path)

    Below is the Implementation:

    PDF File Used:

    PDF FILE

    Python3

    import pdftables_api

    conversion = pdftables_api.Client('API KEY')

    conversion.xlsx("pdf_file_path", "output_file_path")

    Output:

    EXCEL FILE
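    The client also exposes csv(), xml() and html() methods next to xlsx(). As a minimal sketch, the target format can be chosen at run time; the helper names output_path_for and convert, and the PDFTABLES_API_KEY environment variable, are our own assumptions and not part of the library:

```python
import os
from pathlib import Path

# Formats for which the PDFTables client has a method of the same name.
SUPPORTED_FORMATS = ("xlsx", "csv", "xml", "html")


def output_path_for(pdf_path, fmt):
    """Build the output file name by swapping the .pdf suffix for the format."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return str(Path(pdf_path).with_suffix("." + fmt))


def convert(pdf_path, fmt="xlsx", api_key=None):
    """Convert one PDF with the PDFTables web API and return the output path."""
    # Imported here so the helper above works without the package installed.
    import pdftables_api

    # Assumption: the key is kept in an environment variable of our choosing.
    client = pdftables_api.Client(api_key or os.environ["PDFTABLES_API_KEY"])
    out_path = output_path_for(pdf_path, fmt)
    # Look up client.xlsx / client.csv / client.xml / client.html by name.
    getattr(client, fmt)(pdf_path, out_path)
    return out_path
```

    Calling convert("report.pdf", "csv") would then upload report.pdf and write report.csv next to it, provided the key is valid and the account has pages left.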

    Method 2: Using tabula-py

    Here we will use the tabula-py module to convert the PDF file into another format.

    Installation:

    pip install tabula-py

    Before we start, we need to install Java and add the Java installation folder to the PATH variable.

    • Install Java.
    • Add the Java installation folder (for example, C:\Program Files (x86)\Java\jre1.8.0_251\bin) to the PATH environment variable.
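    Before calling tabula, it can help to confirm that Python can actually see Java. The following is a small sketch using only the standard library; the function names are our own:

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Run `java -version` and return what it prints."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    # `java -version` writes its banner to stderr rather than stdout.
    return (result.stderr or result.stdout).strip()


java_path = find_java()
if java_path is None:
    print("java was not found on PATH; tabula-py will not be able to start it")
else:
    print("java found at", java_path)
    print(java_version_banner(java_path))
```

    If the first message appears, add the Java bin folder to PATH and open a new terminal before trying again.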

    Approach:

    • Read the PDF file using the read_pdf() method.
    • Write the resulting DataFrame to an Excel file using the to_excel() method.

    Syntax:

    read_pdf(PDF File Path, pages = Page number(s), **kwargs)

    Below is the Implementation:

    PDF File Used:

    PDF FILE

    Python3

    import tabula

    df = tabula.read_pdf("PDF File Path", pages = 1)[0]

    df.to_excel('Excel File Path')

    Output:

    EXCEL FILE
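    Indexing the result with [0] fails with a bare IndexError when no table is detected on the page. A slightly more defensive sketch is shown here; pick_table and pdf_page_to_excel are hypothetical helper names, not tabula functions:

```python
def pick_table(tables, index=0):
    """Return tables[index], with a clearer error when it does not exist."""
    if not tables:
        raise ValueError("no tables were detected on the requested page(s)")
    if not 0 <= index < len(tables):
        raise IndexError(f"only {len(tables)} table(s) detected, got index {index}")
    return tables[index]


def pdf_page_to_excel(pdf_path, xlsx_path, page=1, index=0):
    """Read one page with tabula and write the chosen table to Excel."""
    # Imported here so pick_table can be used without tabula installed.
    import tabula

    tables = tabula.read_pdf(pdf_path, pages=page)  # list of DataFrames
    # index=False keeps the pandas row numbers out of the spreadsheet.
    pick_table(tables, index).to_excel(xlsx_path, index=False)
```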


    PDF is not the most convenient format for exchanging data, but sometimes you need to extract tables (or text) from it. This Python script is especially useful if you regularly have to extract data from PDF files that share the same layout.

    Let's start by importing the libraries we will use: pandas (for writing tables to CSV/Excel) and tabula, which does the actual work with PDF files. To install it, run "pip install tabula-py" from the command line; tabula also requires Java (https://www.java.com/en/download/).

    import tabula
    import pandas as pd

    Set the path to the PDF file from which the tabular data should be extracted:

    pdf_in = "D:/Folder/File.pdf"

    Extract all tables from the file into the variable PDF. read_pdf returns them as a list of DataFrames, one per detected table.

    PDF = tabula.read_pdf(pdf_in, pages='all', multiple_tables=True)
    

    pages='all' and multiple_tables=True are optional parameters.

    Next, set the paths of the Excel/CSV files we want to produce:

    pdf_out_xlsx = "D:/Temp/From_PDF.xlsx"
    pdf_out_csv = "D:/Temp/From_PDF.csv"

    To save to .xlsx, we combine the list of tables into a single pandas DataFrame and use pandas.DataFrame.to_excel:

    PDF = pd.concat(PDF, ignore_index=True)
    PDF.to_excel(pdf_out_xlsx, index=False)

    To save to CSV, we can use convert_into from tabula:

    tabula.convert_into(pdf_in, pdf_out_csv, output_format="csv", pages='all')
    print("Done")

    The complete script:

    # Script to export tables from PDF files
    # Requirements:
    # Pandas (cmd --> pip install pandas)
    # Java   (https://www.java.com/en/download/)
    # Tabula (cmd --> pip install tabula-py)
    # openpyxl (cmd --> pip install openpyxl) to export to Excel from pandas dataframe
    
    import tabula
    import pandas as pd
    
    # Path to input PDF file
    pdf_in = "D:/Folder/File.pdf" #Path to PDF
    
    # pages and multiple_tables are optional attributes
    # outputs df as list
    PDF = tabula.read_pdf(pdf_in, pages='all', multiple_tables=True)
    
    # View result
    print('\nTables from PDF file\n' + str(PDF))
    
    # CSV and Excel save paths
    pdf_out_xlsx = "D:/Temp/From_PDF.xlsx"
    pdf_out_csv = "D:/Temp/From_PDF.csv"
    
    # to Excel: combine the list of tables into one DataFrame first
    PDF = pd.concat(PDF, ignore_index=True)
    PDF.to_excel(pdf_out_xlsx, index=False)
    
    # to CSV
    tabula.convert_into(pdf_in, pdf_out_csv, output_format="csv", pages='all')
    print("Done")
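    When the tables in one PDF have different columns, stacking them into a single sheet can give a ragged result. One alternative, sketched here under the assumption that openpyxl is installed, writes each table to its own worksheet with pandas.ExcelWriter. The names sheet_names and tables_to_workbook are our own:

```python
def sheet_names(count, prefix="Table"):
    """Generate worksheet names Table_1, Table_2, ... for `count` tables."""
    return [f"{prefix}_{number}" for number in range(1, count + 1)]


def tables_to_workbook(tables, xlsx_path):
    """Write every DataFrame in `tables` to a separate sheet of one workbook."""
    if not tables:
        # A workbook needs at least one sheet, so stop before creating it.
        raise ValueError("no tables to write")

    # Imported here so sheet_names can be used without pandas installed.
    import pandas as pd

    with pd.ExcelWriter(xlsx_path) as writer:
        for name, table in zip(sheet_names(len(tables)), tables):
            table.to_excel(writer, sheet_name=name, index=False)
```

    With the variables from the script above, this would be called as tables_to_workbook(tabula.read_pdf(pdf_in, pages='all'), pdf_out_xlsx).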
    
    

    Last updated on July 14, 2022

    In this tutorial, we’ll take a look at how to convert PDF to Excel with Python. If you work with data, the chances are that you have had, or will have to deal with data stored in a .pdf file. It’s difficult to copy a table from PDF and paste it directly into Excel. In most cases, what we copy from the PDF file is text, instead of formatted Excel tables. Therefore, when pasting the data into Excel, we see a chunk of text squeezed into one cell.

    Of course, we don’t want to copy and paste individual values one by one into Excel. Several commercial tools offer PDF to Excel conversion, but they charge a hefty fee. If you are willing to learn a little bit of Python, it takes less than 10 lines of code to achieve a reasonably good result.

    We’ll extract the COVID-19 cases by country from the WHO’s website. I’m attaching it here in case the source file gets removed later.

    Step 1. Install Python library and Java

    tabula-py is a Python wrapper of tabula-java, which can read tables in PDF files. This means that we need to install Java first. The installation takes about 1 minute, and you can follow this link to find the Java installation file for your operating system: https://java.com/en/download/help/download_options.xml.

    Once you have Java, install tabula-py with pip:

    pip install tabula-py

    We are going to extract the table on page 3 of the PDF file. tabula.read_pdf() returns a list of dataframes. For some reason, tabula detected 8 tables on this page. Looking through them, we see that the second table is the one we want to extract, so we select the second element of that list with [1].

    import tabula
    df = tabula.read_pdf('data.pdf', pages = 3, lattice = True)[1]
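    To decide which index to use without printing every table in full, we can list the size of each detected table first. This is only a sketch; summarize_tables is a hypothetical helper that relies on nothing but the .shape attribute of each dataframe:

```python
def summarize_tables(tables):
    """Return (index, row count, column count) for each detected table."""
    return [(i, table.shape[0], table.shape[1]) for i, table in enumerate(tables)]


def print_summary(tables):
    """Print one line per table so the real one is easy to spot."""
    for index, rows, cols in summarize_tables(tables):
        print(f"table {index}: {rows} rows x {cols} columns")


# Usage with the file from this tutorial (requires tabula-py and Java):
#   import tabula
#   tables = tabula.read_pdf('data.pdf', pages=3, lattice=True)
#   print_summary(tables)
```

    The table we are after is usually the one with by far the most rows; the remaining entries tend to be small fragments of the page layout.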

    If this is your first time installing Java and tabula-py, you might get the following error message when running the above 2 lines of code:

    tabula.errors.JavaNotFoundError: `java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`

    This happens because the Java folder is not in the PATH system variable. Simply add your Java installation folder to the PATH variable. I used the default installation, so the Java folder is C:\Program Files (x86)\Java\jre1.8.0_251\bin on my laptop.

    Add Java to PATH

    Now the script should run.

    By default, tabula-py will extract tables from PDF file into a pandas dataframe. Let’s take a look at the data by inspecting the first 10 rows with .head(10):

    Table extracted from PDF

    We immediately see two problems with this unprocessed table: the header row contains stray “\r” (carriage return) characters, and there are many NaN values. We’ll have to do a little more cleanup to make the data useful.

    Step 2. Clean up the header row

    Let’s first clean up the header row. df.columns returns the dataframe header names.

    Dataframe header

    We can replace the “\r” in the header by doing the following:

    df.columns = df.columns.str.replace('\r', ' ')

    The .str accessor exposes string methods on the header values, so .replace() can swap each “\r” for a space. We then assign the cleaned strings back to the dataframe’s header (columns).
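    Headers extracted from a PDF can also carry doubled or trailing spaces. A slightly broader cleanup is sketched below in plain Python; the column names are made up for illustration and clean_header is our own helper:

```python
import re


def clean_header(name):
    """Collapse carriage returns, newlines and repeated spaces into one space."""
    # \s matches "\r" and "\n" as well as ordinary spaces and tabs.
    return re.sub(r"\s+", " ", str(name)).strip()


# Hypothetical header values as they might come out of a PDF table.
raw_columns = ["Reporting\rCountry", "Total\rconfirmed\rcases", " Days since  last case "]
cleaned = [clean_header(name) for name in raw_columns]
print(cleaned)

# Applied to a dataframe this becomes:
#   df.columns = [clean_header(name) for name in df.columns]
```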

    Step 3. Remove NaN values

    Next, we’ll clean up the NaN values, which tabula.read_pdf() creates whenever a cell is blank. These values cause trouble during data analysis, so most of the time we’ll remove them. Glancing through the table, it appears we can remove the rows that contain NaN values without losing any data points. Luckily for us, pandas provides a convenient way to remove rows with NaN values.

    data = df.dropna()
    data.to_excel('data.xlsx')
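    Note that dropna() with no arguments removes a row as soon as any single cell in it is NaN. If a table has genuinely empty cells in otherwise valid rows, dropna(how="all") may be the safer choice. The toy dataframe below is invented purely to show the difference:

```python
import pandas as pd

# Row 1 is completely blank; row 2 has a country but no case count.
df = pd.DataFrame(
    {
        "Country": ["A", None, "B", "C"],
        "Cases": [10, None, None, 30],
    }
)

# Drop only rows in which every cell is NaN.
without_blank_rows = df.dropna(how="all")

# Drop rows in which at least one cell is NaN (the default, how="any").
complete_rows = df.dropna()

print(len(without_blank_rows), "rows left with how='all'")
print(len(complete_rows), "rows left with the default")
```

    Here the default drops country B as well, so it is worth checking the row count before and after when the source table has gaps.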

    Clean dataframe

    Putting it all together

    import tabula
    df = tabula.read_pdf('data.pdf', pages = 3, lattice = True)[1]
    
    df.columns = df.columns.str.replace('\r', ' ')
    data = df.dropna()
    data.to_excel('data.xlsx')

    Now you see, it takes only 5 lines of code to convert PDF to Excel with Python. It’s simple and powerful. The best part? You control what you want to extract, keep, and change!

    Updated February 2019

    You can convert your PDF to Excel, CSV, XML or HTML with Python using the PDFTables API. Our API will enable you to convert PDFs without uploading each one manually.

    In this tutorial, I’ll be showing you how to get the library set up on your local machine and then use it to convert PDF to Excel, with Python.

    Here’s an example of a PDF that I’ve converted with the library. In order to properly test the library, make sure you have a PDF handy!

    Step 1

    If you haven’t already, install Anaconda on your machine from the Anaconda website. You can use either Python 3.6.x or 2.7.x, as the PDFTables API works with both. Installing Anaconda also installs pip, which gives a simple way to install the PDFTables API Python package.

    For this tutorial, I’ll be using the Windows Python IDLE Shell, but the instructions are almost identical for Linux and Mac.

    Step 2

    In your terminal/command line, install the PDFTables Python library with:

    pip install git+https://github.com/pdftables/python-pdftables-api.git

    If git is not recognised, download it here. Then, run the above command again.

    Or if you’d prefer to install it manually, you can download it from python-pdftables-api then install it with:

    python setup.py install

    Step 3

    Create a new Python script then add the following code:

    import pdftables_api
    
    c = pdftables_api.Client('my-api-key')
    c.xlsx('input.pdf', 'output') 
    #replace c.xlsx with c.csv to convert to CSV
    #replace c.xlsx with c.xml to convert to XML
    #replace c.xlsx with c.html to convert to HTML

    Now, you’ll need to make the following changes to the script:

    • Replace my-api-key with your PDFTables API key, which you can get here.
    • Replace input.pdf with the PDF you would like to convert.
    • Replace output with the name you’d like to give the converted document.

    Now, save your finished script as convert-pdf.py in the same directory as the PDF document you’d like to convert.

    PDF and Python script in the conversion directory

    If you don’t understand the script above, see the script overview section.

    Step 4

    Open your command line/terminal and change your directory (e.g. cd C:/Users/Bob) to the folder you saved your convert-pdf.py script and PDF in, then run the following command:

    python convert-pdf.py

    To find your converted spreadsheet, navigate to the folder in your file explorer and hey presto, you’ve converted a PDF to Excel or CSV with Python!

    Converted Excel spreadsheet in its directory

    Script overview

    The first line imports the PDFTables API library so that Python knows what to do when its functions are called. The second line creates a PDFTables API client with your unique API key. This tells us here at PDFTables which account is using the API and how many PDF pages it has available. Finally, the third line tells Python to convert the file named input.pdf to xlsx and to name the result output. To convert to CSV, XML or HTML, simply change c.xlsx to c.csv, c.xml or c.html respectively.
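    The same three lines extend naturally to a whole folder of PDFs. The following is a rough sketch rather than an official recipe; pdfs_in and convert_folder are our own helper names, and each converted page still counts against your PDFTables allowance:

```python
from pathlib import Path


def pdfs_in(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    return sorted(Path(folder).glob("*.pdf"))


def convert_folder(folder, api_key):
    """Convert every PDF in `folder` to xlsx and return the output paths."""
    # Imported here so pdfs_in can be used without the package installed.
    import pdftables_api

    client = pdftables_api.Client(api_key)
    outputs = []
    for pdf in pdfs_in(folder):
        out_path = pdf.with_suffix(".xlsx")  # report.pdf -> report.xlsx
        client.xlsx(str(pdf), str(out_path))
        outputs.append(out_path)
    return outputs
```

    For example, convert_folder(".", "my-api-key") would convert every PDF in the current directory.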


    In this article we will see how to quickly extract a table from a PDF to Excel.

    For this tutorial you will need two Python libraries :

    • tabula-py
    • pandas

    To install them, go to your terminal/shell and type these lines of code:

    pip install tabula-py
    pip install pandas

    If you use Google Colab, you can install these libraries directly on it. You just have to add an exclamation mark “!” in front of it, like this:

    !pip install tabula-py
    !pip install pandas


    PDF to Excel (one table only)

    First we load the libraries into our text editor :

    import tabula
    import pandas as pd

    Then, we will read the pdf with the read_pdf() function of the tabula library.

    This function automatically detects the tables in a pdf and converts them into DataFrames. Ideal for converting them into Excel files!

    df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]

    We can then check that the table has the expected shape.

    df.head()

    Then convert it to an Excel file !

    df.to_excel('file_path/file.xlsx')

    The entire code :


    import tabula
    import pandas as pd
    df = tabula.read_pdf('file_path/file.pdf', pages = 'all')[0]
    df.to_excel('file_path/file.xlsx')


    PDF containing several tables

    We load the libraries in our text editor :

    import tabula
    import pandas as pd

    Then, we will read the pdf with the read_pdf() function of the tabula library.

    This function automatically detects the tables in a pdf and converts them into DataFrames, which are then easy to write out as Excel files.

    Here, the variable df is in fact a list of DataFrames. The first element corresponds to the first table, the second to the second table, etc.

    df = tabula.read_pdf('file_path/file.pdf', pages = 'all')

    To save these tables separately, you will have to use a for loop that will save each table in an Excel file.

    for i in range(len(df)):
        df[i].to_excel('file_'+str(i)+'.xlsx')

    The entire code :

    import tabula
    import pandas as pd
    df = tabula.read_pdf('file_path/file.pdf', pages = 'all')
    
    for i in range(len(df)):
        df[i].to_excel('file_'+str(i)+'.xlsx')
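    tabula sometimes reports empty tables alongside the real ones, and the loop above would write an empty workbook for each of them. One possible variation skips those and numbers the remaining files from 1; keep_non_empty and save_tables are hypothetical helper names:

```python
def keep_non_empty(tables):
    """Filter out tables that contain no rows or no columns."""
    # DataFrame.empty is True when the frame has no elements at all.
    return [table for table in tables if not table.empty]


def save_tables(tables, stem="file"):
    """Write each non-empty table to <stem>_<n>.xlsx and return the file names."""
    saved = []
    for number, table in enumerate(keep_non_empty(tables), start=1):
        name = f"{stem}_{number}.xlsx"
        table.to_excel(name, index=False)  # index=False omits the row numbers
        saved.append(name)
    return saved


# Usage (requires tabula-py, Java and openpyxl):
#   import tabula
#   save_tables(tabula.read_pdf('file_path/file.pdf', pages='all'))
```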
