Python from PDF to Word - Word and Excel - help with using the programs

Method #2). Convert PDFs to Word Using GroupDocs Python SDK

GroupDocs provides a Cloud SDK for Python that converts PDF to Word. Unlike the PyPDF2 library covered earlier, it can process a richly formatted PDF and retain the original formatting in the converted Word file. No extra desktop software is needed, although the conversion runs on the GroupDocs cloud service, so you need an account and an internet connection.

Saving a PDF as Word is not a trivial task, but this Python module does the heavy lifting for you. The tutorial below walks through the whole process of converting PDF to Word with the GroupDocs Python SDK, even if you are a novice.

Step 1: Get your APP SID and APP KEY. Simply sign up for free with https://dashboard.groupdocs.cloud and once you do so, you should be able to find the APP SID and APP KEY in the “My Apps” tab under the “Manage My Apps” sub-tab. These are necessary for the success of the process so have them ready.

Step 2: Create a directory and place the PDF file in it. For convenience purposes, it is advisable to create a fresh directory with a preferred name and then place the target PDF document inside.

Step 3: Install the necessary GroupDocs package. To achieve this, we are going to install the “groupdocs-conversion-cloud” Python package using the command line. So, from the file explorer address bar, type in the word “cmd” and hit the “Enter” key. In the resulting command-driven interface, type in the command below and hit the “Enter” key on the keyboard.

pip install groupdocs-conversion-cloud

The installation will not take ages to complete and you will be heading to the next step in a matter of moments.

Step 4: Create the required Python script. In the same directory as the PDF file, create a new “.py” script with the code below that will be responsible for the success of the process to save PDF as Word. Ensure that you replace the highlighted “app_sid” and “app_key” values with what you were assigned when you signed up. At the same time, ensure you replace the highlighted “filename” with the name of your PDF file. For convenience, save the Python script with the filename “groupdocs.py”

# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxx-xxxx-xxxx-xxxx-xxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instances of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:
    # Upload the source file to storage
    filename = 'filename.pdf'
    remote_name = 'filename.pdf'
    output_name = 'filename.docx'
    strformat = 'docx'

    request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name, filename)
    response_upload = file_api.upload_file(request_upload)

    # Convert the PDF to a Word document
    settings = groupdocs_conversion_cloud.ConvertSettings()
    settings.file_path = remote_name
    settings.format = strformat
    settings.output_path = output_name

    request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
    response = convert_api.convert_document(request)

    print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
    print("Exception when calling the conversion API: {0}".format(e.message))

Double-check that you have made the necessary changes as required before saving the Python script. Once you have confirmed the highlights, head over to the next step. As a quick summary, this script will import the needed Python package, initialize the API, upload the source PDF file, convert the PDF to Word, and then deliver the output information.
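Hard-coding the APP SID and APP KEY in the script makes it easy to leak them when the file is shared. A minimal sketch of one alternative, assuming you export both values as environment variables first. The variable names GROUPDOCS_APP_SID and GROUPDOCS_APP_KEY and the helper load_credentials are hypothetical and are not part of the SDK:

```python
import os


def load_credentials(env=None):
    """Return (app_sid, app_key) read from environment-style variables.

    The variable names are an assumption made for this sketch;
    the GroupDocs SDK itself does not read them.
    """
    env = os.environ if env is None else env
    try:
        return env["GROUPDOCS_APP_SID"], env["GROUPDOCS_APP_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"Missing credential variable: {missing}") from None
```

The returned pair can then be passed to ConvertApi.from_keys(app_sid, app_key) and FileApi.from_keys(app_sid, app_key) in the same way as the hard-coded values.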

Step 5: Run the Python script. It is now time to let the script work its magic by running it from the command line that we opened. To do that, type the command below, hit the “Enter” key, and be patient for it to complete.

python groupdocs.py

To confirm that the process has completed as expected, you should be able to see some output information with a successful conversion message, the file path, size, and Url.

Step 6: Download the converted Word document. On your preferred web browser, navigate to https://dashboard.groupdocs.cloud/ and then head over to the “My Files” tab. This will open up your “Storage” and list all the files available and it is here that you will find the uploaded PDF file and the converted Word file. Simply tick the box on the left-hand side of the DOCX file and then hit the “Download” button. You will be asked to confirm whether you are sure you want to download the checked files and all you need to do here is click the “Yes” button.

The download process will start momentarily after which you will be able to find the Word document in your default downloads folder. From there, you can open the file and perform further actions that you deem necessary. This will mark the end of your task to save PDF as Word using the GroupDocs Python SDK.

At no cost, GroupDocs gives you a reliable way to convert PDF files while retaining the original layout and formatting to a high degree. This means you should need few corrections after the conversion, and the process itself is efficient.

At the end of the day, you can comfortably take advantage of the Python programming language anytime the need to convert PDF to Word arises, all thanks to these awesome libraries featured in this article. The guides on how to tackle the task at hand ensure that you do not encounter a steep learning curve when you decide to make the most out of these tools.

Therefore, whenever you need to save a PDF as Word, you need not go to the hassle of looking for fully-fledged software when you can do it easily using Python. Pick the tool that suits you best, follow the guide on how to use it, and you should get the results you are looking for.

Source

How to convert a pdf file to docx. Is there a way of doing this using python?

I’ve seen some pages that allow the user to upload a PDF and return a DOC file, like PdfToWord

Thanks in advance

rsc05


asked Oct 14, 2014 at 10:16

If you have LibreOffice installed

lowriter --invisible --convert-to doc '/your/file.pdf'

If you want to use Python for this:

import os
import subprocess

for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            # Pass the arguments as a list so paths with spaces or quotes are handled safely
            subprocess.call(['lowriter', '--invisible', '--convert-to', 'doc', abspath])

answered Oct 14, 2014 at 10:30

This is difficult because PDFs are presentation oriented and word documents are content oriented. I have tested both and can recommend the following projects.

PyPDF2
PDFMiner

However, you are most definitely going to lose presentational aspects in the conversion.

answered Oct 14, 2014 at 10:30

ham-sandwich

If you want to convert PDF -> MS Word type file like docx, I came across this.

Ahsin Shabbir wrote:

import glob
import os
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.visible = 0

pdfs_path = ""  # folder where the .pdf files are stored
reqs_path = ""  # folder where the .docx files will be written
for doc in glob.iglob(os.path.join(pdfs_path, "*.pdf")):
    print(doc)
    filename = os.path.basename(doc)
    in_file = os.path.abspath(doc)
    print(in_file)
    wb = word.Documents.Open(in_file)
    # Same base name as the PDF, with a .docx extension
    out_file = os.path.abspath(os.path.join(reqs_path, os.path.splitext(filename)[0] + ".docx"))
    print("outfile", out_file)
    wb.SaveAs2(out_file, FileFormat=16)  # file format for docx
    print("success...")
    wb.Close()

word.Quit()

This worked like a charm for me, converted 500 pages PDF with formatting and images.

answered Apr 6, 2020 at 19:06

eleks007

You can use GroupDocs.Conversion Cloud SDK for python without installing any third-party tool or software.

Sample Python code:

# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:

        #upload source file to storage
        filename = 'Sample.pdf'
        remote_name = 'Sample.pdf'
        output_name= 'sample.docx'
        strformat='docx'

        request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name,filename)
        response_upload = file_api.upload_file(request_upload)
        #Convert PDF to Word document
        settings = groupdocs_conversion_cloud.ConvertSettings()
        settings.file_path =remote_name
        settings.format = strformat
        settings.output_path = output_name

        loadOptions = groupdocs_conversion_cloud.PdfLoadOptions()
        loadOptions.hide_pdf_annotations = True
        loadOptions.remove_embedded_files = False
        loadOptions.flatten_all_fields = True

        settings.load_options = loadOptions

        convertOptions = groupdocs_conversion_cloud.DocxConvertOptions()
        convertOptions.from_page = 1
        convertOptions.pages_count = 1

        settings.convert_options = convertOptions
        request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
        response = convert_api.convert_document(request)

        print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
        print("Exception when calling the conversion API: {0}".format(e.message))

I’m a developer evangelist at Aspose.

answered Nov 7, 2019 at 15:29

Tilal Ahmad

Based on previous answers, this was the solution that worked best for me using Python 3.7.1

import win32com.client
import os

# INPUT/OUTPUT PATH
pdf_path = r"""C:\path2pdf.pdf"""
output_path = r"""C:\output_folder"""

word = win32com.client.Dispatch("Word.Application")
word.visible = 0  # CHANGE TO 1 IF YOU WANT TO SEE WORD APPLICATION RUNNING AND ALL MESSAGES OR WARNINGS SHOWN BY WORD

# GET FILE NAME AND NORMALIZED PATH
filename = os.path.basename(pdf_path)
in_file = os.path.abspath(pdf_path)

# CONVERT PDF TO DOCX AND SAVE IT ON THE OUTPUT PATH WITH THE SAME INPUT FILE NAME
wb = word.Documents.Open(in_file)
out_file = os.path.abspath(os.path.join(output_path, os.path.splitext(filename)[0] + ".docx"))
wb.SaveAs2(out_file, FileFormat=16)
wb.Close()
word.Quit()
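Most of the mistakes in the Word automation scripts come from building the output path by hand with backslashes. A small sketch of that one step in isolation, using pathlib so that it can be tried on any operating system. The helper name docx_path_for is hypothetical, and the constant only gives a name to the value 16 that the answers pass as FileFormat (Word's wdFormatDocumentDefault):

```python
from pathlib import PureWindowsPath

WD_FORMAT_DOCUMENT_DEFAULT = 16  # the FileFormat value used in the answers above


def docx_path_for(pdf_path, output_dir):
    """Return output_dir joined with the PDF's base name and a .docx extension."""
    pdf = PureWindowsPath(pdf_path)
    return str(PureWindowsPath(output_dir) / (pdf.stem + ".docx"))


print(docx_path_for(r"C:\docs\report.pdf", r"C:\out"))
```

The result can be passed as the first argument to SaveAs2, together with FileFormat=WD_FORMAT_DOCUMENT_DEFAULT.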

answered Aug 4, 2021 at 12:33

Jonny_P

With Adobe on your machine

If you have Adobe Acrobat on your machine, you can use the following function to save the PDF file as a docx file

# Open PDF file, use Acrobat Exchange to save file as .docx file.

import win32com.client, win32com.client.makepy, os, winerror, errno, re
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

def PDF_to_Word(input_file, output_file):
    
    ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
    src = os.path.abspath(input_file)
    
    # Launch Adobe
    win32com.client.makepy.GenerateFromTypeLibSpec('Acrobat')
    adobe = win32com.client.DispatchEx('AcroExch.App')
    avDoc = win32com.client.DispatchEx('AcroExch.AVDoc')
    # Open file
    avDoc.Open(src, src)
    pdDoc = avDoc.GetPDDoc()
    jObject = pdDoc.GetJSObject()
    # Save as word document
    jObject.SaveAs(output_file, "com.adobe.acrobat.docx")
    avDoc.Close(-1)

Be mindful that the input_file and the output_file need to be full paths, as follows:

D:\OneDrive\…\file.pdf
D:\OneDrive\…\dafad.docx

answered Sep 17, 2022 at 14:27

rsc05

For Linux users with LibreOffice installed try

soffice --invisible --convert-to doc file_name.pdf

If you get an error like "Error: no export filter found, aborting", try this

soffice --infilter="writer_pdf_import" --convert-to doc file_name.pdf
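The same command can be driven from Python without going through a shell. A minimal sketch, assuming soffice is on the PATH. The --headless and --outdir options are standard soffice flags that the commands above do not use, and the function names are hypothetical:

```python
import subprocess


def build_convert_command(pdf_path, out_dir, target="doc"):
    """Build the soffice argument list for converting one PDF."""
    return [
        "soffice",
        "--headless",                    # run without a GUI
        "--infilter=writer_pdf_import",  # open the PDF with the Writer import filter
        "--convert-to", target,
        "--outdir", out_dir,
        pdf_path,
    ]


def convert_with_libreoffice(pdf_path, out_dir, target="doc"):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(pdf_path, out_dir, target), check=True)
```

Because the arguments are passed as a list, file names with spaces need no extra quoting.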

answered Nov 17, 2022 at 11:28

el2e10

Source

Project description

Extract data from PDF with PyMuPDF, e.g. text, images and drawings
Parse layout with rule, e.g. sections, paragraphs, images and tables
Generate docx with python-docx

Features

Parse and re-create page layout
- page margin
- section and column (1 or 2 columns only)
- page header and footer [TODO]
Parse and re-create paragraph
- OCR text [TODO]
- text in horizontal/vertical direction: from left to right, from bottom to top
- font style, e.g. font name, size, weight, italic and color
- text format, e.g. highlight, underline, strike-through
- list style [TODO]
- external hyper link
- paragraph horizontal alignment (left/right/center/justify) and vertical spacing
Parse and re-create image
- in-line image
- image in Gray/RGB/CMYK mode
- transparent image
- floating image, i.e. picture behind text
Parse and re-create table
- border style, e.g. width, color
- shading style, i.e. background color
- merged cells
- vertical direction cell
- table with partly hidden borders
- nested tables
Parsing pages with multi-processing

It can also be used as a tool to extract table contents since both table content and format/style is parsed.

Limitations

Text-based PDF file
Left to right language
Normal reading direction, no word transformation / rotation
Rule-based method can’t 100% convert the PDF layout

Documentation

Installation
Quickstart
- Convert PDF
- Extract table
- Command Line Interface
- Graphic User Interface
Technical Documentation (In Chinese)
API Documentation

Source

In this article we will explore how to convert PDF files into Microsoft Word docx format and vice versa using Python.

Table of contents

Introduction
Sample files
How to convert PDF files to docx format
- Convert all pages
- Convert a single page
How to convert docx files to PDF format
Conclusion

Introduction

In one of our tutorials explaining how to work with PDF files in Python, and specifically how to extract tables from PDF files, we focused on PDF files with tables. However, in a lot of instances, we don’t want to export only certain parts of the PDF; we want to convert the whole PDF file to docx to allow for editing. In this article we will see how to perform this conversion easily and efficiently.

To continue following this tutorial we will need the following Python libraries: pdf2docx and docx2pdf.

If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following commands:


pip install pdf2docx
pip install docx2pdf

Sample files

In order to follow the examples shown below, you will need to have a PDF file to make the conversion to docx format.

The examples below assume the PDF file is named sample.pdf. Place it in the same folder as your Python file.

For the docx to PDF conversion, you will need a sample docx file, named input.docx in the examples below. Place it in the same folder as your Python file as well.

How to convert PDF files to docx format using Python

Using pdf2docx library, we can perform the conversion in a few lines of code. This library is quite extensive and I encourage readers to check out their official documentation to learn more about its capabilities.

Convert all pages from PDF file to docx format using Python

Method 1:

First step is to import the required dependencies:


from pdf2docx import Converter

Second step is to define input and output paths. The input path should be the path to your PDF file, and the output path should be the path to where you would like to write out the .docx file to (in our case it’s just filenames since the code and the files are located in the same directory):


pdf_file = 'sample.pdf'
docx_file = 'sample.docx'

Third step is to convert PDF file to .docx:


cv = Converter(pdf_file)
cv.convert(docx_file)
cv.close()

And you should see the sample.docx created in the same directory.
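If convert() raises an exception, the three-line version above never reaches cv.close(). A hedged sketch of the same steps wrapped in a function that always closes the converter; the names default_docx_path and convert_pdf are hypothetical helpers, not part of pdf2docx:

```python
from pathlib import Path


def default_docx_path(pdf_file):
    """Return the PDF path with its suffix replaced by .docx."""
    return str(Path(pdf_file).with_suffix(".docx"))


def convert_pdf(pdf_file, docx_file=None):
    """Convert pdf_file to docx, closing the converter even if conversion fails."""
    # Imported here so that default_docx_path works without pdf2docx installed
    from pdf2docx import Converter

    docx_file = docx_file or default_docx_path(pdf_file)
    cv = Converter(pdf_file)
    try:
        cv.convert(docx_file)
    finally:
        cv.close()
    return docx_file
```

Calling convert_pdf('sample.pdf') would then write sample.docx next to the input file.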

Method 2:

First step is to import the required dependencies:


from pdf2docx import parse

Second step is to define input and output paths:

pdf_file = 'sample.pdf'
docx_file = 'sample.docx'

Third step is to convert PDF file to .docx:


parse(pdf_file, docx_file)

And you should see the sample.docx created in the same directory.

Convert a single page from PDF file to docx format using Python

First step is to import the required dependencies:


from pdf2docx import Converter

Second step is to define input and output paths:

pdf_file = 'sample.pdf'
docx_file = 'sample.docx'

Third step is to define a list with the numbers of the pages you would like to be converted. In our case, the PDF file has 2 pages and let’s say we would like to convert the first one. We would define a pages_list and assign the index number of the page (first page is index 0).


pages_list = [0]

Fourth step is to convert PDF file specific pages to .docx:


cv = Converter(pdf_file)
cv.convert(docx_file, pages=pages_list)
cv.close()
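Since the pages argument takes zero-based indexes while PDF viewers show page numbers starting at 1, an off-by-one mistake is easy to make. A small sketch of a conversion helper; to_zero_based is a hypothetical name, not part of pdf2docx:

```python
def to_zero_based(page_numbers):
    """Convert 1-based page numbers (as shown in a PDF viewer) to 0-based indexes."""
    if any(n < 1 for n in page_numbers):
        raise ValueError("page numbers start at 1")
    return [n - 1 for n in page_numbers]


pages_list = to_zero_based([1])  # first page only
```

The resulting pages_list can be passed to cv.convert(docx_file, pages=pages_list) as in the fourth step.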

How to convert docx files to PDF format using Python

Using docx2pdf library, we can perform the conversion in a few lines of code. This library is quite extensive and I encourage readers to check out their official documentation to learn more about its capabilities.

First step is to import the required dependencies:


from docx2pdf import convert

Second step is to define input and output paths:

docx_file = 'input.docx'
pdf_file = 'output.pdf'

Third step is to convert the .docx file to PDF:


convert(docx_file, pdf_file)

And you should see the output.pdf created in the same directory.

Conclusion

In this article we explored how to convert PDF files into Microsoft Word docx format and vice versa using Python.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Python for PDF tutorials.

Source

Recently, we got a requirement to build a Python application that converts PDF files to Word documents (.docx files). For this task, we used the pdf2docx package. But in the implementation, we got an error stating “AttributeError: ‘Page’ object has no attribute ‘rotationMatrix’. Did you mean: ‘rotation_matrix’?”.

So, in this Python tutorial, we will understand how to solve this error and convert PDF file to Docx in Python. Here is the list of topics that we are going to discuss.

Prerequisite for “Convert PDF file to Docx in Python”
Convert PDF file to Docx in Python Error
Convert PDF file to Docx in Python using Converter()
Convert PDF file to Docx in Python using parse()

Prerequisite for “Convert PDF file to Docx in Python”

Checking Python Version

Before we start converting PDF to Docx, we need to make sure that Python is properly installed on our system. We can check the Python version by running the following command in our command prompt or terminal.

python --version

For this tutorial we are using the Windows operating system, so we run the command in Windows Command Prompt. In our case, the output shows that we are using Python version 3.10.2.

Installing pdf2docx package

After this, the next prerequisite is the pdf2docx package. This Python library uses PyMuPDF, a Python binding for MuPDF, to extract data from PDF files and interpret their layout. It then uses the python-docx library to create Word document files.

Here python-docx is another useful library that is generally utilized in generating and editing Microsoft Word (.docx) files.

Now, this pdf2docx package is a 3rd party package so before using it we need to install it in our system or virtual environment.

Here are the basic steps where we will create a virtual environment and then use the pip command to install the pdf2docx package in it.

Command to create a virtual environment in Python.

python -m venv myapp

In the above command, myapp is the name of the virtual environment. However, you can also specify any other environment name as well.

The next step is to activate the virtual environment and we will use the following command for this task.

myapp\Scripts\activate

Once the virtual environment is activated, the name of the virtual environment will appear at the starting of the terminal.

Now, we are ready to install the pdf2docx package in our myapp virtual environment. For this task, we will use the following pip command.

pip install pdf2docx

Once we run the above command, it will install all the required packages related to this pdf2docx package.

Important:

In our case, it installed pdf2docx version 0.5.3, and the error which we are going to resolve occurs in that version. If you have installed some other version, you may not receive any error, or you may receive a different one.

Once we have installed the pdf2docx package, we are ready to use it in Python to convert a PDF file to a Word document with a .docx extension. For this task, pdf2docx offers 2 different methods. The first method uses the Converter() class from the package and the second method uses the parse() function.

Let us discuss each method with an example in Python.

Using Converter() class

The Converter() class utilizes PyMuPDF to read the specified PDF file and fetch page-by-page raw layout data, which includes text, images, and their associated properties.
After this, it examines the document’s layout at the page header, footer, and margin level.
Next, it parses the page layout into a docx structure. Finally, it uses “python-docx” to generate the docx file.

Let us understand how to use this Converter() class to convert a PDF to a word document in Python.

# Importing the Converter() class
from pdf2docx import Converter

# Specifying the pdf & docx files
pdf_file = 'muscles.pdf'
docx_file = 'muscles.docx'


try:
    # Converting PDF to Docx
    cv_obj = Converter(pdf_file)
    cv_obj.convert(docx_file)
    cv_obj.close()

except:
    print('Conversion Failed')
    
else:
    print('File Converted Successfully')

In the above code, first, we have imported the Converter() class from the pdf2docx module. After this, we defined 2 variables to specify the file and path for both the pdf file we want to convert and also the resultant word file. For the current instance, we have kept the muscles.pdf file in the same working directory where we kept the python file.

Next, we created an object of Converter() class named cv_obj where we passed the pdf_file variable as an argument. And then the object utilized the convert() method to convert the file. Moreover, within the convert() method, we passed the docx_file variable as an argument.

Finally, we utilized the close() method to close the file. When we run the Python program, it creates a new docx file named muscles.docx containing the data from the pdf file.

Convert PDF file to Docx in Python Error

Now, if you have also installed pdf2docx version 0.5.3, there is a high possibility that the conversion fails at your end too and the program prints “Conversion Failed”. This happens because an exception is raised during execution and the try-except block catches it.

Here, if we remove the try-except block and then execute the program, it will return the following error.

AttributeError: ‘Page’ object has no attribute ‘rotationMatrix’. Did you mean: ‘rotation_matrix’?

Now, to resolve the above error in Python, we need to follow the following steps.

First, go inside the virtual environment directory and open its site-packages folder. Here is the path at our end: “D:\project\myapp\Lib\site-packages”. If you have not used a virtual environment, go to the following path instead: C:\users\UserName\appdata\roaming\python\python310\site-packages.

In the second path, please insert your username and python directory properly.

Next, open the pdf2docx directory and go to the page directory and open the RawPage.py file.

In the file, go to the line number 279 where it shows Element.set_rotation_matrix(self.fitz_page.rotationMatrix).
Now, we need to replace rotationMatrix with rotation_matrix and then save and close the file.
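Before editing files inside site-packages, it can help to confirm which pdf2docx version is actually installed. A minimal sketch using the standard library's importlib.metadata; the helper names are hypothetical, and the version string is the one this tutorial reports the error in:

```python
from importlib import metadata

AFFECTED_VERSION = "0.5.3"  # the version named in this tutorial


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_affected(version):
    """True only for the version this tutorial reports the error in."""
    return version == AFFECTED_VERSION


version = installed_version("pdf2docx")
print("pdf2docx version:", version, "| affected:", is_affected(version))
```

If the check reports a different version, the line number and the error described above may not apply to your installation.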

After implementing the above steps, we can rerun the previous example to convert the muscles.pdf file into the muscles.docx file.

Using parse() function

As an alternative to the Converter() class, we can utilize the parse() function from the pdf2docx module, which converts a pdf file into a word document directly.

For implementation, we may need to use the following syntax of the parse() function.

parse(pdf_file_path, docx_file_path, start=page_no, end=page_no)

The parse() function accepts 4 arguments, and the explanation of each is given below.

The pdf_file_path argument defines the file name and path of the PDF file that we want to convert.
The docx_file_path argument defines the file name and path of the word file that we want as the result.
The start parameter specifies the page of the pdf file from which we want to start the conversion.
Finally, the end argument specifies the ending page of the pdf file, and the function converts the pages in the specified range.
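The start and end parameters can be illustrated with a small helper. This sketch assumes that start is a zero-based index and that end is exclusive, which is how the pdf2docx quickstart describes them; please verify this against the version you have installed. The name page_range_kwargs is hypothetical:

```python
def page_range_kwargs(first_page, last_page):
    """Map a 1-based, inclusive page range to start/end keyword arguments.

    Assumes start is zero-based and end is exclusive (an assumption to verify).
    """
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return {"start": first_page - 1, "end": last_page}
```

Under that assumption, parse(pdf_file, docx_file, **page_range_kwargs(1, 2)) would convert the first two pages.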

Next, to understand the above syntax, let us execute a sample example in Python. And the code for the example is given below.

# Importing the parse() function
from pdf2docx import parse

# Specifying the pdf & docx files
pdf_file = 'muscles.pdf'
docx_file = 'new_muscles.docx'

try:
    # Converting PDF to Docx
    parse(pdf_file, docx_file)
    
except:
    print('Conversion Failed')
    
else:
    print('File Converted Successfully')

In the above example, first, we imported the parse() function from the pdf2docx package. After this, we defined two variables, just like the previous example to specify the file name and path for both pdf and docx files.

Next, we utilized the parse() function where the first argument is the pdf_file variable representing the pdf file and the second argument is docx_file representing the docx file.

Moreover, we have kept this code within the try-except-else block to handle any exception that is raised. In the end, it converts the muscles.pdf file and generates the new_muscles.docx file.

So, in this Python tutorial, we understood how to convert PDF file to docx in Python. Here is the list of topics that we have covered in this tutorial.

Prerequisite for “Convert PDF file to Docx in Python”
Convert PDF file to Docx in Python Error
Convert PDF file to Docx in Python using Converter()
Convert PDF file to Docx in Python using parse()


Source