How to convert PDF to Word in Python

Method #2: Convert PDFs to Word Using the GroupDocs Python SDK

GroupDocs offers a Cloud SDK for Python that converts PDF to Word through a hosted API. Unlike the PyPDF2 library covered in the previous method, GroupDocs can process a richly formatted PDF and retain the original formatting in the converted Word file to a high degree. The conversion runs on the GroupDocs cloud, so nothing besides the Python package needs to be installed locally, but you do need an account and an internet connection.

Saving a PDF as Word is not a trivial task, but this Python module does the heavy lifting for you. The steps below walk through the whole process and assume no prior experience with the SDK.

Step 1: Get your APP SID and APP KEY. Sign up for free at https://dashboard.groupdocs.cloud. Once signed in, you should find the APP SID and APP KEY in the “My Apps” tab under the “Manage My Apps” sub-tab. The script in Step 4 needs both values, so have them ready.
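If you would rather not paste the APP SID and APP KEY directly into a script, one option is to read them from environment variables. The sketch below is only an illustration: the variable names GROUPDOCS_APP_SID and GROUPDOCS_APP_KEY are made up for this example and are not something the SDK looks for on its own.

```python
import os

# Hypothetical variable names; any names work as long as the script and
# your shell agree on them.
SID_VAR = "GROUPDOCS_APP_SID"
KEY_VAR = "GROUPDOCS_APP_KEY"


def load_credentials(env=None):
    """Return (app_sid, app_key) read from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in (SID_VAR, KEY_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[SID_VAR], env[KEY_VAR]
```

The returned pair can then be passed to the SDK in place of hard-coded strings, which keeps the keys out of any script you share.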

Step 2: Create a directory and place the PDF file in it. For convenience purposes, it is advisable to create a fresh directory with a preferred name and then place the target PDF document inside.

Step 3: Install the GroupDocs package. Install the “groupdocs-conversion-cloud” Python package from the command line. On Windows, open the directory from Step 2 in File Explorer, type “cmd” in the address bar, and press “Enter”. In the command prompt that opens, type the command below and press “Enter”.

pip install groupdocs-conversion-cloud

The installation normally finishes within a few moments, after which you can move on to the next step.
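To confirm that the package landed in the Python environment you intend to use, a quick check like the following can help. It is a small sketch that relies only on the standard library; the module name groupdocs_conversion_cloud is the one imported by the script in Step 4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def report(module_name="groupdocs_conversion_cloud"):
    """Build a one-line status message for the given module."""
    status = "available" if is_installed(module_name) else "NOT installed"
    return "{}: {}".format(module_name, status)
```

Calling print(report()) from the same interpreter that will run the script shows whether pip installed the package where you expected.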

Step 4: Create the required Python script. In the same directory as the PDF file, create a new “.py” script with the code below. Replace the placeholder “app_sid” and “app_key” values with the ones you were assigned when you signed up, and replace “filename” with the name of your PDF file. For convenience, save the Python script as “groupdocs.py”.

# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxx-xxxx-xxxx-xxxx-xxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instances of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:
    # Upload the source file to storage
    filename = 'filename.pdf'
    remote_name = 'filename.pdf'
    output_name = 'filename.docx'
    strformat = 'docx'

    request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name, filename)
    response_upload = file_api.upload_file(request_upload)

    # Convert the PDF to a Word document
    settings = groupdocs_conversion_cloud.ConvertSettings()
    settings.file_path = remote_name
    settings.format = strformat
    settings.output_path = output_name

    request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
    response = convert_api.convert_document(request)

    print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
    print("Exception when calling convert_document: {0}".format(e.message))

Double-check that you have made the required changes before saving the Python script, then head over to the next step. In summary, the script imports the Python package, initializes the API, uploads the source PDF file, converts the PDF to Word, and prints the response information.
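The script uses three related strings (the local file name, the name used in cloud storage, and the output name), and it is easy to mistype one of them. As a hedged convenience, the helper below derives all of them from a single path. The function name conversion_names is invented for this sketch and is not part of the GroupDocs SDK.

```python
from pathlib import Path


def conversion_names(local_pdf, target_format="docx"):
    """Derive (filename, remote_name, output_name, strformat) for a local PDF."""
    path = Path(local_pdf)
    if path.suffix.lower() != ".pdf":
        raise ValueError("Expected a .pdf file, got: {}".format(local_pdf))
    remote_name = path.name                        # name used in cloud storage
    output_name = path.stem + "." + target_format  # e.g. report.docx
    return str(path), remote_name, output_name, target_format
```

The four returned values line up with the filename, remote_name, output_name and strformat variables in the script, so they can be unpacked in that order.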

Step 5: Run the Python script. In the command prompt opened earlier, type the command below, press “Enter”, and wait for it to complete.

python groupdocs.py

To confirm that the process has completed as expected, you should see output containing a successful conversion message along with the file path, size, and URL.

Step 6: Download the converted Word document. On your preferred web browser, navigate to https://dashboard.groupdocs.cloud/ and then head over to the “My Files” tab. This will open up your “Storage” and list all the files available and it is here that you will find the uploaded PDF file and the converted Word file. Simply tick the box on the left-hand side of the DOCX file and then hit the “Download” button. You will be asked to confirm whether you are sure you want to download the checked files and all you need to do here is click the “Yes” button.

The download process will start momentarily after which you will be able to find the Word document in your default downloads folder. From there, you can open the file and perform further actions that you deem necessary. This will mark the end of your task to save PDF as Word using the GroupDocs Python SDK.


With a free account, GroupDocs gives you a reliable way to convert PDF files to Word while retaining the original layout and formatting to a high degree. This reduces the corrections needed after the conversion.

You can rely on Python whenever the need to convert PDF to Word arises, thanks to the libraries featured in this article, and the step-by-step guides keep the learning curve shallow.

So whenever you need to save a PDF as Word, there is no need to go looking for full-fledged software when Python can do the job. Pick the tool that suits you best and follow its guide.

How to convert a PDF file to docx? Is there a way of doing this using Python?

I’ve seen some pages that allow a user to upload a PDF and return a DOC file, like PdfToWord.

Thanks in advance


asked Oct 14, 2014 at 10:16 by AlvaroAV

If you have LibreOffice installed

lowriter --invisible --convert-to doc '/your/file.pdf'

If you want to use Python for this:

import os
import subprocess

for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            # Pass the arguments as a list so paths with spaces or quotes are handled safely
            subprocess.call(['lowriter', '--invisible', '--convert-to', 'doc', abspath])

answered Oct 14, 2014 at 10:30


This is difficult because PDFs are presentation oriented and word documents are content oriented. I have tested both and can recommend the following projects.

  1. PyPDF2
  2. PDFMiner

However, you are most definitely going to lose presentational aspects in the conversion.

answered Oct 14, 2014 at 10:30 by ham-sandwich

If you want to convert PDF -> MS Word type file like docx, I came across this.

Ahsin Shabbir wrote:

import glob
import os
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.visible = 0

pdfs_path = ""  # folder where the .pdf files are stored
reqs_path = ""  # output folder for the .docx files (not defined in the original snippet)
for doc in glob.iglob(os.path.join(pdfs_path, "*.pdf")):
    print(doc)
    filename = os.path.basename(doc)
    in_file = os.path.abspath(doc)
    print(in_file)
    wb = word.Documents.Open(in_file)
    out_file = os.path.abspath(os.path.join(reqs_path, filename[0:-4] + ".docx"))
    print("outfile", out_file)
    wb.SaveAs2(out_file, FileFormat=16)  # file format for docx
    print("success...")
    wb.Close()

word.Quit()

This worked like a charm for me: it converted a 500-page PDF with formatting and images.

answered Apr 6, 2020 at 19:06 by eleks007

You can use the GroupDocs.Conversion Cloud SDK for Python without installing any third-party tool or software.

Sample Python code:

# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:

        # upload source file to storage
        filename = 'Sample.pdf'
        remote_name = 'Sample.pdf'
        output_name= 'sample.docx'
        strformat='docx'

        request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name,filename)
        response_upload = file_api.upload_file(request_upload)
        #Convert PDF to Word document
        settings = groupdocs_conversion_cloud.ConvertSettings()
        settings.file_path =remote_name
        settings.format = strformat
        settings.output_path = output_name

        loadOptions = groupdocs_conversion_cloud.PdfLoadOptions()
        loadOptions.hide_pdf_annotations = True
        loadOptions.remove_embedded_files = False
        loadOptions.flatten_all_fields = True

        settings.load_options = loadOptions

        convertOptions = groupdocs_conversion_cloud.DocxConvertOptions()
        convertOptions.from_page = 1
        convertOptions.pages_count = 1

        settings.convert_options = convertOptions
        request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
        response = convert_api.convert_document(request)

        print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
        print("Exception when calling convert_document: {0}".format(e.message))

I’m a developer evangelist at Aspose.

answered Nov 7, 2019 at 15:29 by Tilal Ahmad

Based on the previous answers, this was the solution that worked best for me using Python 3.7.1:

import win32com.client
import os

# INPUT/OUTPUT PATH
pdf_path = r"""C:path2pdf.pdf"""
output_path = r"""C:output_folder"""

word = win32com.client.Dispatch("Word.Application")
word.visible = 0  # CHANGE TO 1 IF YOU WANT TO SEE WORD APPLICATION RUNNING AND ALL MESSAGES OR WARNINGS SHOWN BY WORD

# GET FILE NAME AND NORMALIZED PATH
filename = os.path.basename(pdf_path)
in_file = os.path.abspath(pdf_path)

# CONVERT PDF TO DOCX AND SAVE IT ON THE OUTPUT PATH WITH THE SAME INPUT FILE NAME
wb = word.Documents.Open(in_file)
out_file = os.path.abspath(os.path.join(output_path, filename[0:-4] + ".docx"))
wb.SaveAs2(out_file, FileFormat=16)
wb.Close()
word.Quit()
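Both Word-based answers build the output path by splitting on a backslash and slicing off the last four characters. A slightly more robust way to do the same thing is sketched below with pathlib. The helper name docx_path_for is invented for this example; it uses PureWindowsPath so the path logic can be tried on any operating system, even though the Word automation itself only runs on Windows.

```python
from pathlib import PureWindowsPath


def docx_path_for(pdf_path, output_dir):
    """Return the .docx path in output_dir that matches a Windows PDF path."""
    pdf = PureWindowsPath(pdf_path)
    if pdf.suffix.lower() != ".pdf":
        raise ValueError("Not a PDF path: {}".format(pdf_path))
    # stem drops only the final extension, so "report.final.pdf" stays intact
    return str(PureWindowsPath(output_dir) / (pdf.stem + ".docx"))
```

The result can be passed to SaveAs2 in place of the hand-built out_file string.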

answered Aug 4, 2021 at 12:33 by Jonny_P

With Adobe on your machine

If you have Adobe Acrobat on your machine, you can use the following function to save the PDF file as a docx file:

# Open PDF file, use Acrobat Exchange to save file as .docx file.

import win32com.client, win32com.client.makepy, os, winerror, errno, re
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

def PDF_to_Word(input_file, output_file):
    
    ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
    src = os.path.abspath(input_file)
    
    # Launch Adobe
    win32com.client.makepy.GenerateFromTypeLibSpec('Acrobat')
    adobe = win32com.client.DispatchEx('AcroExch.App')
    avDoc = win32com.client.DispatchEx('AcroExch.AVDoc')
    # Open file
    avDoc.Open(src, src)
    pdDoc = avDoc.GetPDDoc()
    jObject = pdDoc.GetJSObject()
    # Save as word document
    jObject.SaveAs(output_file, "com.adobe.acrobat.docx")
    avDoc.Close(-1)

Be mindful that input_file and output_file need to be in the following form:

  1. D:OneDrive…file.pdf
  2. D:OneDrive…dafad.docx

answered Sep 17, 2022 at 14:27 by rsc05

For Linux users with LibreOffice installed, try:

soffice --invisible --convert-to doc file_name.pdf

If you get an error like “Error: no export filter found, abording”, try this:

soffice --infilter="writer_pdf_import" --convert-to doc file_name.pdf
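For anyone calling LibreOffice from Python, the command above can be wrapped as sketched below. This is not taken from any of the answers: the function names are made up, docx is chosen as the target instead of doc, and the --headless and --outdir flags are additions that I believe current LibreOffice builds accept. Treat it as a starting point under those assumptions.

```python
import shutil
import subprocess


def build_convert_command(pdf_path, out_dir, binary="soffice", target="docx"):
    """Build the LibreOffice command line as an argument list (no shell quoting needed)."""
    return [
        binary,
        "--headless",
        "--infilter=writer_pdf_import",
        "--convert-to", target,
        "--outdir", out_dir,
        pdf_path,
    ]


def convert_with_libreoffice(pdf_path, out_dir, binary="soffice"):
    """Run the conversion; raises if LibreOffice is missing or exits with an error."""
    if shutil.which(binary) is None:
        raise FileNotFoundError("{} not found on PATH".format(binary))
    subprocess.run(build_convert_command(pdf_path, out_dir, binary), check=True)
```

Building the command as a list avoids shell=True, so file names containing spaces or quotes do not need escaping.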

answered Nov 17, 2022 at 11:28 by el2e10

Tonight's entertainment is to show you how to use the pdf2docx library to convert PDF files to the docx format.

Our task is to develop a Python module that converts one or several PDF files located in a single folder, in the form of a lightweight command-line utility that does not rely on any external utilities outside the Python ecosystem.

pdf2docx is a Python library that extracts data from PDF with PyMuPDF, parses the layout with rules, and generates a docx file with python-docx. python-docx is another library, used by pdf2docx to create and update Microsoft Word (.docx) files.

In short, let's begin:

$ pip install pdf2docx==0.5.1

Import the libraries we need:

# Import libraries
from pdf2docx import parse
from typing import Tuple

Define the function responsible for converting PDF to Docx:

def convert_pdf2docx(input_file: str, output_file: str, pages: Tuple = None):
    """Converts PDF to DOCX"""
    if pages:
        pages = [int(i) for i in pages if str(i).isnumeric()]
    result = parse(pdf_file=input_file,
                   docx_with_path=output_file, pages=pages)
    summary = {
        "Source file": input_file, "Pages": str(pages), "Conversion result": output_file
    }
    # Print the summary
    print("#### Report #######################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################")
    return result

The convert_pdf2docx() function lets you specify the pages to convert; it converts the PDF file into a Docx file and prints a report on its work at the end.

Let's write a wrapper to call this function:

if __name__ == "__main__":
    import sys
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    convert_pdf2docx(input_file, output_file)

We simply use Python's built-in sys module to get the input and output file names from the command-line arguments. Let's try converting a sample PDF file:

$ python convert_pdf2docx.py Anketa_0.pdf Anketa_0.docx

A new file, Anketa_0.docx, will appear in the current directory, and the output will look like this:

Parsing Page 1: 1/3...
Parsing Page 2: 2/3...
Parsing Page 3: 3/3...

Creating Page 1: 1/3...
Creating Page 2: 2/3...
Creating Page 3: 3/3...
--------------------------------------------------
Terminated in 0.9915917019999999s.
#### Report #######################################################
Source file:Anketa_0.pdf
Pages:None
Conversion result:Anketa_0.docx
###################################################################

You can also selectively specify the pages you need in the convert_pdf2docx() function.

I hope you enjoyed this short and simple lesson and that this converter will be useful to you.

Based on material from How to Convert PDF to Docx in Python


CC BY-NC 4.0
How to convert PDF files to doc with Python, published by К ВВ, licensed under Creative Commons Attribution-NonCommercial 4.0 International.



 

Abdou Rockikz · 3 min read · Updated Jul 2022 · PDF File Handling


In this tutorial, we will use the pdf2docx library to convert PDF files to the docx format.

The goal is to develop a lightweight command-line utility, built from Python modules without relying on external utilities outside the Python ecosystem, that converts one PDF file or a collection of PDF files located within a folder.

pdf2docx is a Python library that extracts data from PDF with PyMuPDF, parses the layout with rules, and generates a docx file with python-docx. python-docx is another library, used by pdf2docx to create and update Microsoft Word (.docx) files.

Let's install the requirement:

$ pip install pdf2docx==0.5.1

Let’s start by importing the modules:

# Import Libraries
from pdf2docx import parse
from typing import Tuple

Let’s define the function responsible for converting PDF to Docx:

def convert_pdf2docx(input_file: str, output_file: str, pages: Tuple = None):
    """Converts pdf to docx"""
    if pages:
        pages = [int(i) for i in pages if str(i).isnumeric()]
    result = parse(pdf_file=input_file,
                   docx_with_path=output_file, pages=pages)
    summary = {
        "File": input_file, "Pages": str(pages), "Output File": output_file
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################")
    return result

The convert_pdf2docx() function converts a PDF file into a Docx file, optionally restricted to the pages you specify, and prints a summary of the conversion at the end.
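Since the function takes its pages as a sequence, a small parser can turn a human-friendly specification such as "1,3-5" into that sequence. The sketch below is an illustration only: parse_page_spec is a made-up helper, and it assumes that pdf2docx counts pages from zero, which is my understanding of the library's convention and worth checking against the version you install.

```python
def parse_page_spec(spec):
    """Turn a 1-based spec like "1,3-5" into a sorted list of zero-based indexes."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = (int(p) for p in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError("Invalid page range: {!r}".format(part))
        pages.update(range(first - 1, last))
    return sorted(pages)
```

The resulting list can be passed as the pages argument of convert_pdf2docx().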

Let’s use it now:

if __name__ == "__main__":
    import sys
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    convert_pdf2docx(input_file, output_file)

We simply use Python’s built-in sys module to get the input and output file names from command-line arguments. Let’s try to convert a sample PDF file:

$ python convert_pdf2docx.py letter.pdf letter.docx

A new letter.docx file will appear in the current directory, and the output will be like this:

Parsing Page 1: 1/1...
Creating Page 1: 1/1...
--------------------------------------------------
Terminated in 0.10869679999999998s.
## Summary ########################################################
File:letter.pdf
Pages:None
Output File:letter.docx
###################################################################
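The sys.argv wrapper works, but it gives no help text and no way to pass pages. One possible extension is sketched below with argparse. The --pages option and the run() helper are inventions for this example; the conversion function is passed in as an argument so that the snippet stays self-contained, and in practice you would pass the convert_pdf2docx function defined earlier.

```python
import argparse


def build_parser():
    """Create the command-line parser for the converter."""
    parser = argparse.ArgumentParser(description="Convert a PDF file to docx.")
    parser.add_argument("input_file", help="path to the source PDF")
    parser.add_argument("output_file", help="path of the docx file to create")
    parser.add_argument("--pages", nargs="*", default=None,
                        help="page indexes to convert (default: all pages)")
    return parser


def run(argv, convert):
    """Parse argv and call convert(input_file, output_file, pages=...)."""
    args = build_parser().parse_args(argv)
    # Pages stay as strings here; convert_pdf2docx turns numeric ones into ints.
    pages = tuple(args.pages) if args.pages else None
    return convert(args.input_file, args.output_file, pages=pages)
```

Inside the script, the main block could then call run(sys.argv[1:], convert_pdf2docx).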

You can also specify the pages you want in the convert_pdf2docx() function.

I hope you enjoyed this short tutorial and you found this converter useful.
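The stated goal also covers a collection of PDF files within a folder, which the script above does not handle yet. A minimal sketch of that part follows. The helper names plan_batch and convert_folder are invented here, and the conversion function is passed in as a parameter so that convert_pdf2docx (or any other converter) can be plugged in.

```python
from pathlib import Path


def plan_batch(input_dir, output_dir):
    """Return sorted (pdf_path, docx_path) pairs for every PDF directly inside input_dir."""
    out = Path(output_dir)
    pairs = []
    for pdf in sorted(Path(input_dir).iterdir()):
        if pdf.is_file() and pdf.suffix.lower() == ".pdf":
            pairs.append((pdf, out / (pdf.stem + ".docx")))
    return pairs


def convert_folder(input_dir, output_dir, convert):
    """Call convert(input_file, output_file) for each planned pair."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for pdf, docx in plan_batch(input_dir, output_dir):
        convert(str(pdf), str(docx))
```

For example, convert_folder("pdfs", "docx_out", convert_pdf2docx) would convert every PDF in a pdfs folder; subfolders are deliberately not searched in this sketch.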


Happy coding ♥


pdf2docx project description


  • Extract data from PDF with PyMuPDF, e.g. text, images and drawings
  • Parse layout with rules, e.g. sections, paragraphs, images and tables
  • Generate docx with python-docx

Features

  • Parse and re-create page layout

    • page margin
    • section and column (1 or 2 columns only)
    • page header and footer [TODO]
  • Parse and re-create paragraph

    • OCR text [TODO]
    • text in horizontal/vertical direction: from left to right, from bottom to top
    • font style, e.g. font name, size, weight, italic and color
    • text format, e.g. highlight, underline, strike-through
    • list style [TODO]
    • external hyper link
    • paragraph horizontal alignment (left/right/center/justify) and vertical spacing
  • Parse and re-create image

    • in-line image
    • image in Gray/RGB/CMYK mode
    • transparent image
    • floating image, i.e. picture behind text
  • Parse and re-create table

    • border style, e.g. width, color
    • shading style, i.e. background color
    • merged cells
    • vertical direction cell
    • table with partly hidden borders
    • nested tables
  • Parsing pages with multi-processing

It can also be used as a tool to extract table contents, since both table content and format/style are parsed.
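A rough sketch of table extraction is shown below. It assumes the Converter class and its extract_tables() and close() methods as I recall them from the pdf2docx quickstart, so check them against the version you have installed. The wrapper name extract_tables and the converter_cls parameter are made up for this example; the parameter only exists so the helper can be exercised with a stand-in class.

```python
def extract_tables(pdf_path, start=0, end=None, converter_cls=None):
    """Return the tables found in the given page range as nested lists."""
    if converter_cls is None:
        # Third-party import, deferred so the helper can be tried with a stand-in class.
        from pdf2docx import Converter
        converter_cls = Converter
    converter = converter_cls(pdf_path)
    try:
        return converter.extract_tables(start=start, end=end)
    finally:
        # Always release the underlying PDF, even if extraction fails.
        converter.close()
```

Each returned table is expected to be a list of rows, which makes it straightforward to hand over to the csv module or a DataFrame.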

Limitations

  • Text-based PDF file
  • Left to right language
  • Normal reading direction, no word transformation / rotation
  • Rule-based method can’t 100% convert the PDF layout

Documentation

  • Installation
  • Quickstart
    • Convert PDF
    • Extract table
    • Command Line Interface
    • Graphic User Interface
  • Technical Documentation (In Chinese)
  • API Documentation
