Method #2: Convert PDFs to Word Using GroupDocs Python SDK
GroupDocs offers a Cloud SDK for Python that converts PDF to Word through a hosted service. Unlike the PyPDF2 library from the previous method, GroupDocs can process a richly formatted PDF and retain the original formatting in the converted Word file. Nothing beyond the Python package has to be installed locally, but the file is uploaded to GroupDocs storage and converted there, so you need an account and an internet connection.
The process to save PDF as Word involves several steps, but the guide below walks through each one so that even a novice can follow along. Here is how to convert PDF to Word using the GroupDocs Python SDK.
Step 1: Get your APP SID and APP KEY. Simply sign up for free with https://dashboard.groupdocs.cloud and once you do so, you should be able to find the APP SID and APP KEY in the “My Apps” tab under the “Manage My Apps” sub-tab. These are necessary for the success of the process so have them ready.
Step 2: Create a directory and place the PDF file in it. For convenience purposes, it is advisable to create a fresh directory with a preferred name and then place the target PDF document inside.
Step 3: Install the necessary GroupDocs package. To achieve this, we are going to install the “groupdocs-conversion-cloud” Python package using the command line. So, from the file explorer address bar, type in the word “cmd” and hit the “Enter” key. In the resulting command-driven interface, type in the command below and hit the “Enter” key on the keyboard.
pip install groupdocs-conversion-cloud
The installation should finish in a few moments, after which you can move on to the next step.
Step 4: Create the required Python script. In the same directory as the PDF file, create a new “.py” script with the code below, which performs the upload and the conversion. Replace the “app_sid” and “app_key” values with the ones you were assigned when you signed up, and replace “filename” with the name of your PDF file. For convenience, save the Python script with the filename “groupdocs.py”.
# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxx-xxxx-xxxx-xxxx-xxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instances of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:
    # Upload the source file to storage
    filename = 'filename.pdf'
    remote_name = 'filename.pdf'
    output_name = 'filename.docx'
    strformat = 'docx'
    request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name, filename)
    response_upload = file_api.upload_file(request_upload)

    # Convert the PDF to a Word document
    settings = groupdocs_conversion_cloud.ConvertSettings()
    settings.file_path = remote_name
    settings.format = strformat
    settings.output_path = output_name
    request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
    response = convert_api.convert_document(request)
    print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
    print("Exception when calling convert_document: {0}".format(e.message))
Double-check that you have made the necessary changes as required before saving the Python script. Once you have confirmed the highlights, head over to the next step. As a quick summary, this script will import the needed Python package, initialize the API, upload the source PDF file, convert the PDF to Word, and then deliver the output information.
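Hard-coding the keys in the script makes them easy to leak if the file is ever shared. As a minimal sketch, they can be read from environment variables instead. The variable names GROUPDOCS_APP_SID and GROUPDOCS_APP_KEY and the helper load_credentials are assumptions made for this illustration; they are not defined by the SDK.

```python
import os


def load_credentials(env=None):
    """Return (app_sid, app_key) read from environment variables.

    The variable names below are hypothetical; use whatever fits your setup.
    """
    env = os.environ if env is None else env
    app_sid = env.get("GROUPDOCS_APP_SID", "")
    app_key = env.get("GROUPDOCS_APP_KEY", "")
    if not app_sid or not app_key:
        raise RuntimeError(
            "Set GROUPDOCS_APP_SID and GROUPDOCS_APP_KEY before running the script"
        )
    return app_sid, app_key
```

In groupdocs.py, the two hard-coded assignments could then be replaced with app_sid, app_key = load_credentials().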
Step 5: Run the Python script. It is now time to let the script work its magic by running it from the command line that we opened. To do that, type the command below, hit the “Enter” key, and be patient for it to complete.
python groupdocs.py
To confirm that the process has completed as expected, check the output: it should show a successful conversion message along with the file path, size, and URL.
Step 6: Download the converted Word document. On your preferred web browser, navigate to https://dashboard.groupdocs.cloud/ and then head over to the “My Files” tab. This will open up your “Storage” and list all the files available and it is here that you will find the uploaded PDF file and the converted Word file. Simply tick the box on the left-hand side of the DOCX file and then hit the “Download” button. You will be asked to confirm whether you are sure you want to download the checked files and all you need to do here is click the “Yes” button.
The download process will start momentarily after which you will be able to find the Word document in your default downloads folder. From there, you can open the file and perform further actions that you deem necessary. This will mark the end of your task to save PDF as Word using the GroupDocs Python SDK.
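If you would rather fetch the converted file from a script than from the dashboard, the SDK's file API may be able to do it. The sketch below assumes that groupdocs_conversion_cloud provides a DownloadFileRequest class and that file_api.download_file returns the path of a temporary local file; both are assumptions to verify against the SDK documentation for your installed version. The helpers local_target and download_converted are hypothetical names used only for this illustration.

```python
import shutil
from pathlib import Path


def local_target(remote_name, download_dir):
    """Return the local path where a file from storage will be saved."""
    return Path(download_dir) / Path(remote_name).name


def download_converted(file_api, remote_name, download_dir="."):
    """Download remote_name from GroupDocs storage into download_dir.

    Assumption: DownloadFileRequest exists and download_file returns the
    path of a temporary file, as recalled from the SDK examples.
    """
    # Imported lazily so local_target works without the SDK installed.
    import groupdocs_conversion_cloud

    request = groupdocs_conversion_cloud.DownloadFileRequest(remote_name)
    temp_path = file_api.download_file(request)
    target = local_target(remote_name, download_dir)
    shutil.move(str(temp_path), str(target))
    return target
```

With the script from Step 4, this could be called as download_converted(file_api, output_name) right after the conversion succeeds.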
With the free registration, GroupDocs gives you a reliable way to extract data from PDF files while retaining the original layout and formatting to a high degree. This reduces the need for corrections after the conversion and keeps the whole process efficient.
At the end of the day, you can comfortably take advantage of the Python programming language anytime the need to convert PDF to Word arises, all thanks to these awesome libraries featured in this article. The guides on how to tackle the task at hand ensure that you do not encounter a steep learning curve when you decide to make the most out of these tools.
Therefore, whenever you need to save a PDF as Word, there is no need to hunt for fully fledged software when you can do it easily using Python. Pick the tool that suits you best, follow the guide on how to use it, and you should get the results you are looking for.
How to convert a PDF file to docx. Is there a way of doing this using Python?
I’ve seen some pages that allow users to upload a PDF and return a DOC file, like PdfToWord.
Thanks in advance
rsc05
asked Oct 14, 2014 at 10:16
If you have LibreOffice installed
lowriter --invisible --convert-to doc '/your/file.pdf'
If you want to use Python for this:
import os
import subprocess

for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            subprocess.call('lowriter --invisible --convert-to doc "{}"'
                            .format(abspath), shell=True)
answered Oct 14, 2014 at 10:30
This is difficult because PDFs are presentation oriented and word documents are content oriented. I have tested both and can recommend the following projects.
- PyPDF2
- PDFMiner
However, you are most definitely going to lose presentational aspects in the conversion.
answered Oct 14, 2014 at 10:30
ham-sandwich
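To make the loss of presentation concrete, here is a minimal sketch of a text-only conversion. It assumes PyPDF2 2.x or later (for PdfReader) and python-docx are installed; the helper names split_paragraphs and pdf_text_to_docx are made up for this illustration. Only the text survives: fonts, images, tables, and columns are not carried over.

```python
def split_paragraphs(text):
    """Split extracted text into paragraphs on blank lines."""
    blocks = [block.strip() for block in text.split("\n\n")]
    # Collapse line breaks and repeated spaces inside each paragraph.
    return [" ".join(block.split()) for block in blocks if block]


def pdf_text_to_docx(pdf_path, docx_path):
    """Copy only the text of a PDF into a new Word file (layout is lost)."""
    # Imported lazily so split_paragraphs works without the libraries.
    from PyPDF2 import PdfReader  # pip install PyPDF2
    from docx import Document     # pip install python-docx

    reader = PdfReader(pdf_path)
    document = Document()
    for page in reader.pages:
        # extract_text() may yield nothing for image-only pages.
        for paragraph in split_paragraphs(page.extract_text() or ""):
            document.add_paragraph(paragraph)
    document.save(docx_path)
```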
If you want to convert PDF -> MS Word type file like docx, I came across this.
Ahsin Shabbir wrote:
import glob
import os
import win32com.client

word = win32com.client.Dispatch("Word.Application")
word.visible = 0

pdfs_path = ""  # folder where the .pdf files are stored
reqs_path = ""  # folder where the .docx files are saved (left undefined in the original snippet)

for doc in glob.iglob(os.path.join(pdfs_path, "*.pdf")):
    print(doc)
    filename = os.path.basename(doc)
    in_file = os.path.abspath(doc)
    print(in_file)
    wb = word.Documents.Open(in_file)
    out_file = os.path.abspath(os.path.join(reqs_path, filename[0:-4] + ".docx"))
    print("outfile\n", out_file)
    wb.SaveAs2(out_file, FileFormat=16)  # file format for docx
    print("success...")
    wb.Close()

word.Quit()
This worked like a charm for me; it converted a 500-page PDF with formatting and images.
answered Apr 6, 2020 at 19:06
eleks007
You can use GroupDocs.Conversion Cloud SDK for Python without installing any third-party tool or software.
Sample Python code:
# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instances of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:
    # Upload the source file to storage
    filename = 'Sample.pdf'
    remote_name = 'Sample.pdf'
    output_name = 'sample.docx'
    strformat = 'docx'
    request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name, filename)
    response_upload = file_api.upload_file(request_upload)

    # Convert PDF to Word document
    settings = groupdocs_conversion_cloud.ConvertSettings()
    settings.file_path = remote_name
    settings.format = strformat
    settings.output_path = output_name

    # Options applied while loading the PDF
    load_options = groupdocs_conversion_cloud.PdfLoadOptions()
    load_options.hide_pdf_annotations = True
    load_options.remove_embedded_files = False
    load_options.flatten_all_fields = True
    settings.load_options = load_options

    # Convert only the first page
    convert_options = groupdocs_conversion_cloud.DocxConvertOptions()
    convert_options.from_page = 1
    convert_options.pages_count = 1
    settings.convert_options = convert_options

    request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
    response = convert_api.convert_document(request)
    print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
    print("Exception when calling convert_document: {0}".format(e.message))
I’m a developer evangelist at Aspose.
answered Nov 7, 2019 at 15:29
Tilal Ahmad
Based on previous answers, this was the solution that worked best for me using Python 3.7.1:
import os
import win32com.client

# INPUT/OUTPUT PATH
pdf_path = r"C:\path2pdf.pdf"
output_path = r"C:\output_folder"

word = win32com.client.Dispatch("Word.Application")
word.visible = 0  # CHANGE TO 1 IF YOU WANT TO SEE WORD APPLICATION RUNNING AND ALL MESSAGES OR WARNINGS SHOWN BY WORD

# GET FILE NAME AND NORMALIZED PATH
filename = os.path.basename(pdf_path)
in_file = os.path.abspath(pdf_path)

# CONVERT PDF TO DOCX AND SAVE IT ON THE OUTPUT PATH WITH THE SAME INPUT FILE NAME
wb = word.Documents.Open(in_file)
out_file = os.path.abspath(os.path.join(output_path, filename[0:-4] + ".docx"))
wb.SaveAs2(out_file, FileFormat=16)
wb.Close()
word.Quit()
answered Aug 4, 2021 at 12:33
Jonny_P
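The Word-automation snippets above build the output name by slicing the last four characters off the file name, which quietly assumes a ".pdf" extension of exactly that length. A slightly more defensive sketch is shown below; the helper name docx_path_for is made up for this illustration. It uses ntpath, the Windows flavour of os.path from the standard library, so the example behaves the same on any operating system (on Windows, os.path is ntpath).

```python
import ntpath


def docx_path_for(pdf_path, output_dir):
    """Build the .docx output path for a Windows-style PDF path."""
    # splitext removes the extension whatever its length, unlike [0:-4].
    stem, _ext = ntpath.splitext(ntpath.basename(pdf_path))
    return ntpath.join(output_dir, stem + ".docx")
```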
With Adobe on your machine
If you have Adobe Acrobat on your machine, you can use the following function to save the PDF file as a docx file:
# Open PDF file, use Acrobat Exchange to save file as .docx file.
import os
import winerror
import win32com.client
import win32com.client.makepy
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

def PDF_to_Word(input_file, output_file):
    ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
    src = os.path.abspath(input_file)
    # Launch Adobe
    win32com.client.makepy.GenerateFromTypeLibSpec('Acrobat')
    adobe = win32com.client.DispatchEx('AcroExch.App')
    avDoc = win32com.client.DispatchEx('AcroExch.AVDoc')
    # Open file
    avDoc.Open(src, src)
    pdDoc = avDoc.GetPDDoc()
    jObject = pdDoc.GetJSObject()
    # Save as Word document
    jObject.SaveAs(output_file, "com.adobe.acrobat.docx")
    avDoc.Close(-1)
Be mindful that input_file and output_file need to be given as follows:
- D:\OneDrive\…\file.pdf
- D:\OneDrive\…\dafad.docx
answered Sep 17, 2022 at 14:27
rsc05
For Linux users with LibreOffice installed, try:
soffice --invisible --convert-to doc file_name.pdf
If you get an error like “Error: no export filter found, aborting”, try this:
soffice --infilter="writer_pdf_import" --convert-to doc file_name.pdf
answered Nov 17, 2022 at 11:28
el2e10
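The LibreOffice commands from the answers above can also be driven from Python without going through a shell. The sketch below is one possible way to do it, assuming soffice is on the PATH; the helper names build_convert_command and convert_with_libreoffice are made up for this illustration. Passing the arguments as a list avoids quoting problems with file names that contain spaces or quotes, and --outdir controls where the result is written.

```python
import subprocess
from pathlib import Path


def build_convert_command(pdf_path, out_dir, target_format="doc"):
    """Return the soffice argument list for converting one PDF."""
    return [
        "soffice",
        "--headless",
        "--infilter=writer_pdf_import",  # open the PDF with the Writer import filter
        "--convert-to", target_format,
        "--outdir", str(out_dir),
        str(pdf_path),
    ]


def convert_with_libreoffice(pdf_path, out_dir, target_format="doc"):
    """Run the conversion and return the expected output path."""
    command = build_convert_command(pdf_path, out_dir, target_format)
    subprocess.run(command, check=True)
    return Path(out_dir) / (Path(pdf_path).stem + "." + target_format)
```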
Tonight's entertainment is to show you how the pdf2docx library can be used to convert PDF files to the docx format.
Our task is to develop a Python module that converts one or several PDF files located in one folder, in the form of a lightweight command-line utility that does not rely on any external tools outside the Python ecosystem.
pdf2docx is a Python library that extracts data from PDF with PyMuPDF, analyzes the layout with rules, and creates a docx file with python-docx. python-docx is another library, used by pdf2docx to create and update Microsoft Word (.docx) files.
In short, let's begin:
$ pip install pdf2docx==0.5.1
Import the libraries we need:
# Import libraries
from pdf2docx import parse
from typing import Tuple
Define the function responsible for converting PDF to Docx:
def convert_pdf2docx(input_file: str, output_file: str, pages: Tuple = None):
    """Converts PDF to DOCX"""
    if pages:
        pages = [int(i) for i in list(pages) if str(i).isnumeric()]
    result = parse(pdf_file=input_file,
                   docx_with_path=output_file, pages=pages)
    summary = {
        "Source file": input_file, "Pages": str(pages), "Output file": output_file
    }
    # Print the summary
    print("#### Report ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################")
    return result
The convert_pdf2docx() function lets you specify the pages to convert; it converts the PDF file into a Docx file and prints a report of its work at the end.
Let's write a wrapper that calls this function:
if __name__ == "__main__":
    import sys
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    convert_pdf2docx(input_file, output_file)
We simply use Python's built-in sys module to get the input and output file names from the command-line arguments. Let's try to convert a sample PDF file (the sample used can be downloaded here):
$ python convert_pdf2docx.py Anketa_0.pdf Anketa_0.docx
A new file, Anketa_0.docx, will appear in the current directory, and the output will look like this:
Parsing Page 1: 1/3...
Parsing Page 2: 2/3...
Parsing Page 3: 3/3...
Creating Page 1: 1/3...
Creating Page 2: 2/3...
Creating Page 3: 3/3...
--------------------------------------------------
Terminated in 0.9915917019999999s.
#### Report ########################################################
Source file:Anketa_0.pdf
Pages:None
Output file:Anketa_0.docx
###################################################################
You can selectively specify the pages you need in the convert_pdf2docx() function.
I hope you enjoyed this short and simple lesson and that this converter will be useful to you.
Based on material from How to Convert PDF to Docx in Python
How to convert PDF files to doc with Python, published by К ВВ, license: Creative Commons Attribution-NonCommercial 4.0 International.
Abdou Rockikz · 3 min read · Updated Jul 2022 · PDF File Handling
In this tutorial, we will dive into how we can use the pdf2docx library to convert PDF files to the docx format.
The goal of this tutorial is to develop a lightweight command-line utility that converts one PDF file, or a collection of PDF files located within a folder, using Python modules only and without relying on external utilities outside the Python ecosystem.
pdf2docx is a Python library that extracts data from PDF with PyMuPDF, parses the layout with rules, and generates a docx file with python-docx. python-docx is another library, used by pdf2docx for creating and updating Microsoft Word (.docx) files.
Let’s install the requirement:
$ pip install pdf2docx==0.5.1
Let’s start by importing the modules:
# Import Libraries
from pdf2docx import parse
from typing import Tuple
Let’s define the function responsible for converting PDF to Docx:
def convert_pdf2docx(input_file: str, output_file: str, pages: Tuple = None):
    """Converts pdf to docx"""
    if pages:
        pages = [int(i) for i in list(pages) if str(i).isnumeric()]
    result = parse(pdf_file=input_file,
                   docx_with_path=output_file, pages=pages)
    summary = {
        "File": input_file, "Pages": str(pages), "Output File": output_file
    }
    # Printing Summary
    print("## Summary ########################################################")
    print("\n".join("{}:{}".format(i, j) for i, j in summary.items()))
    print("###################################################################")
    return result
The convert_pdf2docx() function converts a PDF file into a Docx file, optionally restricted to the pages you specify, and prints a summary of the conversion at the end.
Let’s use it now:
if __name__ == "__main__":
    import sys
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    convert_pdf2docx(input_file, output_file)
We simply use Python’s built-in sys module to get the input and output file names from command-line arguments. Let’s try to convert a sample PDF file (get it here):
$ python convert_pdf2docx.py letter.pdf letter.docx
A new letter.docx file will appear in the current directory, and the output will look like this:
Parsing Page 1: 1/1...
Creating Page 1: 1/1...
--------------------------------------------------
Terminated in 0.10869679999999998s.
## Summary ########################################################
File:letter.pdf
Pages:None
Output File:letter.docx
###################################################################
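The script above handles a single file, while the stated goal also mentions a collection of PDFs in a folder. One way this could be sketched is shown below, using the Converter class that pdf2docx documents in its quickstart instead of parse. The helper names plan_conversions and convert_folder are made up for this illustration, and only PDFs directly inside the folder are considered.

```python
from pathlib import Path


def plan_conversions(folder, out_folder=None):
    """Return (pdf_path, docx_path) pairs for every PDF directly inside folder."""
    folder = Path(folder)
    out_folder = Path(out_folder) if out_folder else folder
    return [(pdf, out_folder / (pdf.stem + ".docx"))
            for pdf in sorted(folder.glob("*.pdf"))]


def convert_folder(folder, out_folder=None):
    """Convert every PDF in folder to a docx file with the same base name."""
    # Imported lazily so plan_conversions works without pdf2docx installed.
    from pdf2docx import Converter

    for pdf, docx in plan_conversions(folder, out_folder):
        docx.parent.mkdir(parents=True, exist_ok=True)
        cv = Converter(str(pdf))
        try:
            cv.convert(str(docx))
        finally:
            cv.close()  # release the PDF even if the conversion fails
```

Calling convert_folder("invoices") would then write one .docx next to each PDF in a hypothetical invoices folder.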
You can also specify the pages you want in the convert_pdf2docx() function.
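As a rough illustration of page selection from the command line, the sketch below turns a string such as "0,2,5-7" into a list of page indexes. The helper parse_page_spec and the string format are assumptions made for this example. To the best of our knowledge, pdf2docx counts pages from zero by default, so "0" refers to the first page; check this against the version you have installed.

```python
def parse_page_spec(spec):
    """Turn a string such as "0,2,5-7" into a sorted list of page indexes."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # An inclusive range such as "5-7".
            first, last = part.split("-", 1)
            pages.update(range(int(first), int(last) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

The resulting list can be handed to pdf2docx through the pages argument of parse, for example parse("letter.pdf", "letter.docx", pages=parse_page_spec("0,2")).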
I hope you enjoyed this short tutorial and you found this converter useful.
Project description
- Extract data from PDF with PyMuPDF, e.g. text, images and drawings
- Parse layout with rule, e.g. sections, paragraphs, images and tables
- Generate docx with python-docx
Features
- Parse and re-create page layout
  - page margin
  - section and column (1 or 2 columns only)
  - page header and footer [TODO]
- Parse and re-create paragraph
  - OCR text [TODO]
  - text in horizontal/vertical direction: from left to right, from bottom to top
  - font style, e.g. font name, size, weight, italic and color
  - text format, e.g. highlight, underline, strike-through
  - list style [TODO]
  - external hyper link
  - paragraph horizontal alignment (left/right/center/justify) and vertical spacing
- Parse and re-create image
  - in-line image
  - image in Gray/RGB/CMYK mode
  - transparent image
  - floating image, i.e. picture behind text
- Parse and re-create table
  - border style, e.g. width, color
  - shading style, i.e. background color
  - merged cells
  - vertical direction cell
  - table with partly hidden borders
  - nested tables
- Parsing pages with multi-processing
It can also be used as a tool to extract table contents, since both table content and format/style are parsed.
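A minimal sketch of table extraction follows, based on the Converter.extract_tables call described in the pdf2docx documentation. The wrapper names extract_tables and tables_to_csv, and the choice to separate tables with a blank row in the CSV file, are assumptions made for this illustration; each table is assumed to be a list of rows, and each row a list of cell texts.

```python
import csv


def tables_to_csv(tables, csv_path):
    """Write extracted tables to one CSV file, separated by blank rows."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for index, table in enumerate(tables):
            if index:
                writer.writerow([])  # blank row between tables
            for row in table:
                # Empty cells may come back as None; write them as "".
                writer.writerow(["" if cell is None else cell for cell in row])


def extract_tables(pdf_path, start=0, end=None):
    """Return the tables found in the given page range of a PDF."""
    # Imported lazily so tables_to_csv works without pdf2docx installed.
    from pdf2docx import Converter

    cv = Converter(pdf_path)
    try:
        return cv.extract_tables(start=start, end=end)
    finally:
        cv.close()
```

For example, tables_to_csv(extract_tables("sample.pdf", end=1), "tables.csv") would save the tables from the first page of a hypothetical sample.pdf.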
Limitations
- Text-based PDF file
- Left to right language
- Normal reading direction, no word transformation / rotation
- Rule-based method can’t 100% convert the PDF layout
Documentation
- Installation
- Quickstart
- Convert PDF
- Extract table
- Command Line Interface
- Graphic User Interface
- Technical Documentation (In Chinese)
- API Documentation