Has anyone already written code for converting doc or docx to PDF using the Word 2007 or Word 2010 PDF export capabilities?
asked Mar 30, 2011 at 9:50
I haven’t so far, but it shouldn’t be difficult:
- Create a Word COM server object using CreateOLEObject('Word.Application')
- Open the document using Documents.Open
- Export the document to PDF using ExportAsFixedFormat
Here’s a basic skeleton:
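uses
  ComObj;

const
  wdExportFormatPDF = 17;

var
  Word, Doc: OleVariant;
begin
  Word := CreateOLEObject('Word.Application');
  try
    Doc := Word.Documents.Open('C:\Document.docx');
    Doc.ExportAsFixedFormat('C:\Document.pdf', wdExportFormatPDF);
    Doc.Close(False);  // close without saving changes
  finally
    Word.Quit;  // otherwise WINWORD.EXE keeps running in the background
  end;
end;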
Note that I've declared both the Word and Doc variables as OleVariants, so as to be version-independent (i.e. this code will work whether you're using Word 2007 or 2010). You could also use the VCL Office component libraries if you wanted. If you were to do a lot of processing in the document itself, the component libraries would definitely be faster.
answered Mar 30, 2011 at 10:17
Martijn
I do it with the following .vbs script. If you need it in Delphi code then it would be easy enough to convert:
Const wdDoNotSaveChanges = 0
Const wdRevisionsViewFinal = 0
Const wdFormatPDF = 17

Dim arguments
Set arguments = WScript.Arguments

Function DOC2PDF(sDocFile)
    Dim fso      ' As FileSystemObject
    Dim wdo      ' As Word.Application
    Dim wdoc     ' As Word.Document
    Dim wdocs    ' As Word.Documents
    Dim wview    ' As Word.View
    Dim sPdfFile

    Set fso = CreateObject("Scripting.FileSystemObject")
    sDocFile = fso.GetAbsolutePathName(sDocFile)
    ' Write the PDF next to the source document, with the same base name
    sPdfFile = fso.GetParentFolderName(sDocFile) + "\" + fso.GetBaseName(sDocFile) + ".pdf"

    Set wdo = CreateObject("Word.Application")
    Set wdocs = wdo.Documents

    WScript.Echo "Opening: " + sDocFile
    Set wdoc = wdocs.Open(sDocFile)

    If fso.FileExists(sPdfFile) Then
        fso.DeleteFile sPdfFile, True
    End If

    WScript.Echo "Converting to PDF: " + sPdfFile
    ' Hide tracked changes and comments so the PDF shows the final text
    Set wview = wdoc.ActiveWindow.View
    wview.ShowRevisionsAndComments = False
    wview.RevisionsView = wdRevisionsViewFinal
    wdoc.SaveAs sPdfFile, wdFormatPDF
    WScript.Echo "Conversion completed"

    wdo.Quit wdDoNotSaveChanges
    Set fso = Nothing
    Set wdo = Nothing
End Function

If arguments.Count = 1 Then
    Call DOC2PDF(arguments.Unnamed.Item(0))
Else
    WScript.Echo "Generates a PDF file from a Word document using Word PDF export."
    WScript.Echo ""
    WScript.Echo "Usage: doc2pdf.vbs <doc-file>"
    WScript.Echo ""
End If
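The output path in the script is derived from the input path: same folder, same base name, .pdf extension. For anyone porting the script, here is a minimal Python sketch of just that rule. The function name pdf_path_for is made up for illustration, and PureWindowsPath is used so the example behaves the same on any OS. Unlike the script, it does not resolve a relative path to an absolute one first.

```python
from pathlib import PureWindowsPath


def pdf_path_for(doc_file: str) -> str:
    """Return the PDF path next to doc_file (same folder, same base name)."""
    # with_suffix replaces only the final extension, like GetBaseName + ".pdf"
    return str(PureWindowsPath(doc_file).with_suffix(".pdf"))


if __name__ == "__main__":
    print(pdf_path_for(r"C:\Docs\Report.docx"))
```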
answered Mar 30, 2011 at 10:17
David Heffernan
DocTo
Document Converter
Simple utility for converting Microsoft Word (.doc), Microsoft Excel (.xls) and Microsoft Powerpoint (.ppt) files to any other supported format, such as .txt, .csv, .rtf or .pdf.
Can also be used to convert .txt, .rtf, .csv to .doc, .xls or .pdf format.
Can be used to convert older word documents to latest format.
Must have Microsoft Word, Excel or Powerpoint installed on host machine.
Download Release From Github Releases — https://github.com/tobya/DocTo/releases/
Further Information available at https://tobya.github.io/DocTo/
Further Examples available at https://docto.toflidium.com
Features
- Convert Doc/RTF/Text file to any Word SaveAs Type Doc/Text/RTF/PDF
- Convert XLS/XLSX/CSV file to any Excel SaveAs Type CSV/Text/PDF
- Convert Text/CSV file to full fledged Word or Excel format.
- Single File Conversion
- Multiple / Directory File Conversion.
- Delete after conversion
- Fire https Webhook on each conversion.
Examples
More Examples available at
- View Examples
- https://docto.toflidium.com/
- Wiki
- All Parameters Explained
Installation
Download .exe from Release https://github.com/tobya/docTo/releases
Package Managers
Choco
Also Available for installation via Chocolatey
choco install docto
To upgrade to the latest version before it is generally available (replace with the current version):
choco upgrade docto --version=1.8
Node
Node wrappers have been created by @KerimG & @brrd
https://www.npmjs.com/package/node-docto
https://github.com/brrd/msoconvert
Bugs and Features
Please log an issue for any bugs, features or suggestions.
Examples
Single
Convert Microsoft Word Document to text
docto -f C:\Directory\MyFile.doc -O "C:\Output Directory\MyTextFile.txt" -T wdFormatText
Convert Microsoft Excel Document to csv text
docto -XL -f C:\Directory\MyFile.xls -O "C:\Output Directory\MyTextFile.csv" -T xlCSV
Convert Microsoft Word Document to PDF (requires version of Microsoft Word that supports this).
docto -f C:\Directory\MyFile.doc -O "C:\Output Directory\MyTextFile.pdf" -T wdFormatPDF
Multiple Files and Folders
Convert All Microsoft Word Documents in Directory and its Sub Directories to PDF
docto -f "C:\Dir with Spaces\FilesToConvert" -O "C:\DirToOutput" -T wdFormatPDF -OX .pdf
Delete Original File after Conversion
Delete Original Files after conversion (-R).
docto -f "C:\Dir with Spaces\FilesToConvert" -O "C:\DirToOutput" -T wdFormatPDF -OX .pdf -R true
Webhooks
Add a Webhook to fire on each conversion (-W)
docto -f "C:\Dir with Spaces\FilesToConvert" -O "C:\DirToOutput" -T wdFormatPDF -OX .pdf -W https://toflidium.com/webhooks/docto/webhook_test.php
A Webhook is a URL that can be called on each conversion to give you the ability to respond externally whenever a file is converted. Support for https addresses is currently experimental, so log an issue if you run into problems.
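When DocTo is driven from a script, it can help to assemble the command as an argument list so that paths with spaces need no manual quoting. The following is a minimal Python sketch, not part of DocTo itself: build_docto_args is a hypothetical helper, and it uses only the -f, -O, -T, -OX, -R and -W switches shown in the examples above.

```python
from typing import List, Optional


def build_docto_args(
    input_path: str,
    output_path: str,
    fmt: str = "wdFormatPDF",
    output_ext: Optional[str] = None,
    delete_originals: bool = False,
    webhook_url: Optional[str] = None,
) -> List[str]:
    """Assemble a docto command line as an argument list."""
    args = ["docto", "-f", input_path, "-O", output_path, "-T", fmt]
    if output_ext:
        args += ["-OX", output_ext]       # e.g. ".pdf", including the dot
    if delete_originals:
        args += ["-R", "true"]            # -R expects an explicit value
    if webhook_url:
        args += ["-W", webhook_url]
    return args


if __name__ == "__main__":
    cmd = build_docto_args(
        r"C:\Dir with Spaces\FilesToConvert", r"C:\DirToOutput", output_ext=".pdf"
    )
    print(cmd)
    # On a Windows machine with DocTo and Word installed, the list could be run with:
    #   import subprocess; subprocess.run(cmd, check=True)
```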
Use in the Wild
If you are using DocTo in the wild somewhere, please add details to this wiki page
OneDrive Conversion
If you need to upgrade a bunch of files to work without conversion on OneDrive /Office365 / Word 20XX then you can use DocTo.
See this StackExchange question
https://webapps.stackexchange.com/questions/74859/what-format-does-word-online-use
Command Line Help
Help
Docto Version:%s
Office Version : %s
Open Source: https://github.com/tobya/DocTo/
Description: DocTo converts Word Documents and Excel Spreadsheets to other formats.
Command Line Parameters:
Each Parameter should be followed by its value eg
-f "c:\Docs\MyDoc.doc"
Parameters markers are case insensitive.
-H This message
--HELP -?
-WD Use Word for Converstion (Default). Help '-h -wd'
--word
-XL Use Excel for Conversion. Help '-h -xl'
--excel
-PP Use Powerpoint for Conversion. help '-h -pp'
--powerpoint
-VS Use Visio for Conversion.
--visio
-F Input File or Directory
--inputfile
-FX Input file search for if -f is directory. Can use .rtf test*.txt etc
Default ".doc*" (will find ".docx" also)
--inputextension
-O Output File or Directory to place converted Docs
--outputfile
-OX Output Extension if -F is Directory. Please include '.' eg. '.pdf' .
If not provided, pulled from standard list.
--outputextension
-T Format(Type) to convert file to, either integer or wdSaveFormat constant.
Available from
https://docs.microsoft.com/en-us/dotnet/api/microsoft.office.interop.word.wdsaveformat
or https://docs.microsoft.com/en-us/dotnet/api/microsoft.office.interop.excel.xlfileformat
or https://docs.microsoft.com/en-us/office/vba/api/powerpoint.presentation.saveas
See current List Below.
--format
-TF Force Format. -T value if an integer, is checked against current list
compiled in. It is not passed if unavailable. -TF will pass through value
without checking. Word will return an "EOleException Value out of range"
error if invalid. Use instead of -T.
--forceformat
-L Log Level Integer: 1 ERRORS 2 STANDARD 5 CHATTY 9 DEBUG 10 VERBOSE. Default: 2=STANDARD
--loglevel
-C Compatibility Mode Integer. Set to an INTEGER value from
https://msdn.microsoft.com/en-us/library/office/ff192388.aspx.
Set the compatibility mode when you want to convert documents to a later
version of word. See help '-h -c' for further info.
--compatibility
-E Encoding Integer: Sets codepage Encoding. See
https://msdn.microsoft.com/en-us/library/office/ff860880.aspx
for more details and values.
--encoding
-M Ignore all files in __MACOSX subdirectory if it exists. Default True.
--ignoremacos
-N Make list of files that take over n seconds to complete.
Use number of seconds over that conversion takes and add to list.
Outputs to filename 'docto.ignore.txt'
--listlongrunning
-NX Ignore any file listed in docto.ignore.txt, created by -N
--ignorelongrunninglist
-G Write Log to file in directory
--writelogfile
-GL Log File Name to Use. Default 'DocTo.Log';
--logfilename
-Q Quiet Mode: Nothing will be output to console. To see any errors you must
set -G or -GL. Equivalent to setting -L 0
--quiet
-R Remove Files after successful conversion: Default false; To use specify
value eg -R true
--deletefiles
-W Webhook: Url to call on events. See help '-H -HW' for more details.
--webhook
-X Halt on COM Error: Default True; If you have trouble with some files
not converting, set this to false to ignore errors and continue with
batch job.
--halterror
-V Show Versions. DocTo and Word/Excel/Powerpoint
Long Parameters:
--BookmarkSource
PDF conversions can take their bookmarks from
WordBookmarks, WordHeadings (default) or None
--DoNotOverwrite
--no-overwrite
Existing files are overridden by default, if you do not wish a file to be
over written use this option.
--no-subdirs Only convert specified directory. Do not recurse sub directories
--ExportMarkup Value for wdExportItem - default wdExportDocumentContent.
use wdExportDocumentWithMarkup to export all word comments with pdf
--no-IncludeDocProperties
--no-DocProp
Do not include Document Properties in the exported pdf file.
--PDF-OpenAfterExport
If you wish for a converted PDF to be opened after creation. No value req.
--PDF-FromPage
Save a range of pages to pdf. Integer/String. If integer --PDF-ToPage must also be set.
Other values wdExportCurrentPage, wdExportSelection
--PDF-ToPage
Save a range of pages to pdf. Integer. --PDF-FromPage must also be set.
--PDF-OptimizeFor
Set the pdf/xps to be optimized for print or screen.
Default ForPrint | ForOnScreen
--XPS-no-IRM
Do not copy IRM permissions to exported XPS document.
--PDF-No-DocStructureTags
Do not include DocStructureTags to help screen readers.
--PDF-no-BitmapMissingFonts
Do not bitmap missing fonts, fonts will be substituted.
--use-ISO190051
Create PDF to the ISO 19005-1 standard.
Experimental:
--skipdocswithtoc
EXPERIMENTAL. Will skip any docs that contain a TOC to prevent hanging.
Currently matches some false positives. Default False.
--stdout
Send file to Stdout after conversion. ( Does not work correctly for binary files)
ERROR CODES:
200 : Invalid File Format specified
201 : Insufficient Inputs. Minimum of Input File, Output File & Type
202 : Incorrect switches. Switch requires value
203 : Unknown switch in command
204 : Input File does not exist
205 : Invalid Parameter Value
220 : Word or COM Error
221 : Word not Installed
400 : Unknown Error
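A wrapper script could translate these codes into readable messages. The sketch below is in Python and rests on two assumptions that the help text does not state: that DocTo reports these values as the process exit code, and that 0 means success. The names DOCTO_ERRORS and describe_docto_exit are hypothetical.

```python
# Messages copied from the ERROR CODES list in the help text above.
DOCTO_ERRORS = {
    200: "Invalid File Format specified",
    201: "Insufficient Inputs. Minimum of Input File, Output File & Type",
    202: "Incorrect switches. Switch requires value",
    203: "Unknown switch in command",
    204: "Input File does not exist",
    205: "Invalid Parameter Value",
    220: "Word or COM Error",
    221: "Word not Installed",
    400: "Unknown Error",
}


def describe_docto_exit(code: int) -> str:
    """Map an exit code to a message (assumes 0 means success)."""
    if code == 0:
        return "OK"
    return DOCTO_ERRORS.get(code, f"Unrecognised code {code}")


if __name__ == "__main__":
    print(describe_docto_exit(204))
```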
Parameter Overview
Usage
3 Parameters are required
- -F Input File Name
- -O Output File Name
- -T Type to be converted to.
Parameters that take a value have a space separating them from the value. Some parameters do
not require a value. All parameters are case insensitive.
Input File or Directory
-F --inputfile
The file or folder you wish docto to open. If it is a folder, docto will load all files in that
directory and its subdirectories. If you do not wish to load files from subdirectories see the --no-subdirs
parameter.
Conversion will be performed on each file in turn.
Output File or Folder
-O --outputfile
The filename or foldername where you would like the output files to be placed. If Input is a file but
output is a folder then the output file will have the same name as the input but with the new extension.
Conversion Type
-T --format
Specify the format you wish to convert to, such as wdFormatPDF or wdFormatText.
See the lists of possible Word formats and Excel formats. The integer value can also be used.
Help
-H --help
Display the help text listing all parameters and versions of docto and office applications
Version
-V --version
Display the version string of both DocTo and Microsoft Office.
Application Selection
-WD -XL -PP -VS
This parameter tells DocTo which application to use to load and save your document.
For historical reasons DocTo defaults to -WD if no value is given; however, it is a good habit
to always specify one of these values whenever you use DocTo.
- -WD Microsoft Word
- -XL Microsoft Excel
- -PP Microsoft Powerpoint
- -VS Microsoft Visio
Input Folder Extension
-FX --inputextension
By default DocTo will load all files in the directory with the standard application extension, e.g.
- Word (.doc) matches .doc & .docx files
- Excel (.xls) matches .xls & .xlsx files
- Powerpoint (.ppt) matches .ppt & .pptx files
- Visio (.vsd)
If you wish to convert a different set of files, e.g. *.rtf or *.txt, you can specify the extension here, such as .rtf
Output Extension
-OX --outputextension
The output extension on a conversion is pulled from a standard list, e.g. if converting to wdFormatPDF the file
will be output with the extension .pdf. If you would like to specify your own extension (such as .pdfx), you can
do so with this parameter.
Force Format Use
-TF --forceformat
If -T is given an integer value that was not in the list available when DocTo was compiled, DocTo will raise an error.
If you use -TF instead, the integer value is passed to the Office application without checking.
Logging
-L --loglevel
Set the level of log output. -l 10 is useful for debugging. Use -l 0 or -Q to suppress logging.
Levels
- 10 VERBOSE
- 9 CHATTY
- 5 STANDARD
- 1 ERRORS (default)
- 0 SILENT
Document Compatibility
-C --compatibility
Compatibility mode integer. Set to an integer value from the MSDN list.
Sets the compatibility mode of the version of Word that the document is to be compatible with. Particularly
useful when converting older documents to the current version. Can be used to convert old
Word documents to be compatible with OneDrive.
Document Encoding
-E --encoding
Sets the codepage encoding. See MSDN for more details and values.
List Long running Files
-N --ListLongRunning
Some files can cause a dialog box to pop up while being converted. This can only be fixed by
manual intervention. By setting this parameter you can at least record the documents that are
causing difficulty (to a file called docto.ignore.txt), and if you set -NX these documents will be
skipped on subsequent executions.
Skip Files in docto.ignore.txt file
-NX --IgnoreLongRunningList [no value required]
When set, any files listed in docto.ignore.txt in the same directory as DocTo.exe will be skipped.
This allows troublesome documents in a directory structure to be ignored.
Logging
Write to Log File
-G --writelogfile [no value required]
Write the log to a file as well as stdout. The file is docto.log by default.
Log File
-GL --logfilename {filename}
Specify the filename that you wish the logfile to be written to.
Quiet Mode
-Q --quiet [no value required]
No output to stdout. Everything, including errors, is suppressed. Use in conjunction with -G
to ensure you get errors.
Delete Input Files
-R --deletefiles {true|false}
If you would like for the inputfile to be deleted after conversion you can set this to true.
Fire a Webhook
-W --webhook
If you wish, you can call a web URL after each conversion or error.
The Webhook URL will be called on the following events with the following parameters:
- File Conversion
  - action=convert
  - type=wdFormatType (or int if no matching format type)
  - ouputfilename=File being written to.
  - inputfilename=File being converted.
- Error
  - action=error
  - type=wdFormatType (or int if no matching format type)
  - ouputfilename=File being written to.
  - inputfilename=File being converted.
  - error=Error Message
The return value is logged in the DocTo log.
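On the receiving side, a handler needs to pull these parameters out of the request. The Python sketch below assumes they arrive as a URL query string, which the text above does not actually confirm (they might be sent another way). The helper names and the example.com URL are hypothetical, and the parameter name ouputfilename is spelled as listed above.

```python
from urllib.parse import parse_qs, urlsplit


def parse_docto_webhook(url: str) -> dict:
    """Extract webhook parameters from a request URL (query string assumed)."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; keep the first value of each
    return {key: values[0] for key, values in parse_qs(query).items()}


def is_error_event(params: dict) -> bool:
    """True when the webhook reports a failed conversion."""
    return params.get("action") == "error"


if __name__ == "__main__":
    example = (
        "https://example.com/hook?action=convert&type=wdFormatPDF"
        "&ouputfilename=C%3A%5Cout%5Ca.pdf&inputfilename=C%3A%5Cin%5Ca.doc"
    )
    params = parse_docto_webhook(example)
    print(params["action"], params["inputfilename"])
```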
Halt on Errors
-X --halterror {true|false}
By default DocTo will halt when a COM error is raised. If you wish to ignore the error and continue, set this value
to false.
Bookmark Source
--BookmarkSource {source}
PDF conversions can take their bookmarks from WordBookmarks, WordHeadings (default) or None
Overwrite Files
--DoNotOverwrite --no-overwrite [no value required]
Existing files are overwritten by default. If you do not wish a file to be overwritten,
use this option.
Recurse SubDirectories
--no-subdirs
By default, subdirectories are converted. Use this option to convert only the specified directory, without recursing into subdirectories.
Export Markup
--ExportMarkup
Specifies what to export:
- wdExportDocumentContent Exports the document without markup (default).
- wdExportDocumentWithMarkup Exports the document with markup.
Use wdExportDocumentWithMarkup to export all Word comments with the PDF.
Open after Export
--PDF-OpenAfterExport
If you wish for the converted PDF to be opened after creation. No value req.
Convert Specific Pages
--PDF-FromPage
--PDF-ToPage
Only convert certain pages in the document.
Use ISO19005-1
--use-ISO190051
Create PDF to the ISO 19005-1 standard, also known as PDF/A or PDF Archive.
Special Case Parameters
Do not ignore __MACOSX Directory
-M --ignoreMACOS {true|false}
By default DocTo ignores any files in a hidden __MACOSX directory that macOS creates. This directory is often
present on an external disk that is shared between systems. If you wish to check this directory, set this value. You must specify a value, e.g. -M false.
Compiling
The project compiles with Delphi (I use 10.3, but it should compile with most versions including XE4 & 7). The project will not compile on Linux because it uses several Windows-only components such as COM. Word and Excel do not have Linux versions anyway, so there would be no point.
XLSTo
XLSTo is now incorporated into DocTo. Previously XLSTo was a separate EXE that was used to convert xls files to csv or pdf. This can now be done with the main DocTo.exe by simply adding the -XL flag.
Get Involved
I am happy to accept any PR anyone might like to submit. If a large amount of work is involved, please open an issue first to ensure the effort won't be wasted.
The main branch name in the repo is DocTo.
I found a solution to my issue and after a request, will post it here to help others. Apologies if I missed any details, it’s been a while since I worked on this solution.
The first thing that is required is to install Openoffice.org on the server. I requested my hosting provider to install the open office RPM on my VPS. This can be done through WHM directly.
Now that the server has the capability to handle MS Office files you are able to convert the files by executing command line instructions via PHP. To handle this, I found PyODConverter: https://github.com/mirkonasato/pyodconverter
I created a directory on the server and placed the PyODConverter Python file within it. I also created a plain text file above the web root (I named it "adocpdf"), with the following command line instructions in it:
#!/bin/sh
directory=$1
filename=$2
extension=$3
SERVICE='soffice'

# Start a headless OpenOffice.org listener if one is not already running
if [ "`ps ax|grep -v grep|grep -c $SERVICE`" -lt 1 ]; then
    unset DISPLAY
    /usr/bin/soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard &
    sleep 5s
fi

# Convert the uploaded file to a PDF in the same directory
python /home/website/python/DocumentConverter.py "/home/website/$directory$filename$extension" "/home/website/$directory$filename.pdf"
This checks whether the OpenOffice.org service (soffice) is running, starts it if it is not, and then calls the PyODConverter script to process the file and output it as a PDF. The three variables on the first three lines are provided when the script is executed from within a PHP file. The delay ("sleep 5s") ensures that OpenOffice.org has enough time to start if required. I have used this for months now and the 5s gap seems to give enough breathing room.
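As an alternative to a fixed delay, the wait could poll the listener port until it accepts connections. This is a minimal Python sketch of that idea, assuming the 127.0.0.1:8100 socket from the script above. The helper wait_for_port is hypothetical and is not part of PyODConverter.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0, interval: float = 0.25) -> bool:
    """Poll until host:port accepts a TCP connection or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the service is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling wait_for_port("127.0.0.1", 8100) before running the converter would return as soon as the listener is ready, instead of always waiting five seconds.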
The script will create a PDF version of the document in the same directory as the original.
Finally, initiating the conversion of a Word / Excel file from within PHP (I have it within a function that checks if the file we are dealing with is a word / excel document)…
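// use openoffice.org
$output = array();
$return_var = 0;
// escapeshellarg() guards against spaces and shell metacharacters in uploaded file names
$cmd = sprintf('/opt/adocpdf %s %s %s', escapeshellarg($directory), escapeshellarg($filename), escapeshellarg($extension));
exec($cmd, $output, $return_var);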
This PHP function is called once the Word / Excel file has been uploaded to the server. The 3 variables in the exec() call relate directly to the 3 at the start of the plain text script above. Note that the $directory variable requires no leading forward slash if the file for conversion is within the web root.
OK, that’s it! Hopefully this will be useful to someone and save them the difficulties and learning curve I faced.