I run a job search site, and I need to convert DOC, DOCX and PDF files into HTML on a Linux CentOS server running PHP. People submit these files as resumes. So far, I have found PHPDocx to be great at converting DOCX to HTML, but I am stuck on DOC and PDF. PDFTOHTML gives a "bad color" error when I run tests. As for DOC, I have only found wvWare, which seems complex and bulky to install.
Does anyone have any ideas on how to easily convert DOC/PDF to HTML?
Nauphal
asked May 13, 2011 at 20:31
The only thing I can think of is FPDF.
It is intended for creating PDF files in PHP. On its own it does not read existing PDFs, but extensions such as FPDI can import pages from them.
Maybe you can use that as a base and develop some sort of toHTML function for it.
It is completely free to use and it already has some extensions.
It might help you.
http://www.fpdf.org
EDIT:
Thanks to Pierre for this addition in the comments:
You can use FPDI: http://www.setasign.de/products/pdf-php-solutions/fpdi but the input PDF is treated just like an image.
I haven't looked at it myself so far, but it might help.
answered Aug 20, 2013 at 7:28
Ch33f
As far as .doc files go, how about trying OpenOffice/LibreOffice? Something like:
lowriter --convert-to html doc_file.doc
As far as PDF goes, if the PDF is only a graphical representation of text (for example, a scanned page), then you're out of luck: the best you can do is try to convert it to an image with ImageMagick. If it contains proper text, it should convert easily.
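The LibreOffice command above can also be driven from a script. The following is only a minimal sketch, assuming the LibreOffice launcher is installed as `soffice` and is on the PATH; the function names `libreoffice_command` and `doc_to_html` are made up for illustration.

```python
import os
import shutil
import subprocess


def libreoffice_command(doc_path, out_dir):
    """Build the argument list for a headless LibreOffice HTML conversion."""
    # Assumption: the launcher is called "soffice"; some installs expose
    # "libreoffice" or "lowriter" instead.
    return ["soffice", "--headless", "--convert-to", "html",
            "--outdir", out_dir, doc_path]


def doc_to_html(doc_path, out_dir):
    """Convert doc_path to HTML inside out_dir and return the expected output path."""
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    # Arguments are passed as a list (no shell), so unusual file names are safe.
    subprocess.run(libreoffice_command(doc_path, out_dir), check=True, timeout=120)
    base = os.path.splitext(os.path.basename(doc_path))[0]
    return os.path.join(out_dir, base + ".html")
```

Splitting the command builder from the runner makes it possible to check the command line without LibreOffice installed. The timeout is a guess; uploaded resumes that hang the converter should not block the web request forever.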
answered Aug 22, 2013 at 15:45
To easily convert PDF to HTML, I would suggest pdf2htmlEX, which produces outstanding HTML and is fast enough for converting at runtime. You should first put some effort into building and optimizing it for your system. There is a simple build how-to on the project page.
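Once pdf2htmlEX is built, calling it at runtime could look roughly like this. It is a sketch under the assumption that the binary is on the PATH as `pdf2htmlEX`; the helper names are hypothetical.

```python
import os
import subprocess


def pdf2htmlex_command(pdf_path, out_dir):
    """Build the pdf2htmlEX argument list; the output keeps the PDF's base name."""
    out_name = os.path.splitext(os.path.basename(pdf_path))[0] + ".html"
    # --dest-dir selects the output directory; the last argument is the output file name.
    return ["pdf2htmlEX", "--dest-dir", out_dir, pdf_path, out_name]


def pdf_to_html(pdf_path, out_dir):
    """Run pdf2htmlEX and return the path of the generated HTML file."""
    cmd = pdf2htmlex_command(pdf_path, out_dir)
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return os.path.join(out_dir, cmd[-1])
```

The exit code is used to detect failure rather than stderr output, since command-line converters often print harmless warnings there.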
answered Aug 25, 2013 at 14:22
Breign
Python 3 AbiWord Script to Convert Word and PDF Documents to HTML Files on Linux and Windows
abiword -t output.html resume.doc; cat output.html
import os
import subprocess
import tempfile
import uuid

def document_to_html(file_path):
    # Convert to a temporary file with a random name.
    out_path = os.path.join(tempfile.gettempdir(), "%s.html" % uuid.uuid4())
    # Pass the arguments as a list (no shell) so file names cannot inject commands.
    p = subprocess.run(["abiword", "-t", out_path, file_path],
                       stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Output on stderr is not necessarily a failure, so check the exit code instead.
    if p.returncode != 0:
        raise Exception(p.stderr.decode(errors="replace"))
    with open(out_path, encoding="utf-8", errors="replace") as f:
        html = f.read()
    os.remove(out_path)
    return html
Hi, I’m Selva a full-time Blogger, YouTuber, Affiliate Marketer, & founder of Coding Deekshi. Here, I post about programming to help developers.
How do I convert MS Word to HTML on Linux?
I think OpenOffice can do that.
Convert a docx (OOXML) file to semantic HTML.
All of Word formatting nonsense is stripped away and
you’re left with a cleanly-formatted version of the content.
Usage
>>> from docx2html import convert
>>> html = convert('path/to/docx/file')
Running Tests for Development
$ virtualenv path/to/new/virtualenv
$ source path/to/new/virtualenv/bin/activate
$ cd path/to/workspace
$ git clone git://github.com/PolicyStat/docx2html.git
$ cd docx2html
$ pip install .
$ pip install -r test_requirements.txt
$ ./run_tests.sh
Description
docx2html is designed to take a docx file and extract the content out and
convert that content to html. It does not care about styles or fonts or
anything that changes how the content is displayed (with few exceptions). Below
is a list of what currently works:
- Paragraphs
  - Bold
  - Italics
  - Underline
  - Hyperlinks
- Lists
  - Nested lists
  - List styles (letters, roman numerals, etc.)
  - Tables
  - Paragraphs
- Tables
  - Rowspans
  - Colspans
  - Nested tables
  - Lists
- Images
  - Resizing
  - Converting to smaller formats (for bitmaps and tiffs)
  - There is a hook to allow setting the src of the image tag out of context, more on this later
- Headings
  - Simple headings
  - Root level lists that are upper case roman numerals get converted to h2 tags
Handling embedded images
docx2html allows you to specify how you would like to handle image uploading.
Note: This documentation sucks, so you might need to read the source.
For example, you might upload your images to Amazon S3. The handler below simply copies each image to /tmp:
import os.path
from shutil import copyfile
from docx2html import convert

def handle_image(image_id, relationship_dict):
    image_path = relationship_dict[image_id]
    # Now do something to the image. Let's move it somewhere.
    _, filename = os.path.split(image_path)
    destination_path = os.path.join('/tmp', filename)
    copyfile(image_path, destination_path)
    # Return the `src` attribute to be used in the img tag
    return 'file://%s' % destination_path

html = convert('path/to/docx/file', image_handler=handle_image)
Naming Conventions
There are two main naming conventions in the source for docx2html: build
functions, which return an etree element that represents HTML, and
get_content functions, which return string representations of HTML.
Changelog
- 0.2.3
  - There was a bug with hyperlinks that had a break tag in them. The document would fail to convert. This issue has been fixed.
- 0.2.2
  - There was a bug with hyperlinks that were missing text. The document would fail to convert. This issue has been fixed.
- 0.2.1
  - If a list had an inconsistency in the ilvls, the content for the inconsistent ilvl would be lost. Now we roll that inconsistent list into the root, no longer losing the content.
- 0.2.0
  - If a list had a numId that was not stored in the numbering dict, then a key error would be thrown. Now if either the numId or the ilvl for a given list tag is invalid it defaults to returning a list type of decimal.
- 0.1.11
  - Sometimes in the OOXML an image will have a height or width of 0. If this happens we are now ignoring the height and width in the OOXML and using the full image instead.
- 0.1.10
  - Added a user facing version
- 0.1.9
  - There was a problem for some lists that would cause missing content if the list id's were not well behaved. This issue has been addressed.
- 0.1.8
  - Fixed missing content with hyperlinks with more than one run tag and smartTags.
  - Certain image types are now being ignored. These include: emf, wmf and svg.
- 0.1.7
  - If the indentation level of a set of lists (with the same list id) were mangled (starting off with a higher indentation level followed by a lower) then the entire sub list (the list with the lower indentation level) would not be added to the root list. This would result in removing the mangled list from the final output. This issue has been addressed.
- 0.1.6
  - Header detection was relying on case. However it is possible for a lower case version of headers to show up. Those are now handled correctly.
- 0.1.4
  - Added a function to remove tags, in addition stripped 'sectPr' tags since they have to do with headers and footers.
- 0.1.3
  - Hyperlinks with no text no longer throw an error
  - Fixed a bug with determining the font size with an incomplete styles dict
- 0.1.2
  - Fixed a bug with determining the font size of a paragraph tag
- 0.1.1
  - Added a changelog
  - Styles are now stripped from hyperlinks
  - jinja2 is now used to render test xml
- 0.1.0
  - Correctly handle tables and paragraphs in lists. Before, if there was a table in a list it would break the list into two halves, the half before the table and the half after the table (with the table in between them). Now if there is a table or paragraph in a list those elements get rolled into the list.
Does anyone know of any software (preferably open source, for Linux or PHP) that can convert PDF and/or DOC/DOCX (and maybe other document formats too: RTF, TXT, etc.) to HTML?
I've got the "PDFtoHTML" software working, but it does not appear to convert DOC/DOCX files as well.
studiohack♦
asked Oct 26, 2011 at 19:16
You should give unoconv a spin. It should be able to convert anything that OpenOffice can read to anything it can write.
This works on doc/docx and a whole lot of other files. It does not seem to work on PDFs, so I guess you're stuck with using two separate programs for the job.
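If two programs are needed, the choice can be made by file extension. Here is a rough sketch of that idea, assuming `unoconv` handles the office formats and `pdftohtml` (from poppler-utils) handles PDF; the `CONVERTERS` table and the function names are invented for this example.

```python
import os


def _unoconv(path, out_path):
    # unoconv: -f selects the output format, -o the output file.
    return ["unoconv", "-f", "html", "-o", out_path, path]


def _pdftohtml(path, out_path):
    # pdftohtml: -noframes writes a single HTML file instead of a frameset.
    return ["pdftohtml", "-noframes", path, out_path]


# Hypothetical mapping from file extension to command builder.
CONVERTERS = {
    ".doc": _unoconv, ".docx": _unoconv, ".odt": _unoconv,
    ".rtf": _unoconv, ".txt": _unoconv,
    ".pdf": _pdftohtml,
}


def conversion_command(path, out_path):
    """Return the command line that converts path to HTML at out_path."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in CONVERTERS:
        raise ValueError("unsupported file type: %r" % ext)
    return CONVERTERS[ext](path, out_path)
```

The returned list can be handed to `subprocess.run(..., check=True)`. Rejecting unknown extensions up front keeps arbitrary uploads away from the converters.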
answered Oct 28, 2011 at 12:24
jpjacobs
I successfully put a portable version of LibreOffice on my host's web server, which I call from PHP to do a command-line conversion of .docx, etc. to PDF on the fly. I do not have admin rights on my host's web server. Here is my blog post about what I did:
Yay! Convert directly from .docx or .odt to .pdf using PHP with LibreOffice (OpenOffice's successor)!
Glorfindel
answered Nov 20, 2011 at 1:50
Have you tried PHPDocX? It allows you to do quite a few more things with docx files.
There is a generateXHTML method.
answered May 8, 2012 at 6:33