I’m wondering how you can convert Word .doc/.docx files to text files through Java. I understand that there’s an option where I can do this through Word itself but I would like to be able to do something like this:
java DocConvert somedocfile.doc converted.txt
Thanks.
asked Apr 25, 2010 at 20:55
Coding District
If you’re interested in a Java library that deals with Word document files, you might want to look at e.g. Apache POI. A quote from the website:
Why should I use Apache POI?
A major use of the Apache POI api is for Text Extraction applications such as web spiders, index builders, and content management systems.
P.S.: If, on the other hand, you’re simply looking for a conversion utility, Stack Overflow may not be the most appropriate place to ask for this.
Edit: If you don’t want to use an existing library but do all the hard work yourself, you’ll be glad to hear that Microsoft has published the required file format specifications. (The Microsoft Open Specification Promise lists the available specifications. Just google for any of them that you’re interested in. In your case, you’d need e.g. the OLE2 Compound File Format, the Word 97 binary file format, and the Open XML formats.)
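To give a feel for the do-it-yourself route on the Open XML side, here is a minimal sketch that uses only the JDK. It relies on the fact that a .docx file is a ZIP archive, and it assumes the main document part is stored at word/document.xml, which is the usual location but is strictly determined by the package relationships. The class name DocxRawText and its method names are made up for illustration. It collects the w:t text nodes and ends a line at each closing w:p; headers, footers, footnotes and the binary .doc format are not handled.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DocxRawText {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text of a .docx stream, one line per paragraph. */
    public static String extract(InputStream docx) throws IOException, XMLStreamException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main part lives at the conventional path.
                if (entry.getName().equals("word/document.xml")) {
                    return readParagraphs(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    private static String readParagraphs(InputStream xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Documents come from outside, so do not process DTDs.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        StringBuilder out = new StringBuilder();
        boolean inText = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && W_NS.equals(reader.getNamespaceURI())) {
                String name = reader.getLocalName();
                if (name.equals("t")) {
                    inText = true;            // w:t holds the visible text of a run
                } else if (name.equals("tab")) {
                    out.append('\t');
                } else if (name.equals("br")) {
                    out.append('\n');
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && W_NS.equals(reader.getNamespaceURI())) {
                String name = reader.getLocalName();
                if (name.equals("t")) {
                    inText = false;
                } else if (name.equals("p")) {
                    out.append('\n');         // end of paragraph
                }
            } else if (event == XMLStreamConstants.CHARACTERS && inText) {
                out.append(reader.getText());
            }
        }
        return out.toString();
    }
}
```

Calling DocxRawText.extract(new FileInputStream("somedocfile.docx")) and writing the result to a file would cover the simple .docx case; for anything beyond that, a library such as Apache POI is the more practical choice.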
answered Apr 25, 2010 at 20:59
Use the Apache Tika command-line utility. Tika supports a wide range of formats (e.g. doc, docx, pdf, html, rtf…).
java -jar tika-app-1.3.jar -t somedocfile.doc > converted.txt
Programmatically:
File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);
You can use Apache POI too. It provides text extractors for both .doc and .docx files (see its Text Extraction documentation). If you only want the plain text, you can use the code below. If you want to extract rich text (such as formatting and styling), you can use Apache Tika.
Extract doc:
InputStream fis = new FileInputStream(...);
POITextExtractor extractor;
if (fileName.toLowerCase().endsWith(".docx")) {
    // .docx (Office Open XML)
    XWPFDocument doc = new XWPFDocument(fis);
    extractor = new XWPFWordExtractor(doc);
} else {
    // .doc (binary OLE2 format)
    POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
    extractor = ExtractorFactory.createExtractor(fileSystem);
}
String extractedText = extractor.getText();
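The snippet above picks the extractor from the file name. A misnamed file defeats that check, so one option is to look at the first bytes of the stream instead. The following is only a sketch using the JDK: WordFormatSniffer and its Container enum are hypothetical names, not part of POI. It recognises the OLE2 compound-file signature (used by .doc) and the ZIP local-file signature (used by .docx). Note that these signatures identify the container only; an .xls or .xlsx file would match as well.

```java
import java.io.IOException;
import java.io.InputStream;

public class WordFormatSniffer {
    public enum Container { OLE2, ZIP, UNKNOWN }

    // First eight bytes of every OLE2 compound file (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature "PK\3\4" (.docx, .xlsx, .pptx).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    /** Peeks at the stream header and rewinds, so the caller can still parse it. */
    public static Container sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(OLE2_MAGIC.length);
        byte[] head = new byte[OLE2_MAGIC.length];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break; // stream shorter than the header
            }
            read += n;
        }
        in.reset();
        if (startsWith(head, read, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(head, read, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, byte[] prefix) {
        if (length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With a BufferedInputStream around the FileInputStream, the result of sniff could replace the endsWith(".docx") test while leaving the rest of the snippet unchanged.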
answered Feb 14, 2014 at 12:44
You should consider using this library: Apache POI.
Excerpt from the website:
In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java. Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.
answered Apr 25, 2010 at 20:59
bragboy
Docmosis can read a doc and spit out the text in it. Requires some infrastructure to be installed (such as OpenOffice).
You can also use JODConverter.
answered May 12, 2010 at 8:14
Paul Jowett
TXT is a common plain-text format that can be read on most computers and mobile devices. TXT files are small, which makes storing text content convenient. This article demonstrates how to extract the text content of a Word document and save it as a .txt file by using Free Spire.Doc for Java.
Import JAR Dependency to Your Java Application
Method 1: Download Free Spire.Doc for Java and unzip it. Then add the Spire.Doc.jar file to your Java application as a dependency.
Method 2: For a Maven project, add the JAR dependency by putting the following configuration in pom.xml.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc.free</artifactId>
        <version>3.9.0</version>
    </dependency>
</dependencies>
Extract Text
import com.spire.doc.Document;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        // Load the Word document
        Document document = new Document();
        document.loadFromFile("Island.docx");
        // Get the document text as a string
        String text = document.getText();
        // Write the string to a .txt file
        writeStringToTxt(text, "Extracted.txt");
    }

    public static void writeStringToTxt(String content, String txtFileName) throws IOException {
        // The second argument (true) appends to an existing file instead of overwriting it.
        // try-with-resources flushes and closes the writer, even if write() fails.
        try (FileWriter fWriter = new FileWriter(txtFileName, true)) {
            fWriter.write(content);
        }
    }
}
In this article we discuss ways to read Word documents in Java using the Apache POI library. A Word document may contain images, tables or plain text, and a standard Word file also has headers and footers. In the following examples we parse a Word document by reading its paragraphs, runs, images and tables, along with its headers and footers. We also look at identifying the styles associated with paragraphs, such as font size, font family and font color.
Maven Dependencies
Following is the POI Maven dependency required to read Word documents. Check the Maven repository for the latest artifact version.
pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.16</version>
    </dependency>
</dependencies>
Reading Complete Text from Word Document
The XWPFDocument class has many methods for reading and extracting the contents of a .docx file. To read all of the text in a .docx Word document, wrap the document in an XWPFWordExtractor and call its getText() method. Following is an example.
TextReader.java
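import java.io.FileInputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class TextReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            // The extractor returns the text of the whole document as one string
            XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
            System.out.println(extractor.getText());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}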
Reading Headers and Footers of Word Document
Apache POI provides built-in methods to read the headers and footers of a Word document. Following is an example that reads and prints the default header and footer of a Word document. The example .docx file is available in the source, which can be downloaded at the end of this article.
HeaderFooterReader.java
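import java.io.FileInputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;

public class HeaderFooterReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);
            // Either part may be missing, so check for null before printing
            XWPFHeader header = policy.getDefaultHeader();
            if (header != null) {
                System.out.println(header.getText());
            }
            XWPFFooter footer = policy.getDefaultFooter();
            if (footer != null) {
                System.out.println(footer.getText());
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}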
Output
This is Header
This is footer
Read Each Paragraph of a Word Document
Among the many methods defined in the XWPFDocument class, getParagraphs() lets us read a .docx Word document paragraph by paragraph. It returns a list of all the paragraphs (XWPFParagraph) of the document. XWPFParagraph in turn has many utility methods for extracting information about a paragraph, such as its text alignment and the style associated with it.
For finer control over reading the text, each paragraph is further divided into runs. A run is a region of text with a common set of properties. Following is an example that reads paragraphs from a .docx Word document.
ParagraphReader.java
import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ParagraphReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            List<XWPFParagraph> paragraphList = xdoc.getParagraphs();
            for (XWPFParagraph paragraph : paragraphList) {
                System.out.println(paragraph.getText());
                System.out.println(paragraph.getAlignment());
                System.out.println(paragraph.getRuns().size());
                System.out.println(paragraph.getStyle());
                // Returns numbering format for this paragraph, eg bullet or lowerLetter.
                System.out.println(paragraph.getNumFmt());
                System.out.println(paragraph.isWordWrapped());
                System.out.println("********************************************************************");
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Reading Tables from Word Document
Following is an example that reads the tables present in a Word document. It prints the text of every cell, row by row.
TableReader.java
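import java.io.FileInputStream;
import java.util.Iterator;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.BodyElementType;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class TableReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            Iterator<IBodyElement> bodyElementIterator = xdoc.getBodyElementsIterator();
            while (bodyElementIterator.hasNext()) {
                IBodyElement element = bodyElementIterator.next();
                if (element.getElementType() == BodyElementType.TABLE) {
                    // Use the table element itself; asking the body for all of its
                    // tables here would print every table once per table found.
                    XWPFTable table = (XWPFTable) element;
                    System.out.println("Total Number of Rows of Table:" + table.getNumberOfRows());
                    for (XWPFTableRow row : table.getRows()) {
                        for (XWPFTableCell cell : row.getTableCells()) {
                            System.out.println(cell.getText());
                        }
                    }
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}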
Reading Styles from Word Document
Styles are associated with the runs of a paragraph. The XWPFRun class has many methods for identifying the styles applied to the text, for example whether it is bold, highlighted or capitalized.
StyleReader.java
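import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class StyleReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            List<XWPFParagraph> paragraphList = xdoc.getParagraphs();
            for (XWPFParagraph paragraph : paragraphList) {
                // Each run carries its own formatting flags
                for (XWPFRun rn : paragraph.getRuns()) {
                    System.out.println(rn.isBold());
                    System.out.println(rn.isHighlighted());
                    System.out.println(rn.isCapitalized());
                    System.out.println(rn.getFontSize());
                }
                System.out.println("********************************************************************");
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}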
Reading Image from Word Document
Following is an example to read image files from a word document.
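import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;

public class ImageReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            List<XWPFPictureData> pic = xdoc.getAllPictures();
            if (!pic.isEmpty()) {
                System.out.println(pic.get(0).getPictureType());
                // getData() returns the raw image bytes; print their size
                // rather than the array reference
                System.out.println(pic.get(0).getData().length);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}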
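If the goal is simply to save the embedded images to disk, the ZIP structure of a .docx file can also be read directly with the JDK. The sketch below is an illustration, not part of Apache POI: DocxMediaExtractor is a made-up class name, and it assumes the images sit under the word/media/ folder of the archive, which is where Word normally puts them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class DocxMediaExtractor {
    // Assumption: embedded images are stored under this folder in the archive.
    private static final String MEDIA_PREFIX = "word/media/";

    /** Copies every entry under word/media/ into targetDir and returns the files written. */
    public static List<Path> extractMedia(InputStream docx, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        List<Path> written = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(MEDIA_PREFIX)) {
                    continue;
                }
                // Flatten the entry name to its last path segment, e.g. "image1.png"
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                Path target = targetDir.resolve(fileName);
                // Files.copy reads up to the end of the current ZIP entry
                Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                written.add(target);
            }
        }
        return written;
    }
}
```

The images keep the file names and formats they have inside the archive, so no re-encoding takes place.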
Conclusion
I hope this article gave you what you were looking for. If you have anything to add or share, please do so in the comment section below.
Download source
The WordToTextConverter program converts a Microsoft Word 2007+ (.docx) file to a text file. It uses the Apache POI library to extract the text from the Word file.
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordToTextConverter {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: java WordToTextConverter <word_file> <text_file>");
            return;
        }
        convertWordToText(args[0], args[1]);
    }

    public static void convertWordToText(String src, String desc) {
        // try-with-resources closes the input stream and the file writer,
        // even if reading or writing fails
        try (FileInputStream fs = new FileInputStream(src);
             FileWriter fw = new FileWriter(desc)) {
            // create document object to wrap the file input stream
            XWPFDocument docx = new XWPFDocument(fs);
            // create text extractor object to extract text from the document
            XWPFWordExtractor extractor = new XWPFWordExtractor(docx);
            // write text to the output file
            fw.write(extractor.getText());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
In the code of the program above, XWPFDocument is used to construct a Microsoft Word document object from the FileInputStream object. FileInputStream object contains all data of the original Microsoft Word file. To extract all text from the document, you need to use the XWPFWordExtractor class. You will pass the document object to the constructor of the XWPFWordExtractor when you create an object of the XWPFWordExtractor class. From the XWPFWordExtractor object, you can get the text content of this object by using its getText method. Once you have the text, the FileWriter class can be used to output it to the destination file.
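One detail worth knowing when writing the extracted text: a FileWriter created without a charset uses the platform default encoding on JDKs before 18, so non-ASCII characters from the Word file may come out differently on different machines. A minimal sketch of writing the text as UTF-8 explicitly, using only the JDK, could look like this. Utf8TextWriter is a hypothetical helper name, not something from the program above.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8TextWriter {
    /** Writes the text to the target file as UTF-8, replacing any existing content. */
    public static void write(String text, Path target) throws IOException {
        // newBufferedWriter takes the charset explicitly, so the output
        // does not depend on the platform default encoding
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }
}
```

In the converter, the FileWriter lines could then be swapped for a call such as Utf8TextWriter.write(extractor.getText(), Paths.get(desc)).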
This chapter explains how to extract simple text data from a Word document using Java. In case you want to extract metadata from a Word document, make use of Apache Tika.
For .docx files, we use the class org.apache.poi.xwpf.extractor.XWPFWordExtractor, which extracts and returns simple text from a Word file. In the same way, there are other approaches for extracting headings, footnotes, table data, etc. from a Word file.
The following code shows how to extract simple text from a Word file −
import java.io.FileInputStream;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordExtractor {
    public static void main(String[] args) throws Exception {
        XWPFDocument docx = new XWPFDocument(new FileInputStream("createparagraph.docx"));
        // using XWPFWordExtractor class
        XWPFWordExtractor we = new XWPFWordExtractor(docx);
        System.out.println(we.getText());
    }
}
Save the above code as WordExtractor.java. Compile and execute it from the command prompt as follows −
$ javac WordExtractor.java
$ java WordExtractor
It will generate the following output −
At tutorialspoint.com, we strive hard to provide quality tutorials for self-learning purpose in the domains of Academics, Information Technology, Management and Computer Programming Languages.
This article demonstrates how to extract text and images from Word documents by using Spire.Doc for Java.
Extract Text
import com.spire.doc.Document;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {
    public static void main(String[] args) throws IOException {
        // load Word document (backslashes in a Java string must be escaped)
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");
        // get text from document as string
        String text = document.getText();
        // write string to a .txt file
        writeStringToTxt(text, "ExtractedText.txt");
    }

    public static void writeStringToTxt(String content, String txtFileName) throws IOException {
        // true = append to an existing file; try-with-resources closes the writer
        try (FileWriter fWriter = new FileWriter(txtFileName, true)) {
            fWriter.write(content);
        }
    }
}
Extract Images
import com.spire.doc.Document;
import com.spire.doc.documents.DocumentObjectType;
import com.spire.doc.fields.DocPicture;
import com.spire.doc.interfaces.ICompositeObject;
import com.spire.doc.interfaces.IDocumentObject;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class ExtractImages {
    public static void main(String[] args) throws IOException {
        // load Word document (backslashes in a Java string must be escaped)
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");
        // queue of composite objects still to be visited
        Queue<ICompositeObject> nodes = new LinkedList<>();
        nodes.add(document);
        // list of the images found so far
        List<BufferedImage> images = new ArrayList<>();
        // loop through the child objects of the document
        while (nodes.size() > 0) {
            ICompositeObject node = nodes.poll();
            for (int i = 0; i < node.getChildObjects().getCount(); i++) {
                IDocumentObject child = node.getChildObjects().get(i);
                if (child instanceof ICompositeObject) {
                    nodes.add((ICompositeObject) child);
                    // get each image and add it to the list
                    if (child.getDocumentObjectType() == DocumentObjectType.Picture) {
                        DocPicture picture = (DocPicture) child;
                        images.add(picture.getImage());
                    }
                }
            }
        }
        // save images as .png files; create the output folder first,
        // because ImageIO.write does not create missing directories
        new File("output").mkdirs();
        for (int i = 0; i < images.size(); i++) {
            File file = new File(String.format("output/ExtractedImage-%d.png", i));
            ImageIO.write(images.get(i), "PNG", file);
        }
    }
}