Java word to text

I’m wondering how you can convert Word .doc/.docx files to text files through Java. I understand that there’s an option where I can do this through Word itself but I would like to be able to do something like this:

java DocConvert somedocfile.doc converted.txt

Thanks.

asked Apr 25, 2010 at 20:55

Coding District

If you’re interested in a Java library that deals with Word document files, you might want to look at e.g. Apache POI. A quote from the website:

Why should I use Apache POI?

A major use of the Apache POI api is
for Text Extraction applications such
as web spiders, index builders, and
content management systems.


P.S.: If, on the other hand, you’re simply looking for a conversion utility, Stack Overflow may not be the most appropriate place to ask for this.


Edit: If you don’t want to use an existing library but do all the hard work yourself, you’ll be glad to hear that Microsoft has published the required file format specifications. (The Microsoft Open Specification Promise lists the available specifications. Just google for any of them that you’re interested in. In your case, you’d need e.g. the OLE2 Compound File Format, the Word 97 binary file format, and the Open XML formats.)
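To give an idea of what the do-it-yourself route looks like for .docx only: an Open XML file is a ZIP archive whose main body lives in the entry word/document.xml, with text stored in w:t elements nested inside w:p paragraphs. The following is a minimal, JDK-only sketch under those assumptions. The class name DocxPlainText and its methods are hypothetical helpers made up for illustration, not part of any library, and the sketch ignores tabs, line breaks, headers, footers, footnotes, and text boxes (text inside a text box may appear twice). The binary .doc format is far more involved and is not covered.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a library class): JDK-only text extraction from a .docx file.
final class DocxPlainText {

    // Namespace of the main WordprocessingML elements (w:p, w:r, w:t).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Reads the ZIP container and returns the body text, one line per paragraph. */
    static String extract(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if ("word/document.xml".equals(entry.getName())) {
                    return paragraphsOf(readFully(zip));
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    // Copies the current ZIP entry into memory so the XML parser never touches the ZIP stream.
    private static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        for (int n = in.read(buffer); n != -1; n = in.read(buffer)) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    private static String paragraphsOf(byte[] documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations so untrusted files cannot pull in external entities.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(documentXml));

        StringBuilder text = new StringBuilder();
        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        for (int p = 0; p < paragraphs.getLength(); p++) {
            // Concatenate every w:t text piece of this paragraph, then end the line.
            NodeList pieces = ((Element) paragraphs.item(p)).getElementsByTagNameNS(W_NS, "t");
            for (int t = 0; t < pieces.getLength(); t++) {
                text.append(pieces.item(t).getTextContent());
            }
            text.append('\n');
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("Usage: java DocxPlainText <input.docx> <output.txt>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Files.write(Paths.get(args[1]), extract(in).getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

For anything beyond a quick experiment, a library such as Apache POI handles the many details this sketch skips.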

answered Apr 25, 2010 at 20:59

stakx - no longer contributing

Use the Apache Tika command-line utility. Tika supports a wide range of formats (e.g. doc, docx, pdf, html, rtf, …)

java -jar tika-app-1.3.jar -t somedocfile.doc > converted.txt

Programmatically:

File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);

You can use Apache POI too. It ships text extractors for doc/docx files (see the Text Extraction page of its documentation). If you want to extract only the text, you can use the code below. If you also want rich-text information (such as formatting and styling), you can use Apache Tika.

Extract doc:

InputStream fis = new FileInputStream(...);
POITextExtractor extractor;
// if docx
if (fileName.toLowerCase().endsWith(".docx")) {
    XWPFDocument doc = new XWPFDocument(fis);
    extractor = new XWPFWordExtractor(doc);
} else {
    // if doc
    POIFSFileSystem fileSystem = new POIFSFileSystem(fis);
    extractor = ExtractorFactory.createExtractor(fileSystem);
}
String extractedText = extractor.getText();
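The snippet above picks the parser from the file extension, which goes wrong when a file is misnamed. A hedged alternative is to look at the first bytes of the file: OLE2 compound files (the container of binary .doc) start with D0 CF 11 E0 A1 B1 1A E1, and ZIP archives (the container of .docx) start with 50 4B 03 04. The sketch below uses only the JDK; the class name WordFormatSniffer and the Format enum are hypothetical names made up for this example. Note that the signatures identify the container only: an .xls file is also OLE2 and any ZIP archive matches the second signature.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a POI or Tika class): guesses the container format from magic bytes.
final class WordFormatSniffer {

    enum Format { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound file, the container used by binary .doc.
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature of a ZIP archive, the container used by .docx.
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a buffer holding the first bytes of a file. */
    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Format.ZIP;
        }
        return Format.UNKNOWN;
    }

    /** Reads up to eight bytes from the file and classifies them. */
    static Format sniff(Path file) throws IOException {
        byte[] header = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // A single read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n == -1) {
                    break;
                }
                filled += n;
            }
        }
        byte[] actual = new byte[filled];
        System.arraycopy(header, 0, actual, 0, filled);
        return sniff(actual);
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare the signed Java byte as an unsigned value.
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With this in place, the extension check could be replaced by something like WordFormatSniffer.sniff(path) == Format.ZIP, at the cost of opening the file once more.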

answered Feb 14, 2014 at 12:44

palhares

You should consider using this library: it's Apache POI.

Excerpt from the website

In short, you can read and write MS
Excel files using Java. In addition,
you can read and write MS Word and MS
PowerPoint files using Java. Apache
POI is your Java Excel solution (for
Excel 97-2008). We have a complete API
for porting other OOXML and OLE2
formats and welcome others to
participate.

answered Apr 25, 2010 at 20:59

bragboy

Docmosis can read a doc and spit out the text in it. Requires some infrastructure to be installed (such as OpenOffice).
You can also use JODConverter.

answered May 12, 2010 at 8:14

Paul Jowett

TXT is a common text format that can be used on many computers and mobile devices. The TXT document is known for its small size, and it makes the storage of text content more convenient. This article will demonstrate how to extract the text content in a Word document and save it as .txt format by using Free Spire.Doc for Java.

Import JAR Dependency to Your Java Application
Method 1: Download the Free Spire.Doc for Java and unzip it. Then add the Spire.Doc.jar file to your Java application as dependency.
Method 2: If you use Maven, add the following configuration to your pom.xml.

<repositories>
   <repository>
      <id>com.e-iceblue</id>
      <name>e-iceblue</name>
      <url>http://repo.e-iceblue.com/nexus/content/groups/public/</url>
   </repository>
</repositories>
<dependencies>
   <dependency>
      <groupId>e-iceblue</groupId>
      <artifactId>spire.doc.free</artifactId>
      <version>3.9.0</version>
   </dependency>
</dependencies>

Extract Text

import com.spire.doc.Document;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {

    public static void main(String[] args) throws IOException {

        //Load Word document
        Document document = new Document();
        document.loadFromFile("Island.docx");

        //Get text from document as string
        String text=document.getText();

        //Write string to a .txt file
        writeStringToTxt(text,"Extracted.txt");
    }

    public static void writeStringToTxt(String content, String txtFileName) throws IOException{

        FileWriter fWriter= new FileWriter(txtFileName,true);
        try {
            fWriter.write(content);
        }catch(IOException ex){
            ex.printStackTrace();
        }finally{
            try{
                fWriter.flush();
                fWriter.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}

In this article we discuss ways to read Word documents in Java using the Apache POI library. A Word document may contain images, tables, and plain text, and a standard Word file also has headers and footers. In the following examples we parse a Word document by reading its paragraphs, runs, images, and tables, along with its headers and footers. We also look at identifying the styles associated with the text, such as font size, font family, and font color.

Maven Dependencies

Following is the POI Maven dependency required to read Word documents. Newer artifact versions may be available.

pom.xml

	<dependencies>
		<dependency>
			<groupId>org.apache.poi</groupId>
			<artifactId>poi-ooxml</artifactId>
			<version>3.16</version>
		</dependency>
	</dependencies>

Reading Complete Text from Word Document

The class XWPFDocument has many methods for reading .docx file contents. To read all the text in a .docx Word document, wrap the document in an XWPFWordExtractor and call its getText() method. Following is an example.

TextReader.java

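import java.io.FileInputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class TextReader {

	public static void main(String[] args) {
		// try-with-resources closes the input stream when reading is done
		try (FileInputStream fis = new FileInputStream("test.docx")) {
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
			XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
			// Print the complete text of the document
			System.out.println(extractor.getText());
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}

}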

Reading Headers and Footers of Word Document

Apache POI provides built-in methods to read the headers and footers of a Word document. Following is an example that reads and prints the default header and footer of a Word document. The example .docx file is available in the source, which can be downloaded at the end of this article.

HeaderFooterReader.java

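import java.io.FileInputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;

public class HeaderFooterReader {

	public static void main(String[] args) {

		try (FileInputStream fis = new FileInputStream("test.docx")) {
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
			XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);

			// The default header/footer may be absent, so check for null
			XWPFHeader header = policy.getDefaultHeader();
			if (header != null) {
				System.out.println(header.getText());
			}

			XWPFFooter footer = policy.getDefaultFooter();
			if (footer != null) {
				System.out.println(footer.getText());
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}

	}

}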

Output

This is Header

This is footer

Read Each Paragraph of a Word Document

Among the many methods defined in the XWPFDocument class, getParagraphs() lets us read a .docx Word document paragraph by paragraph. It returns a list of all the paragraphs (XWPFParagraph) of the document. XWPFParagraph in turn has many utility methods for extracting information about a paragraph, such as its text alignment and the style associated with it.

To give more control over reading the text, each paragraph is further divided into runs. A run is a region of text with a common set of properties. Following is an example that reads paragraphs from a .docx Word document.

ParagraphReader.java

import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ParagraphReader {

	public static void main(String[] args) {
		try (FileInputStream fis = new FileInputStream("test.docx")) {
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));

			List<XWPFParagraph> paragraphList = xdoc.getParagraphs();

			for (XWPFParagraph paragraph : paragraphList) {

				System.out.println(paragraph.getText());
				System.out.println(paragraph.getAlignment());
				// Number of runs the paragraph is split into
				System.out.println(paragraph.getRuns().size());
				System.out.println(paragraph.getStyle());

				// Returns numbering format for this paragraph, eg bullet or lowerLetter.
				System.out.println(paragraph.getNumFmt());

				System.out.println(paragraph.isWordWrapped());

				System.out.println("********************************************************************");
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}
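To make the paragraph and run structure more concrete, here is a small JDK-only sketch that works on raw WordprocessingML rather than on POI objects. It assumes the usual layout of word/document.xml, in which w:p elements contain w:r runs, each run carries its text in w:t elements, and a w:b element inside the run properties marks direct bold formatting. The class RunLister and its output format are hypothetical, made up for illustration; the sketch only sees formatting set directly on a run, not formatting inherited from a style.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a POI class): lists the runs of each paragraph in raw WordprocessingML.
final class RunLister {

    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns one entry per run, formatted as "p<paragraph index> bold=<flag> text=<text>". */
    static List<String> describeRuns(String documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document dom = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(documentXml.getBytes(StandardCharsets.UTF_8)));

        List<String> lines = new ArrayList<>();
        NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
        for (int p = 0; p < paragraphs.getLength(); p++) {
            NodeList runs = ((Element) paragraphs.item(p)).getElementsByTagNameNS(W_NS, "r");
            for (int r = 0; r < runs.getLength(); r++) {
                Element run = (Element) runs.item(r);
                lines.add("p" + p + " bold=" + isBold(run) + " text=" + textOf(run));
            }
        }
        return lines;
    }

    // A run is directly bold when it has a w:b element that is not switched off explicitly.
    private static boolean isBold(Element run) {
        NodeList bold = run.getElementsByTagNameNS(W_NS, "b");
        if (bold.getLength() == 0) {
            return false;
        }
        String val = ((Element) bold.item(0)).getAttributeNS(W_NS, "val");
        return !(val.equals("false") || val.equals("0") || val.equals("off"));
    }

    // Joins all w:t text pieces of the run.
    private static String textOf(Element run) {
        StringBuilder text = new StringBuilder();
        NodeList pieces = run.getElementsByTagNameNS(W_NS, "t");
        for (int t = 0; t < pieces.getLength(); t++) {
            text.append(pieces.item(t).getTextContent());
        }
        return text.toString();
    }
}
```

A paragraph such as "Bold plain", where only the first word is bold, would come out as two runs: one with bold=true and one with bold=false. This is the same split that paragraph.getRuns() exposes in POI.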

Reading Tables from Word Document

Following is an example that reads the tables present in a Word document. It prints the text of every cell, row by row.

TableReader.java

import java.io.FileInputStream;
import java.util.Iterator;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class TableReader {

	public static void main(String[] args) {
		try (FileInputStream fis = new FileInputStream("test.docx")) {
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
			Iterator<IBodyElement> bodyElementIterator = xdoc.getBodyElementsIterator();
			while (bodyElementIterator.hasNext()) {
				IBodyElement element = bodyElementIterator.next();

				// Handle each table element once; paragraphs are skipped
				if (element instanceof XWPFTable) {
					XWPFTable table = (XWPFTable) element;
					System.out.println("Total Number of Rows of Table:" + table.getNumberOfRows());
					for (XWPFTableRow row : table.getRows()) {
						for (XWPFTableCell cell : row.getTableCells()) {
							System.out.println(cell.getText());
						}
					}
				}
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Reading Styles from Word Document

Styles are associated with the runs of a paragraph. The XWPFRun class offers many methods to identify the styles applied to the text, for example whether it is bold, highlighted, or capitalized.

StyleReader.java

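import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class StyleReader {

	public static void main(String[] args) {
		try (FileInputStream fis = new FileInputStream("test.docx")) {
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));

			List<XWPFParagraph> paragraphList = xdoc.getParagraphs();

			for (XWPFParagraph paragraph : paragraphList) {

				// Style flags are stored per run, not per paragraph
				for (XWPFRun rn : paragraph.getRuns()) {

					System.out.println(rn.isBold());
					System.out.println(rn.isHighlighted());
					System.out.println(rn.isCapitalized());
					System.out.println(rn.getFontSize());
				}

				System.out.println("********************************************************************");
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}

	}

}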

Reading Image from Word Document

Following is an example to read image files from a word document.

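import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;

public class ImageReader {

	public static void main(String[] args) {

		try (FileInputStream fis = new FileInputStream("test.docx")) {
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
			List<XWPFPictureData> pic = xdoc.getAllPictures();
			if (!pic.isEmpty()) {
				// Picture type is an int constant; getData() returns the raw image bytes
				System.out.println(pic.get(0).getPictureType());
				System.out.println(pic.get(0).getData().length + " bytes");
			}

		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}

}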

Conclusion

I hope this article gave you what you were looking for. If you have anything to add or share, please do so in the comment section below.

Download source

The WordToTextConverter program converts a Microsoft Word 2007+ (.docx) file to a text file. Again, the program uses the Apache POI library (Apache Tika is an alternative) to extract text from the Microsoft Word 2007+ file.

Word file to text converter

import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordToTextConverter {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: java WordToTextConverter <word_file> <text_file>");
            return;
        }
        convertWordToText(args[0], args[1]);
    }

    public static void convertWordToText(String src, String desc) {
        // try-with-resources closes the input stream and the file writer,
        // even if reading or writing fails
        try (FileInputStream fs = new FileInputStream(src);
             FileWriter fw = new FileWriter(desc)) {
            // create document object to wrap the file input stream
            XWPFDocument docx = new XWPFDocument(fs);
            // create text extractor object to extract text from the document
            XWPFWordExtractor extractor = new XWPFWordExtractor(docx);
            // write text to the output file
            fw.write(extractor.getText());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

In the code of the program above,  XWPFDocument is used to construct a Microsoft Word document object from the FileInputStream object. FileInputStream object contains all data of the original Microsoft Word file. To extract all text from the document, you need to use the  XWPFWordExtractor class. You will pass the document object to the constructor of the XWPFWordExtractor when you create an object of the XWPFWordExtractor class. From the XWPFWordExtractor object, you can get the text content of this object by using its getText method. Once you have the text, the FileWriter class can be used to output it to the destination file.
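One detail worth keeping in mind when writing the extracted text: on Java versions before 18, a FileWriter created from just a file name encodes with the platform default charset, so non-ASCII characters from the Word document can come out differently on different machines. The following JDK-only sketch shows one way to write the text as UTF-8 regardless of platform. The class name Utf8TextWriter is a hypothetical helper, not part of POI, and the text passed in is assumed to come from an extractor such as the one above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not a POI class): writes extracted text with an explicit encoding.
final class Utf8TextWriter {

    /** Writes the text to the target file as UTF-8, replacing any existing content. */
    static void write(Path target, String text) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("Usage: java Utf8TextWriter <text_file> <text>");
            return;
        }
        write(Paths.get(args[0]), args[1]);
    }
}
```

In the converter above, the call fw.write(extractor.getText()) could be swapped for Utf8TextWriter.write(Paths.get(desc), extractor.getText()) if a fixed encoding is wanted.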


This chapter explains how to extract simple text data from a Word document using Java. In case you want to extract metadata from a Word document, make use of Apache Tika.

For .docx files, we use the class org.apache.poi.xwpf.extractor.XWPFWordExtractor, which extracts and returns the plain text of a Word file. In the same way, there are different approaches for extracting headings, footnotes, table data, etc. from a Word file.

The following code shows how to extract simple text from a Word file −

import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordExtractor {
   public static void main(String[] args)throws Exception {
      XWPFDocument docx = new XWPFDocument(new FileInputStream("createparagraph.docx"));
      
      //using XWPFWordExtractor Class
      XWPFWordExtractor we = new XWPFWordExtractor(docx);
      System.out.println(we.getText());
   }
}

Save the above code as WordExtractor.java. Compile and execute it from the command prompt as follows −

$javac WordExtractor.java
$java WordExtractor

It will generate the following output −

At tutorialspoint.com, we strive hard to provide quality tutorials for self-learning purpose
in the domains of Academics, Information Technology, Management and Computer Programming Languages.

This article demonstrates how to extract text and images from Word documents by using Spire.Doc for Java.

Extract Text

import com.spire.doc.Document;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {

    public static void main(String[] args) throws IOException {

        //load Word document
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");

        //get text from document as string
        String text=document.getText();

        //write string to a .txt file
        writeStringToTxt(text,"ExtractedText.txt");
    }

    public static void writeStringToTxt(String content, String txtFileName) throws IOException{

        FileWriter fWriter= new FileWriter(txtFileName,true);
        try {
            fWriter.write(content);
        }catch(IOException ex){
            ex.printStackTrace();
        }finally{
            try{
                fWriter.flush();
                fWriter.close();
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
    }
}

Extract Images

import com.spire.doc.Document;
import com.spire.doc.documents.DocumentObjectType;
import com.spire.doc.fields.DocPicture;
import com.spire.doc.interfaces.ICompositeObject;
import com.spire.doc.interfaces.IDocumentObject;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

public class ExtractImages {

    public static void main(String[] args) throws IOException {

        //load word document
        Document document = new Document();
        document.loadFromFile("C:\\Users\\Administrator\\Desktop\\sample.docx");

        //create a Queue object
        Queue<ICompositeObject> nodes = new LinkedList<>();
        nodes.add(document);

        //create a List object
        List<BufferedImage> images = new ArrayList<>();

        //loop through the child objects of the document
        while (nodes.size() > 0) {
            ICompositeObject node = nodes.poll();
            for (int i = 0; i < node.getChildObjects().getCount(); i++) {
                IDocumentObject child = node.getChildObjects().get(i);
                if (child instanceof ICompositeObject) {
                    nodes.add((ICompositeObject) child);

                    //get each image and add it to the list
                    if (child.getDocumentObjectType() == DocumentObjectType.Picture) {
                        DocPicture picture = (DocPicture) child;
                        images.add(picture.getImage());
                    }
                }
            }
        }

        //save images as .png files
        for (int i = 0; i < images.size(); i++) {
            File file = new File(String.format("output/ExtractedImage-%d.png", i));
            ImageIO.write(images.get(i), "PNG", file);
        }
    }
}
