I want to read a Word file in Java
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.hpsf.DocumentSummaryInformation;
import org.apache.poi.hwpf.*;
import org.apache.poi.hwpf.extractor.*;
import org.apache.poi.hwpf.usermodel.HeaderStories;
import java.io.*;
public class ReadDocFileFromJava {
public static void main(String[] args) {
/**This is the document that you want to read using Java.**/
String fileName = "C:\\Path to file\\Test.doc";
/**Method call to read the document (demonstrates some usage of POI)**/
readMyDocument(fileName);
}
public static void readMyDocument(String fileName){
/** try-with-resources closes the input stream even if reading fails **/
try (FileInputStream in = new FileInputStream(fileName)) {
POIFSFileSystem fs = new POIFSFileSystem(in);
HWPFDocument doc = new HWPFDocument(fs);
/** Read the content **/
readParagraphs(doc);
int pageNumber=1;
/** We will try reading the header for page 1**/
readHeader(doc, pageNumber);
/** Let's try reading the footer for page 1**/
readFooter(doc, pageNumber);
/** Read the document summary**/
readDocumentSummary(doc);
} catch (Exception e) {
e.printStackTrace();
}
}
public static void readParagraphs(HWPFDocument doc) throws Exception{
WordExtractor we = new WordExtractor(doc);
/**Get the total number of paragraphs**/
String[] paragraphs = we.getParagraphText();
System.out.println("Total Paragraphs: "+paragraphs.length);
for (int i = 0; i < paragraphs.length; i++) {
System.out.println("Length of paragraph "+(i +1)+": "+ paragraphs[i].length());
System.out.println(paragraphs[i]);
}
}
public static void readHeader(HWPFDocument doc, int pageNumber){
HeaderStories headerStore = new HeaderStories( doc);
String header = headerStore.getHeader(pageNumber);
System.out.println("Header Is: "+header);
}
public static void readFooter(HWPFDocument doc, int pageNumber){
HeaderStories headerStore = new HeaderStories( doc);
String footer = headerStore.getFooter(pageNumber);
System.out.println("Footer Is: "+footer);
}
public static void readDocumentSummary(HWPFDocument doc) {
DocumentSummaryInformation summaryInfo=doc.getDocumentSummaryInformation();
String category = summaryInfo.getCategory();
String company = summaryInfo.getCompany();
int lineCount=summaryInfo.getLineCount();
int sectionCount=summaryInfo.getSectionCount();
int slideCount=summaryInfo.getSlideCount();
System.out.println("---------------------------");
System.out.println("Category: "+category);
System.out.println("Company: "+company);
System.out.println("Line Count: "+lineCount);
System.out.println("Section Count: "+sectionCount);
System.out.println("Slide Count: "+slideCount);
}
}
http://sanjaal.com/java/tag/java-and-docx-format/
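The HWPF classes used above read only the legacy binary .doc format; a .docx file is a ZIP-based package and needs the XWPF classes instead. The following is a minimal sketch, assuming you want to tell the two apart by their leading bytes rather than by file extension. The class and method names (`WordFormatSniffer`, `detect`) are hypothetical and not part of POI; the byte patterns are the OLE2 compound-file signature and the ZIP local-file-header signature.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not part of Apache POI): guesses the Word format from a file's first bytes.
class WordFormatSniffer {

    // OLE2 compound file signature, used by legacy .doc files.
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature, used by .docx packages.
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    static String detect(byte[] head) {
        if (startsWith(head, OLE2)) return "doc";   // candidate for HWPFDocument
        if (startsWith(head, ZIP)) return "docx";   // candidate for XWPFDocument
        return "unknown";
    }

    private static boolean startsWith(byte[] data, int[] signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WordFormatSniffer <file>");
            return;
        }
        byte[] head = new byte[8];
        int n;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            n = in.read(head); // a single read is enough for this sketch
        }
        System.out.println(detect(Arrays.copyOf(head, Math.max(n, 0))));
    }
}
```

Note that a ZIP signature only shows that the file is some ZIP archive (it could equally be an .xlsx or a plain .zip), so treat the result as a hint for which reader to try first.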
I want to read a doc or docx file in Java
README
What is docx4j?
docx4j is an open source (Apache v2) library for creating, editing, and saving OpenXML "packages", including docx, pptx, and xlsx.
It uses JAXB to create the Java representation.
- Open existing docx/pptx/xlsx
- Create new docx/pptx/xlsx
- Programmatically manipulate docx/pptx/xlsx (anything the file format allows)
- Document generation via variable, content control data binding, or MERGEFIELD
- CustomXML binding (with support for pictures, rich text, checkboxes, and OpenDoPE extensions for repeats & conditionals, and importing XHTML)
- Export as HTML
- Export as PDF, choice of 3 strategies, see https://www.docx4java.org/blog/2020/09/office-pptxxlsxdocx-to-pdf-to-in-docx4j-8-2-3/
- Produce/consume Word 2007’s xmlPackage (pkg) format
- Apply transforms, including common filters
- Font support (font substitution, and use of any fonts embedded in the document)
docx4j for JAXB 3.0 and Java 11+
docx4j v11.4.5 uses Jakarta XML Binding API 3.0, as opposed to JAXB 2.x used in earlier versions (which import javax.xml.bind.*). Since this release uses jakarta.xml.bind, rather than javax.xml.bind, if you have existing code which imports javax.xml.bind, you’ll need to search/replace across your code base, replacing javax.xml.bind with jakarta.xml.bind. You’ll also need to replace your JAXB jars (which Maven will do for you automatically; otherwise get them from the relevant zip file).
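The search/replace described above can be sketched in a few lines of Java. This is only an illustration of the text substitution on a single source string: the class and method names (`JaxbImportMigrator`, `migrate`) are hypothetical, it is not a docx4j tool, and for a real code base an IDE refactoring or a build-level migration tool is likely a better fit.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of docx4j): rewrites javax.xml.bind references to jakarta.xml.bind.
class JaxbImportMigrator {

    // Word boundaries keep unrelated packages such as javax.xml.parsers untouched.
    private static final Pattern JAVAX_BIND = Pattern.compile("\\bjavax\\.xml\\.bind\\b");

    static String migrate(String source) {
        return JAVAX_BIND.matcher(source).replaceAll("jakarta.xml.bind");
    }

    public static void main(String[] args) {
        String before = "import javax.xml.bind.JAXBContext;\nimport javax.xml.parsers.DocumentBuilder;";
        // Only the first import line is rewritten.
        System.out.println(migrate(before));
    }
}
```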
Being a JPMS modularised release, the jars also contain module-info.class entries.
To use it, add the dep corresponding to the JAXB implementation you wish to use
docx4j-8
This is docx4j for Java 8. Although in principle it would compile and run under Java 6, some of its
dependencies are Java 8 only. So to run it under Java 6, you’d need to use the same version of the deps
which docx4j 6.x uses.
docx4j v8 is a multi-module Maven project.
To use docx4j v8, add the dep corresponding to the JAXB implementation you wish to use
You should use one and only one of docx4j-JAXB-*
How do I build docx4j?
Get it from GitHub, at https://github.com/plutext/docx4j
Some of the tests might fail on Windows. For now, you could skip them: mvn install -DskipTests
For more details, see http://www.docx4java.org/blog/2015/06/docx4j-from-github-in-eclipse-5-years-on/
If you are working with the source code, please join the developer
mailing list:
docx4j-dev-subscribe@docx4java.org
Where do I get a binary?
http://www.docx4java.org/downloads.html
How do I get started?
See the Getting Started guide: https://github.com/plutext/docx4j/tree/master/docs
and the Cheat Sheet: http://www.docx4java.org/blog/2013/05/docx4j-in-a-single-page/
And see the sample code: https://github.com/plutext/docx4j/tree/master/src/samples
You’ll probably want the Helper AddIn to generate code: http://www.docx4java.org/blog/2016/05/docx4j-helper-word-addin-new-version-v3-3-0/
Where to get help?
http://www.docx4java.org/forums or StackOverflow (use tag ‘docx4j’)
Please post to one or the other, not both
Legal Information
docx4j is published under the Apache License version 2.0. For the license
text, please see the following files in the legals directory:
- LICENSE
- NOTICE
Legal information on libraries used by docx4j can be found in the
"legals/NOTICE" file.
This article continues our tour of the Apache POI library. In the previous article we learned how to create new Word documents in Java; today we look at a simple example of reading data from files in docx format.
Reading a Word document with Apache POI
Let's briefly go over how the library handles documents, headers and footers, and paragraphs. A docx document read into memory is an instance of the XWPFDocument class, which we will break down into its parts. For that we need a few dedicated classes:
- The XWPFHeader and XWPFFooter classes, for working with (reading/creating) headers and footers. They are accessed through the provider class XWPFHeaderFooterPolicy.
- The XWPFParagraph class, for working with paragraphs.
- The XWPFWordExtractor class, for parsing the entire content of a docx document.
Apache POI contains many other useful classes for working with tables and media objects inside a Word document, but in this introductory article we limit ourselves to headers, footers, and parsing text.
Example of reading a Word document in docx format with Apache POI
Now let's add the Apache POI library for working with Word in the docx format specifically. I use Maven, so I simply add one more dependency to the project:
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.11</version>
</dependency>
If you use Gradle or want to add the library to the project manually, you can find it here.
I will parse the docx document produced in the previous article, Creating a Word file. You can use your own file.
Now let's write a simple class that reads data from the document's headers, footers, and paragraphs:
package ua.com.prologistic.excel;

import org.apache.poi.openxml4j.opc.OPCPackage;
// needed only once the commented-out block at the end of main is enabled
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

import java.io.FileInputStream;
import java.util.List;

public class WordReader {
    public static void main(String[] args) {
        try (FileInputStream fileInputStream = new FileInputStream("F:/Apache POI Word Test.docx")) {
            // open the file and read its contents into an XWPFDocument object
            XWPFDocument docxFile = new XWPFDocument(OPCPackage.open(fileInputStream));
            XWPFHeaderFooterPolicy headerFooterPolicy = new XWPFHeaderFooterPolicy(docxFile);

            // read the document header
            XWPFHeader docHeader = headerFooterPolicy.getDefaultHeader();
            System.out.println(docHeader.getText());

            // print the content of every paragraph of the document to the console
            List<XWPFParagraph> paragraphs = docxFile.getParagraphs();
            for (XWPFParagraph p : paragraphs) {
                System.out.println(p.getText());
            }

            // read the document footer
            XWPFFooter docFooter = headerFooterPolicy.getDefaultFooter();
            System.out.println(docFooter.getText());

            /*System.out.println("_____________________________________");
            // print the entire content of the Word file
            XWPFWordExtractor extractor = new XWPFWordExtractor(docxFile);
            System.out.println(extractor.getText());*/
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Run it and look at the console:
Header - created with Apache POI in Java :)
Prologistic.com.ua - new articles on Java and Android every week. Subscribe!
Just a footer
Beginner Java programmers, note that we used the try-with-resources construct, a Java 7 feature that closes the FileInputStream automatically when the block exits.
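To illustrate what try-with-resources guarantees, here is a minimal sketch that uses only the JDK. The class names (`TryWithResourcesDemo`, `TrackedResource`) are hypothetical stand-ins for a file stream; the point is that the resource is closed automatically, and that this happens before the catch block runs.

```java
// Hypothetical demo class: records the order in which a resource is opened, used, and closed.
class TryWithResourcesDemo {

    static final StringBuilder LOG = new StringBuilder();

    // Stand-in for a FileInputStream: anything that implements AutoCloseable works.
    static class TrackedResource implements AutoCloseable {
        private final String name;

        TrackedResource(String name) {
            this.name = name;
            LOG.append("open ").append(name).append(';');
        }

        @Override
        public void close() {
            LOG.append("close ").append(name).append(';');
        }
    }

    static String run() {
        LOG.setLength(0);
        try (TrackedResource resource = new TrackedResource("docx")) {
            LOG.append("read;");
            throw new IllegalStateException("simulated parse failure");
        } catch (IllegalStateException ex) {
            // By the time we get here, close() has already been called.
            LOG.append("caught;");
        }
        return LOG.toString();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "open docx;read;close docx;caught;"
    }
}
```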
Another way to read the contents of a Word file
The example above first parses the individual parts of the document and then prints their contents to the console. But what if we just want to see the whole content of the file at once? For this, Apache POI has a dedicated class, XWPFWordExtractor, which does what we need in two lines.
Simply uncomment the code in the listing above and run the project again. The same text is then printed to the console a second time, after a separator line.
In this article we discuss how to read Word documents in Java using the Apache POI library. A Word document may contain images, tables, or plain text, and a standard Word file also has headers and footers. In the following examples we parse a Word document by reading its paragraphs, runs, images, and tables, along with its headers and footers. We also look at how to identify the styles associated with the text, such as font size, font family, and font color.
Maven Dependencies
The following POI Maven dependency is required to read Word documents. For the latest artifacts visit here
pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.16</version>
    </dependency>
</dependencies>
Reading Complete Text from Word Document
The XWPFDocument class has many methods for reading and extracting the contents of a .docx file. To read all the text in a .docx Word document at once, wrap the XWPFDocument in an XWPFWordExtractor and call its getText() method. Following is an example.
TextReader.java
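import java.io.FileInputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class TextReader {
    public static void main(String[] args) {
        // try-with-resources closes the stream when the block exits
        try (FileInputStream fis = new FileInputStream("test.docx")) {
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            // the extractor returns the text of the whole document as one string
            XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
            System.out.println(extractor.getText());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}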
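For a sense of what a text extractor has to do, it helps to know that a .docx file is a ZIP package whose main body text conventionally lives in the word/document.xml part. The sketch below uses only the JDK, not POI, and is not how XWPFWordExtractor is implemented. The names `RawDocxText`, `paragraphs`, and `sampleDocx` are hypothetical, and the sample is a stripped-down package containing a single part, so it is not a complete, valid .docx file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical JDK-only sketch; not part of Apache POI.
class RawDocxText {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one string per <w:p> element, built from the <w:t> text nodes inside it.
    static List<String> paragraphs(InputStream docx) throws Exception {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the part into memory so the XML parser cannot close the ZIP stream.
                ByteArrayOutputStream part = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    part.write(buffer, 0, read);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations, since the file may come from an untrusted source.
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(part.toByteArray()));
                NodeList paragraphNodes = xml.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphNodes.getLength(); i++) {
                    Element paragraph = (Element) paragraphNodes.item(i);
                    NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
                    StringBuilder text = new StringBuilder();
                    for (int j = 0; j < texts.getLength(); j++) {
                        text.append(texts.item(j).getTextContent());
                    }
                    result.add(text.toString());
                }
            }
        }
        return result;
    }

    // Builds a tiny in-memory package holding only word/document.xml, for demonstration.
    static byte[] sampleDocx() throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t xml:space=\"preserve\"> world</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        for (String paragraph : paragraphs(new ByteArrayInputStream(sampleDocx()))) {
            System.out.println(paragraph);
        }
    }
}
```

Paragraphs nested inside tables are returned as well, and headers, footers, footnotes, and fields are ignored entirely, which is part of why a library such as POI is the more practical choice for real documents.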
Reading Headers and Footers of a Word Document
Apache POI provides built-in methods to read the headers and footers of a Word document. The following example reads and prints the default header and footer. The example .docx file is available in the source, which can be downloaded at the end of this article.
HeaderFooterReader.java
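import java.io.FileInputStream;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;

public class HeaderFooterReader {
    public static void main(String[] args) {
        try (FileInputStream fis = new FileInputStream("test.docx")) {
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);

            // the default header or footer is null when the document does not define one
            XWPFHeader header = policy.getDefaultHeader();
            if (header != null) {
                System.out.println(header.getText());
            }
            XWPFFooter footer = policy.getDefaultFooter();
            if (footer != null) {
                System.out.println(footer.getText());
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}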
Output
This is Header
This is footer
Read Each Paragraph of a Word Document
Among the many methods defined in the XWPFDocument class, we can use getParagraphs() to read a .docx Word document paragraph by paragraph. This method returns a list of all the paragraphs (XWPFParagraph) of a Word document. XWPFParagraph in turn has many utility methods for extracting information about a paragraph, such as its text alignment and associated style.
To give more control over reading the text of a Word document, each paragraph is further divided into multiple runs. A run defines a region of text with a common set of properties. Following is an example that reads paragraphs from a .docx Word document.
ParagraphReader.java
import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ParagraphReader {
    public static void main(String[] args) {
        try (FileInputStream fis = new FileInputStream("test.docx")) {
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            List<XWPFParagraph> paragraphList = xdoc.getParagraphs();
            for (XWPFParagraph paragraph : paragraphList) {
                System.out.println(paragraph.getText());
                System.out.println(paragraph.getAlignment());
                // number of runs in this paragraph
                System.out.println(paragraph.getRuns().size());
                System.out.println(paragraph.getStyle());
                // Returns numbering format for this paragraph, eg bullet or lowerLetter.
                System.out.println(paragraph.getNumFmt());
                System.out.println(paragraph.isWordWrapped());
                System.out.println("********************************************************************");
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Reading Tables from Word Document
Following is an example that reads the tables present in a Word document. It prints the text of every cell, row by row.
TableReader.java
import java.io.FileInputStream;
import java.util.Iterator;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class TableReader {
    public static void main(String[] args) {
        try (FileInputStream fis = new FileInputStream("test.docx")) {
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            Iterator<IBodyElement> bodyElementIterator = xdoc.getBodyElementsIterator();
            while (bodyElementIterator.hasNext()) {
                IBodyElement element = bodyElementIterator.next();
                // Handle each table element once. Calling element.getBody().getTables() here
                // would return every table in the body again for each table encountered.
                if (element instanceof XWPFTable) {
                    XWPFTable table = (XWPFTable) element;
                    System.out.println("Total Number of Rows of Table:" + table.getNumberOfRows());
                    for (XWPFTableRow row : table.getRows()) {
                        for (XWPFTableCell cell : row.getTableCells()) {
                            System.out.println(cell.getText());
                        }
                    }
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Reading Styles from Word Document
Styles are associated with the runs of a paragraph. The XWPFRun class has many methods for identifying the styles applied to the text, such as whether it is bold, highlighted, or capitalized.
StyleReader.java
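import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class StyleReader {
    public static void main(String[] args) {
        try (FileInputStream fis = new FileInputStream("test.docx")) {
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            List<XWPFParagraph> paragraphList = xdoc.getParagraphs();
            for (XWPFParagraph paragraph : paragraphList) {
                // each run is a stretch of text that shares one set of formatting properties
                for (XWPFRun rn : paragraph.getRuns()) {
                    System.out.println(rn.isBold());
                    System.out.println(rn.isHighlighted());
                    System.out.println(rn.isCapitalized());
                    System.out.println(rn.getFontSize());
                }
                System.out.println("********************************************************************");
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}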
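To make the idea of a run more concrete: in the underlying WordprocessingML, each run is a `w:r` element whose optional `w:rPr` child carries formatting such as `w:b` for bold. The sketch below parses a hand-written paragraph fragment with the JDK's XML parser and reports which runs are directly marked bold. It is an assumption-laden illustration rather than POI's implementation: the names `RunStyleSketch`, `describeRuns`, and `isBold` are hypothetical, and formatting inherited from styles is ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical JDK-only sketch; not part of Apache POI.
class RunStyleSketch {

    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // A paragraph with three runs: bold, unformatted, and bold explicitly switched off.
    static final String SAMPLE = "<w:p xmlns:w=\"" + W_NS + "\">"
            + "<w:r><w:rPr><w:b/></w:rPr><w:t>Bold</w:t></w:r>"
            + "<w:r><w:t>plain</w:t></w:r>"
            + "<w:r><w:rPr><w:b w:val=\"false\"/></w:rPr><w:t>off</w:t></w:r>"
            + "</w:p>";

    // Returns one "bold:text" or "plain:text" entry per <w:r> element.
    static List<String> describeRuns(String paragraphXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Element root = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(paragraphXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        NodeList runs = root.getElementsByTagNameNS(W_NS, "r");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < runs.getLength(); i++) {
            Element run = (Element) runs.item(i);
            result.add((isBold(run) ? "bold" : "plain") + ":" + textOf(run));
        }
        return result;
    }

    // A run counts as bold when it has a <w:b> element that is not switched off via w:val.
    static boolean isBold(Element run) {
        NodeList bold = run.getElementsByTagNameNS(W_NS, "b");
        if (bold.getLength() == 0) {
            return false;
        }
        String val = ((Element) bold.item(0)).getAttributeNS(W_NS, "val");
        return !("false".equals(val) || "0".equals(val) || "off".equals(val));
    }

    static String textOf(Element run) {
        NodeList texts = run.getElementsByTagNameNS(W_NS, "t");
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < texts.getLength(); i++) {
            text.append(texts.item(i).getTextContent());
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describeRuns(SAMPLE)); // prints [bold:Bold, plain:plain, plain:off]
    }
}
```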
Reading Image from Word Document
Following is an example to read image files from a word document.
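import java.io.FileInputStream;
import java.util.List;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;

public class ImageReader {
    public static void main(String[] args) {
        try (FileInputStream fis = new FileInputStream("test.docx")) {
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            List<XWPFPictureData> pic = xdoc.getAllPictures();
            if (!pic.isEmpty()) {
                // picture type is an int constant defined by POI
                System.out.println(pic.get(0).getPictureType());
                // getData() returns the raw image bytes; print their count rather than the array reference
                System.out.println(pic.get(0).getData().length);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}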
Conclusion
I hope this article gave you what you were looking for. If you have anything to add or share, please do so in the comment section below.