I'd like to read the "text8" corpus in Java and reformat some words. The problem is that in this 100 MB corpus all words are on one line. So if I try to load it with BufferedReader and readLine, it takes up too much memory at once and cannot cope with splitting all the words into one list/array.
So my question: is it possible in Java to read a corpus word by word instead of line by line? For example, since all words are on one line, could I read 100 words per iteration?
Kara
asked Nov 4, 2015 at 10:18
You can try using Scanner and set the delimiter to whatever suits you:
Scanner input = new Scanner(myFile);
input.useDelimiter(" +"); // delimiter is one or more spaces
while (input.hasNext()) {
    System.out.println(input.next());
}
answered Nov 4, 2015 at 10:32
nafas
I would suggest using a character stream with FileReader.
Here is the example code from http://www.tutorialspoint.com/java/java_files_io.htm:
import java.io.*;

public class CopyFile {
    public static void main(String args[]) throws IOException {
        FileReader in = null;
        FileWriter out = null;
        try {
            in = new FileReader("input.txt");
            out = new FileWriter("output.txt");
            int c;
            // read() returns one character at a time, or -1 at end of file
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        } finally {
            if (in != null) {
                in.close();
            }
            if (out != null) {
                out.close();
            }
        }
    }
}
It reads 16-bit Unicode characters, so it doesn't matter that your text is all on one line.
Since you want to go word by word, you can simply read until you hit a space, and there's your word.
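The idea of reading until a space can be sketched as follows. This is only an illustration under assumptions: the class name CharWordReader and the method nextWord are made up for this example, any whitespace character is treated as a word boundary, and a StringReader stands in for the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CharWordReader {
    // Hypothetical helper: returns the next run of non-whitespace characters,
    // or null once the input is exhausted.
    static String nextWord(Reader in) {
        StringBuilder word = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (word.length() > 0) {
                        return word.toString(); // whitespace ends the current word
                    }
                    // otherwise: skip leading or repeated whitespace
                } else {
                    word.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // end of input: return the last word if there is one
        return word.length() > 0 ? word.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        // For a real file this would be new BufferedReader(new FileReader("input.txt")).
        try (Reader in = new BufferedReader(new StringReader("anarchism originated  as a term"))) {
            String word;
            while ((word = nextWord(in)) != null) {
                System.out.println(word);
            }
        }
    }
}
```

Wrapping the FileReader in a BufferedReader matters for a 100 MB file, because read() is then served from an in-memory buffer instead of hitting the file for every character.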
answered Nov 4, 2015 at 10:28
MiKE
Use the next method of java.util.Scanner.
The next method finds and returns the next complete token from this scanner. A complete token is preceded and followed by input that matches the delimiter pattern. This method may block while waiting for input to scan, even if a previous invocation of Scanner.hasNext returned true.
Example:
public static void main(String[] args) {
    Scanner sc = new Scanner(System.in);
    String a = sc.next();
    String b = sc.next();
    System.out.println("First Word: " + a);
    System.out.println("Second Word: " + b);
    sc.close();
}
Input :
Hello Stackoverflow
Output :
First Word: Hello
Second Word: Stackoverflow
In your case, use a Scanner to read the file and then call scannerobject.next() to read each token (word).
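Applied to the original question (100 words per iteration), the Scanner approach might look like the sketch below. The class BatchWordReader and the method nextBatch are hypothetical names, and a StringReader replaces the corpus file so the example runs on its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class BatchWordReader {
    // Hypothetical helper: pulls at most batchSize tokens from the scanner.
    // An empty list means the input is exhausted.
    static List<String> nextBatch(Scanner scanner, int batchSize) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < batchSize && scanner.hasNext()) {
            batch.add(scanner.next());
        }
        return batch;
    }

    public static void main(String[] args) {
        // For the corpus, new Scanner(new File("text8")) would be used instead
        // (the path is an assumption); the batch size would then be 100.
        try (Scanner scanner = new Scanner(new StringReader("a b c d e f g"))) {
            List<String> batch;
            while (!(batch = nextBatch(scanner, 3)).isEmpty()) {
                System.out.println(batch); // process one batch, then let it go
            }
        }
    }
}
```

Only one batch is held in memory at a time, which is the point for a file whose single line is too large to load with readLine.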
answered Nov 4, 2015 at 10:41
rhitz
try (FileInputStream fis = new FileInputStream("Example.docx")) {
    ZipSecureFile.setMinInflateRatio(0.009);
    XWPFDocument file = new XWPFDocument(OPCPackage.open(fis));
    // ext was never declared in the original snippet
    XWPFWordExtractor ext = new XWPFWordExtractor(file);
    Scanner scanner = new Scanner(ext.getText());
    while (scanner.hasNextLine()) {
        // split each extracted line on single spaces
        String[] value = scanner.nextLine().split(" ");
        for (String v : value) {
            System.out.println(v);
        }
    }
} catch (Exception e) {
    System.out.println(e);
}
answered Jan 9, 2020 at 14:12
I need to write a program that reads the text of a text file and returns the number of times a specific word shows up. I can't figure out how to write a program that reads the text file word by word; I've only managed to write one that reads line by line. How would I make it read word by word?
import java.io.*;
import java.util.Scanner;

public class main
{
    public static void main (String [] args)
    {
        Scanner scan = new Scanner(System.in);
        System.out.println("Name of file: ");
        String filename = scan.nextLine();
        int count = 0;
        try
        {
            FileReader file = new FileReader(filename);
            BufferedReader reader = new BufferedReader(file);
            String a = "";
            int linecount = 0;
            String line;
            System.out.println("What word are you looking for: ");
            String a1 = scan.nextLine();
            while ((line = reader.readLine()) != null)
            {
                linecount++;
                if (line.equalsIgnoreCase("that"));
                count++;
            }
            reader.close();
        }
        catch (IOException e)
        {
            System.out.println("File Not found");
        }
        System.out.println("the word that appears " + count + " many times");
    }
}
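Two things stand out in the code above: the if statement ends with a semicolon, so count++ runs once per line regardless of the comparison, and the comparison is against the whole line rather than a single word. A minimal word-by-word sketch, assuming whitespace-separated tokens and using the made-up names WordCounter and countWord, could look like this:

```java
import java.util.Scanner;

class WordCounter {
    // Counts tokens equal to target, ignoring case. Tokens are split on
    // whitespace only, so attached punctuation ("that,") is not stripped here.
    static int countWord(Scanner scanner, String target) {
        int count = 0;
        while (scanner.hasNext()) {
            if (scanner.next().equalsIgnoreCase(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A literal string stands in for new Scanner(new File(filename)).
        try (Scanner scanner = new Scanner("That is the word that we count")) {
            System.out.println(countWord(scanner, "that")); // prints 2
        }
    }
}
```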
Gary kwlai
Greenhorn
Posts: 12
posted 14 years ago
Hi, a beginner question.
I have a file that does not have a .txt extension, but it does contain text and numbers.
I would like to know how to read a file line by line, store each word or number in an ArrayList, and then output them to a new file.
E.g. my text file is called colorsANDnumbers.data:
Red 2 Blue 3 Yellow 4 Green 5
2 Red 3 Blue 4 Yellow 5 Green
Can that be done with just one ArrayList?
regards
Gaz
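One way the task described above could be approached is sketched here, purely as an illustration. The names TokenCopier, readTokens and writeTokens are invented for the example, the sample data is copied from the post, and for real files the Scanner would wrap new File("colorsANDnumbers.data") while the PrintWriter would be opened on an output file name of your choosing.

```java
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class TokenCopier {
    // Collects every whitespace-separated token (word or number) into one list.
    static List<String> readTokens(Scanner scanner) {
        List<String> tokens = new ArrayList<>();
        while (scanner.hasNext()) {
            tokens.add(scanner.next());
        }
        return tokens;
    }

    // Writes one token per line to the given writer.
    static void writeTokens(List<String> tokens, PrintWriter out) {
        for (String token : tokens) {
            out.println(token);
        }
    }

    public static void main(String[] args) {
        String sample = "Red 2 Blue 3 Yellow 4 Green 5\n2 Red 3 Blue 4 Yellow 5 Green";
        List<String> tokens;
        try (Scanner in = new Scanner(sample)) {
            tokens = readTokens(in);
        }
        // System.out stands in for the new file in this sketch.
        PrintWriter out = new PrintWriter(System.out);
        writeTokens(tokens, out);
        out.flush();
    }
}
```

Everything ends up in a single List<String>, numbers included; whether a different collection suits colours and numbers better is a separate design question.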
Bijj shar
Greenhorn
Posts: 13
posted 14 years ago
[edit]Add code tags. CR[/edit]
Campbell Ritchie
Marshal
Posts: 77646
posted 14 years ago
Please use the CODE button; I have edited that post so you can see how much better it looks.
Please don't simply give out code like that. Since it is pretty standard code, which could have been copied from the Java Tutorials, I think I shall let it stand. But look at the Beginners' Forum contents page, where we explain that people learn a lot better if they work things out for themselves.
It doesn't actually work in its present condition, and I can see a potentially serious error, which I shall let you find for yourself. I shall also leave you to work out what people would do in Java5 or Java6.
*************************************************************************************************
Yes, you can put those entries into a single List<String>, but is that really appropriate? I suggest you go through the different interfaces in the Collections Framework and you might find something more appropriate for keeping colours and numbers.
Gary kwlai
Greenhorn
Posts: 12
posted 14 years ago
Impressive
There are a few things I don't quite understand in the code: what are lines 21 and 34 actually doing? I have not covered InputStreamReader and Iterator yet.
Also, almost all code these days has try and catch in it. Are those required? Do they prevent the program from crashing or halting when there is an error?
regards
Gaz
Bijj shar
Greenhorn
Posts: 13
posted 14 years ago
Ritchie-
Thanks for letting me know about the Code button. What error are you seeing in its present condition? Please explain. The user asked about reading and writing data in a file, and he is reading data from an existing file, so why are you giving him suggestions that go beyond that?
Gregg Bolinger
Ranch Hand
Posts: 15304
posted 14 years ago
Gary Lai wrote:Also almost every codes thesedays has Try and Catch in them… are those required? does it prevent the program from crashing or halt when there is an error?
regards
If the API throws a checked exception (one that inherits from java.lang.Exception but not from RuntimeException), the compiler forces you to either surround the call with a try/catch block or declare the exception in your method's throws clause. This lets you catch the exceptions that are thrown and deal with them. Some APIs throw RuntimeExceptions, which don't require try/catch blocks, but if one is thrown and nothing catches it, the thread terminates, and in a simple single-threaded program the application just dies.
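The difference can be seen in a small sketch. The class ExceptionDemo and its two methods are hypothetical names chosen for this example; the exceptions themselves (FileNotFoundException, IOException, NumberFormatException) are standard library types.

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

class ExceptionDemo {
    // FileNotFoundException and IOException are checked: without the catch
    // blocks (or a throws clause) this method would not compile.
    static boolean canOpen(String path) {
        try (FileReader reader = new FileReader(path)) {
            return true;
        } catch (FileNotFoundException e) {
            return false; // the file is missing or cannot be opened
        } catch (IOException e) {
            return false; // closing the reader failed
        }
    }

    // NumberFormatException is a RuntimeException: the compiler does not
    // require the try/catch, but without it bad input would propagate upward.
    static int parseOrDefault(String text, int fallback) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(canOpen("no-such-file.txt"));
        System.out.println(parseOrDefault("abc", -1));
    }
}
```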
Campbell Ritchie
Marshal
Posts: 77646
posted 14 years ago
You are using the wrong classes for reading; you ought to use FileReader and BufferedReader because it is a text file. DataInputStreams are not designed for text files.
You are opening several Readers; I may be mistaken, but are you actually closing them? If you leave the Reader open, you may suffer a memory leak. That was what worried me. Anyway, when I tried your code, I couldn’t get it to work; I got what appears to be a FileNotFoundException.
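I would simply use the Scanner and Formatter classes for text files; they are much easier to use. Since they "consume" the IOExceptions raised while reading or writing, you need far less try-catch, although the constructors that open a File still throw a checked FileNotFoundException.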
Liron Meir
Greenhorn
Posts: 2
posted 11 years ago
Hi, this is not reading word by word. This is how it's done:
Scanner input = new Scanner(new File("liron.txt"));
while (input.hasNext()) {
    String word = input.next();
}
fred rosenberger
lowercase baba
Posts: 13086
posted 11 years ago
Liron Meir wrote:Hi this is not reading word by word. This is how it’s done:
Given that the question, and the last reply, was almost three years ago, I doubt the original poster is still waiting for an answer, or is terribly worried about it anymore.
There are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors
Liron Meir
Greenhorn
Posts: 2
posted 11 years ago
Yes, but if someone is looking for a solution to read word by word, this is not it.
Campbell Ritchie
Marshal
Posts: 77646
posted 11 years ago
Welcome to the Ranch
That is what I was hinting at when I mentioned Scanner. We prefer not to give the full solution and it says the following on this forum’s title page:
We’re all here to learn, so when responding to others, please focus on helping them discover their own solutions, instead of simply providing answers.
-
October 21st, 2011, 10:28 AM
#1
Junior Member
Reading a text file word by word
Hi all, new to Java programming. Got into it because a friend has also been doing it, but he's been doing it for a while, so he is really good. Anyway.
So what I want to do is read a specific .txt file, then go through it and split it into words, if that makes sense.
What I want the program to do is go through a file and identify and count how many palindromes are inside it. I have been able to do the palindrome part, but I am confused about how to go about reading the text word by word and then testing each word.
Any help will be appreciated. I'm not in a rush; this is just a hobby that I am really enjoying.
October 21st, 2011, 11:00 AM
#2
Re: Reading a text file word by word
Here is the first hit after googling "java file io": Lesson: Basic I/O (The Java™ Tutorials > Essential Classes)
But this sounds like a job for the Scanner class. The API is your new best friend: Java Platform SE 6
-
October 21st, 2011, 12:23 PM
#3
Junior Member
Re: Reading a text file word by word
OK, hi again. I ended up figuring out how to do it, but I have a new problem. When I run the program, I want it to find only single-word palindromes. At the moment it is also picking up multiple-word palindromes. So, for example, the text file contains "avid diva", and it reports that each of these is a palindrome. I want it to report only one-word palindromes, for example "otto". Any help is again appreciated. Here is what I currently have:
import java.io.*;

public class PalindromDetector {
    public static void main(String[] args) {
        try {
            FileInputStream fstream = new FileInputStream("C:/Test1.txt");
            DataInputStream in = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String strLine = null;
            while ((strLine = br.readLine()) != null) {
                String reverse = new StringBuffer(strLine).reverse().toString();
                int i, j, counter = 0;
                String m[] = strLine.split(" ");
                String[] word = reverse.split(" ");
                System.out.println("The palindrome words are:");
                for (i = 0; i < m.length; i++) {
                    for (j = word.length - 1; j >= 0; j--) {
                        if (m[i].equalsIgnoreCase(word[j])) {
                            System.out.println(m[i]);
                            counter++;
                            break;
                        }
                    }
                }
                System.out.println("Number of palindromes:" + counter);
            }
        } catch (IOException e) {
        }
    }
}
Dunno what I have done. Obviously something silly. I'm sure what I have done is harder than what I was supposed to do.
-
October 21st, 2011, 02:06 PM
#4
Re: Reading a text file word by word
Instead of taking the file one line at a time, why don’t you take it one word at a time, since that’s what you really care about?
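The posted code reverses the whole line and compares every word with every reversed word, which is why "avid" matches the reversed "diva". Taking the file one word at a time avoids that, because each word is only ever compared with its own reverse. A minimal sketch follows, with the invented names PalindromeCounter, isPalindrome and countPalindromes, and a literal string standing in for the file:

```java
import java.util.Scanner;

class PalindromeCounter {
    // A word is a palindrome if it equals its own reverse (case-insensitive).
    static boolean isPalindrome(String word) {
        String reversed = new StringBuilder(word).reverse().toString();
        return word.equalsIgnoreCase(reversed);
    }

    // Reads whitespace-separated words one at a time and counts palindromes.
    static int countPalindromes(Scanner scanner) {
        int counter = 0;
        while (scanner.hasNext()) {
            if (isPalindrome(scanner.next())) {
                counter++;
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        // For the real file: new Scanner(new File("C:/Test1.txt")).
        try (Scanner scanner = new Scanner("avid diva otto level")) {
            System.out.println("Number of palindromes:" + countPalindromes(scanner));
        }
    }
}
```

Note that single-letter words count as palindromes in this sketch; whether that is wanted is left open.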
What I'm looking to build is a counter that adds up the ASCII values of the characters in each word. For example: Hola (H = 45, o = 56, l = -45, a = 23. Total value = 79).
This is my counter's code:
public class Contador {
    public static final char espacio = ' ';
    char[] pal; // array holding the characters of the word
    Paraula p = new Paraula();

    public int ContarPes() {
        String S; // string read from the keyboard
        S = new LT().llegirLinia(); // read the whole line
        pal = S.toCharArray(); // convert the string S to an array
        int PesPal = 0; // counter set to 0; it holds the weight of the word
        for (int i = 0; i < pal.length; i++) {
            PesPal = PesPal + pal[i];
        }
        return PesPal;
    }
}
My problem is knowing, once one word has been read, how to move on to the next one. In the end I want to print the sum of each word to the screen.
EDIT: I can only use String for input and output.
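One possible way to move from word to word is to walk the characters of the line and close off the running sum whenever a space is met. The sketch below makes several assumptions: the names WordWeights and weights are invented, words are separated by the space character only, and the sums are collected in a List<Integer> purely so the result is easy to inspect (printing each sum directly would avoid the list).

```java
import java.util.ArrayList;
import java.util.List;

class WordWeights {
    // Returns the sum of character codes for each space-separated word in line.
    static List<Integer> weights(String line) {
        List<Integer> result = new ArrayList<>();
        int sum = 0;
        boolean inWord = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ') {
                if (inWord) {          // a space closes the current word
                    result.add(sum);
                    sum = 0;
                    inWord = false;
                }
            } else {
                sum += c;              // add the character's numeric code
                inWord = true;
            }
        }
        if (inWord) {
            result.add(sum);           // last word has no trailing space
        }
        return result;
    }

    public static void main(String[] args) {
        for (int weight : weights("Hola mundo")) {
            System.out.println(weight);
        }
    }
}
```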
In this article we will discuss ways to read Word documents in Java using the Apache POI library. A Word document may contain images, tables or plain text, and a standard Word file also has headers and footers. In the following examples we will parse a Word document by reading its paragraphs, runs, images and tables along with its headers and footers. We will also look at identifying the styles associated with the paragraphs, such as font size, font family and font color.
Maven Dependencies
The following POI Maven dependency is required to read Word documents. The latest artifact versions are listed on Maven Central.
pom.xml
<dependencies>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.16</version>
    </dependency>
</dependencies>
Reading Complete Text from Word Document
The class XWPFDocument represents a .docx file and gives access to its contents. To read all the text in a .docx Word document, wrap it in an XWPFWordExtractor and call getText(). Following is an example.
TextReader.java
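import java.io.FileInputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class TextReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
            // print the complete text of the document
            System.out.println(extractor.getText());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}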
Reading Headers and Footers of Word Document
Apache POI provides built-in methods to read the headers and footers of a Word document. Following is an example that reads and prints the header and footer of a Word document. The example .docx file is available in the source, which can be downloaded at the end of this article.
HeaderFooterReader.java
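import java.io.FileInputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFFooter;
import org.apache.poi.xwpf.usermodel.XWPFHeader;

public class HeaderFooterReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);
            XWPFHeader header = policy.getDefaultHeader();
            if (header != null) {
                System.out.println(header.getText());
            }
            XWPFFooter footer = policy.getDefaultFooter();
            if (footer != null) {
                System.out.println(footer.getText());
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}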
Output
This is Header
This is footer
Read Each Paragraph of a Word Document
Among the many methods defined in the XWPFDocument class, we can use getParagraphs() to read a .docx Word document paragraph by paragraph. This method returns a list of all the paragraphs (XWPFParagraph) of a Word document. XWPFParagraph in turn has many utility methods for extracting information about a paragraph, such as its text alignment and the style associated with it.
To give more control over reading the text of a Word document, each paragraph is further divided into multiple runs. A run defines a region of text with a common set of properties. Following is an example that reads paragraphs from a .docx Word document.
ParagraphReader.java
import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;

public class ParagraphReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            // the generic type is needed for the enhanced for loop to compile
            List<XWPFParagraph> paragraphList = xdoc.getParagraphs();
            for (XWPFParagraph paragraph : paragraphList) {
                System.out.println(paragraph.getText());
                System.out.println(paragraph.getAlignment());
                System.out.print(paragraph.getRuns().size());
                System.out.println(paragraph.getStyle());
                // Returns numbering format for this paragraph, eg bullet or lowerLetter.
                System.out.println(paragraph.getNumFmt());
                System.out.println(paragraph.getAlignment());
                System.out.println(paragraph.isWordWrapped());
                System.out.println("********************************************************************");
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Reading Tables from Word Document
Following is an example that reads the tables present in a Word document. It prints all the text row by row.
TableReader.java
import java.io.FileInputStream;
import java.util.Iterator;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;

public class TableReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            Iterator<IBodyElement> bodyElementIterator = xdoc.getBodyElementsIterator();
            while (bodyElementIterator.hasNext()) {
                IBodyElement element = bodyElementIterator.next();
                if ("TABLE".equalsIgnoreCase(element.getElementType().name())) {
                    List<XWPFTable> tableList = element.getBody().getTables();
                    for (XWPFTable table : tableList) {
                        System.out.println("Total Number of Rows of Table:" + table.getNumberOfRows());
                        // print every cell, row by row
                        for (int i = 0; i < table.getRows().size(); i++) {
                            for (int j = 0; j < table.getRow(i).getTableCells().size(); j++) {
                                System.out.println(table.getRow(i).getCell(j).getText());
                            }
                        }
                    }
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Reading Styles from Word Document
Styles are associated with the runs of a paragraph. There are many methods available in the XWPFRun class to identify the styles associated with the text. There are methods to identify boldness, highlighted words, capitalized words etc.
StyleReader.java
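import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class StyleReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            List<XWPFParagraph> paragraphList = xdoc.getParagraphs();
            for (XWPFParagraph paragraph : paragraphList) {
                // each run carries its own style properties
                for (XWPFRun rn : paragraph.getRuns()) {
                    System.out.println(rn.isBold());
                    System.out.println(rn.isHighlighted());
                    System.out.println(rn.isCapitalized());
                    System.out.println(rn.getFontSize());
                }
                System.out.println("********************************************************************");
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}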
Reading Image from Word Document
Following is an example that reads image files from a Word document.
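import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;

public class ImageReader {
    public static void main(String[] args) {
        try {
            FileInputStream fis = new FileInputStream("test.docx");
            XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
            List<XWPFPictureData> pic = xdoc.getAllPictures();
            if (!pic.isEmpty()) {
                // type and raw bytes of the first picture only
                System.out.print(pic.get(0).getPictureType());
                System.out.print(pic.get(0).getData());
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}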
Conclusion
I hope this article gave you what you were looking for. If you have anything that you want to add or share, please share it below in the comment section.