Java how to read word by word

I’d like to read the «text8» corpus in Java and reformat some words. The problem is that all words in this 100MB corpus are on one line. If I try to load it with BufferedReader and readLine, it takes up too much memory at once and cannot split all the words into one list/array.

So my question: is it possible in Java to read a corpus word by word instead of line by line? For example, since all words are on one line, could I read 100 words per iteration?

asked Nov 4, 2015 at 10:18

Rainflow

You can try using Scanner and set the delimiter to whatever suits you:

// myFile is a java.io.File; this constructor throws FileNotFoundException
Scanner input = new Scanner(myFile);
input.useDelimiter(" +"); // delimiter is one or more spaces

while (input.hasNext()) {
  System.out.println(input.next());
}
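To get the "100 words per iteration" behaviour the question asks about, the same Scanner loop can be grouped into batches. The following is only a sketch: the class name WordBatches, the helper nextBatch and the sample sentence are made up for illustration, and a StringReader stands in for the corpus file.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WordBatches {
    /** Reads up to batchSize whitespace-separated words from the scanner. */
    static List<String> nextBatch(Scanner in, int batchSize) {
        List<String> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && in.hasNext()) {
            batch.add(in.next());
        }
        return batch;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the corpus; for a real file one could
        // pass a java.io.File or a BufferedReader to the Scanner instead.
        Reader source = new StringReader("anarchism originated as a term of abuse");
        try (Scanner in = new Scanner(source)) {
            List<String> batch;
            // An empty batch means the input is exhausted.
            while (!(batch = nextBatch(in, 3)).isEmpty()) {
                System.out.println(batch);
            }
        }
    }
}
```

Only one batch is held in memory at a time, so the batch size (3 here, 100 in the question) bounds how many words are kept, not the length of the line.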

answered Nov 4, 2015 at 10:32

nafas

I would suggest using a «character stream» with FileReader.

Here is the example code from http://www.tutorialspoint.com/java/java_files_io.htm

import java.io.*;

public class CopyFile {
   public static void main(String args[]) throws IOException
   {
      FileReader in = null;
      FileWriter out = null;

      try {
         in = new FileReader("input.txt");
         out = new FileWriter("output.txt");

         int c;
         while ((c = in.read()) != -1) {
            out.write(c);
         }
      }finally {
         if (in != null) {
            in.close();
         }
         if (out != null) {
            out.close();
         }
      }
   }
}

It reads 16-bit Unicode characters one at a time, so it doesn’t matter if your text is all on one line.

Since you’re trying to go word by word, you can simply read until you hit a space, and there’s your word.
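The "read until you hit a space" idea could look roughly like the sketch below. The class name CharWordReader and the helper nextWord are hypothetical, and a StringReader replaces the file so the example is self-contained.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CharWordReader {
    /** Returns the next whitespace-delimited word, or null at end of input. */
    static String nextWord(Reader in) throws IOException {
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (word.length() > 0) {
                    return word.toString(); // a separator ends the current word
                }
                // otherwise skip leading or repeated whitespace
            } else {
                word.append((char) c);
            }
        }
        // End of input: return the last word if one was started.
        return word.length() > 0 ? word.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        // For a real file: new BufferedReader(new FileReader("input.txt"))
        try (Reader in = new StringReader("one  two three")) {
            String w;
            while ((w = nextWord(in)) != null) {
                System.out.println(w);
            }
        }
    }
}
```

Only the current word is buffered, so memory use does not depend on how long the single line is.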

answered Nov 4, 2015 at 10:28

MiKE

Use the next method of java.util.Scanner

The next method finds and returns the next complete token from this scanner. A
complete token is preceded and followed by input that matches the
delimiter pattern. This method may block while waiting for input to
scan, even if a previous invocation of Scanner.hasNext returned true.

Example:

public static void main(String[] args) {
    Scanner sc = new Scanner(System.in);
    String a = sc.next();
    String b = sc.next();
    System.out.println("First Word: " + a);
    System.out.println("Second Word: " + b);
    sc.close();
}

Input :

Hello Stackoverflow

Output :

First Word: Hello

Second Word: Stackoverflow

In your case, create the Scanner on the file and then call its next() method to read each token (word).

answered Nov 4, 2015 at 10:41

rhitz

    // Requires Apache POI (poi-ooxml) on the classpath.
    try (FileInputStream fis = new FileInputStream("Example.docx")) {
        ZipSecureFile.setMinInflateRatio(0.009);
        XWPFDocument file = new XWPFDocument(OPCPackage.open(fis));
        XWPFWordExtractor ext = new XWPFWordExtractor(file);
        Scanner scanner = new Scanner(ext.getText());
        while (scanner.hasNextLine()) {
            // split on runs of whitespace so repeated spaces do not yield empty words
            String[] value = scanner.nextLine().trim().split("\\s+");
            for (String v : value) {
                System.out.println(v);
            }
        }
    } catch (Exception e) {
        System.out.println(e);
    }

answered Jan 9, 2020 at 14:12

Rajan Kesharwani

I need to write a program that reads the text of a text file and returns the number of times a specific word shows up. I can’t figure out how to write a program that reads the text file word by word; I’ve only managed to write one that reads line by line. How would I make it read word by word?

import java.io.*;
import java.util.Scanner;


public class main 
{
	public static void main (String [] args) 
	{		
		Scanner scan = new Scanner(System.in);
		System.out.println("Name of file: ");		
		String filename= scan.nextLine();
	int count =0; 

	try
	{		
		FileReader file = new FileReader(filename);			
		BufferedReader reader = new BufferedReader(file);			
		String a = "";			
		int linecount = 0;						
		String line;			
		System.out.println("What word are you looking for: ");			
		String a1 = scan.nextLine();						
	
		while((line = reader.readLine()) != null)			
		{				
			linecount++;								
			if(line.equalsIgnoreCase("that"));
					count++;			
		}								
		reader.close();		
	}
	catch(IOException e)
	{
		System.out.println("File Not found");
	}
	System.out.println("the word that appears " + count + " many times");
}
}
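Two details in the code above are worth noting: the `if` ends with a semicolon, so `count++` runs for every line, and the comparison is between a whole line and the literal "that" rather than the word that was typed in. A possible word-by-word version is sketched below; the class name WordCounter and the helper count are invented for this example, and a String stands in for the file.

```java
import java.util.Scanner;

class WordCounter {
    /** Counts tokens equal to target, ignoring case. */
    static int count(Scanner in, String target) {
        int count = 0;
        while (in.hasNext()) {
            // next() returns one whitespace-delimited token at a time
            if (in.next().equalsIgnoreCase(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // A String source keeps the sketch self-contained; new Scanner(new File(filename))
        // would read from a file instead (that constructor throws FileNotFoundException).
        try (Scanner in = new Scanner("That is that and THAT is all")) {
            System.out.println(count(in, "that"));
        }
    }
}
```

With the default delimiter, punctuation stays attached to a token, so "that," would not match "that" unless the delimiter is changed or the token is cleaned first.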

Gary kwlai

Greenhorn

Posts: 12


posted 14 years ago


Hi, a beginner question.

I have a file that does not have a .txt extension, but it does contain text and numbers.

I would like to know how to read a file line by line, store each word or number in an ArrayList, and then output them to a new file.

e.g. my text file is called colorsANDnumbers.data

Red 2 Blue 3 Yellow 4 Green 5

2 Red 3 Blue 4 Yellow 5 Green

Is it possible to do that with just one ArrayList?

regards

Gaz

Bijj shar

Greenhorn

Posts: 13


posted 14 years ago



[edit]Add code tags. CR[/edit]

Campbell Ritchie

Marshal

Posts: 77646


posted 14 years ago


Please use the CODE button; I have edited that post so you can see how much better it looks.

Please don’t simply give out code like that. Since it is pretty standard code, which could have been copied from the Java Tutorials, I think I shall let it stand. But look at the Beginners’ Forum contents page, where we explain that people learn a lot better if they work things out for themselves.

It doesn’t actually work in its present condition, and I can see a potentially serious error, which I shall let you find for yourself. I shall also leave you to work out what people would do in Java5 or Java6.

*************************************************************************************************

Yes, you can put those entries into a single List<String>, but is that really appropriate? I suggest you go through the different interfaces in the Collections Framework and you might find something more appropriate for keeping colours and numbers.
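For reference, the single List<String> variant could be sketched as follows. This is not the code from the edited post; TokenCopy and readTokens are made-up names, the temporary files are only there to keep the example self-contained, and whether one flat list is the right structure is left open, as noted above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class TokenCopy {
    /** Collects every whitespace-separated token (word or number), in order. */
    static List<String> readTokens(Scanner in) {
        List<String> tokens = new ArrayList<>();
        while (in.hasNext()) {
            tokens.add(in.next());
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Create a sample input file with the two lines from the question.
        Path input = Files.createTempFile("colorsANDnumbers", ".data");
        Files.write(input, Arrays.asList(
                "Red 2 Blue 3 Yellow 4 Green 5",
                "2 Red 3 Blue 4 Yellow 5 Green"));

        List<String> tokens;
        try (Scanner in = new Scanner(input)) { // Scanner(Path) throws IOException
            tokens = readTokens(in);
        }

        // Write one token per line to a new file.
        Path output = Files.createTempFile("tokens", ".txt");
        Files.write(output, tokens);
        System.out.println(tokens.size() + " tokens written to " + output);
    }
}
```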

Gary kwlai

Greenhorn

Posts: 12


posted 14 years ago


Impressive

There are a few things in the code I do not quite understand: what are lines 21 and 34 actually doing? I have not covered InputStreamReader and Iterator yet.

Also, almost all code these days has try and catch in it… are those required? Do they prevent the program from crashing or halting when there is an error?

regards

Gaz

Bijj shar

Greenhorn

Posts: 13


posted 14 years ago


Ritchie-

Thanks for letting me know about the Code button. Please explain what error you are seeing in its present condition. The user asked about reading and writing data in a file, and he is reading data from an existing file, so why are you giving him suggestions that go beyond the question?

Gregg Bolinger

Ranch Hand

Posts: 15304



posted 14 years ago


Gary Lai wrote: Also, almost all code these days has try and catch in it… are those required? Do they prevent the program from crashing or halting when there is an error?

If the API throws a checked exception (one that inherits from java.lang.Exception but not from RuntimeException), the compiler will force you to either surround the code with a try/catch block or declare the exception in a throws clause. This allows you to catch any exceptions that are thrown and deal with them. Some APIs throw RuntimeExceptions, which don’t require try/catch blocks, but if one is thrown and nothing catches it, the application will just die.
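The difference between the two kinds of exception can be seen in a small sketch. The class and method names (ExceptionKinds, canOpen, parse) and the file name are placeholders chosen for this example.

```java
import java.io.FileReader;
import java.io.IOException;

class ExceptionKinds {
    /**
     * FileReader's constructor declares FileNotFoundException, a checked
     * exception, so this does not compile without the catch (or a throws clause).
     */
    static boolean canOpen(String name) {
        try (FileReader reader = new FileReader(name)) {
            return true;
        } catch (IOException e) { // FileNotFoundException is a subclass of IOException
            return false;
        }
    }

    /** NumberFormatException is a RuntimeException: no try/catch is needed to compile. */
    static int parse(String text) {
        return Integer.parseInt(text);
    }

    public static void main(String[] args) {
        System.out.println(canOpen("no-such-directory/no-such-file.txt"));
        try {
            parse("Red");
        } catch (NumberFormatException e) {
            // Catching is optional here; without it the exception would end main.
            System.out.println("not a number: " + e.getMessage());
        }
    }
}
```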

Campbell Ritchie

Marshal

Posts: 77646


posted 14 years ago


You are using the wrong classes for reading; you ought to use FileReader and BufferedReader because it is a text file. DataInputStreams are not designed for text files.

You are opening several Readers; I may be mistaken, but are you actually closing them? If you leave a Reader open, you may suffer a resource leak (the underlying file handle stays open). That was what worried me. Anyway, when I tried your code, I couldn’t get it to work; I got what appears to be a FileNotFoundException.

I would simply use the Scanner and Formatter classes for text files; they are much easier to use. Since they «consume» the IOExceptions raised while reading or writing, you can get away without a try-catch around those calls, although the constructors that take a File still declare FileNotFoundException.
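A rough sketch of the Scanner and Formatter approach follows, assuming Java 7 or later for try-with-resources. The names ScannerFormatterCopy and copyWords are hypothetical. A swallowed read error can still be inspected afterwards through Scanner's ioException() method.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Formatter;
import java.util.Scanner;

class ScannerFormatterCopy {
    /** Copies each word of in to out, one per line, and returns the word count. */
    static int copyWords(File in, File out) throws FileNotFoundException {
        int words = 0;
        // Both constructors declare FileNotFoundException; both resources are
        // closed automatically at the end of the try block.
        try (Scanner scanner = new Scanner(in); Formatter formatter = new Formatter(out)) {
            while (scanner.hasNext()) {
                formatter.format("%s%n", scanner.next());
                words++;
            }
            // next() does not throw IOException; the last one, if any, is kept here.
            IOException hidden = scanner.ioException();
            if (hidden != null) {
                System.err.println("read problem: " + hidden);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        File in = File.createTempFile("words", ".txt");
        File out = File.createTempFile("copy", ".txt");
        try (Formatter f = new Formatter(in)) {
            f.format("Red 2 Blue 3%n");
        }
        System.out.println(copyWords(in, out) + " words copied to " + out);
    }
}
```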

Liron Meir

Greenhorn

Posts: 2


posted 11 years ago



Hi, this is not reading word by word. This is how it’s done:

Scanner input = new Scanner(new File("liron.txt"));

while (input.hasNext()) {
    String word = input.next();
}

fred rosenberger

lowercase baba

Posts: 13086



posted 11 years ago


Liron Meir wrote:Hi this is not reading word by word. This is how it’s done:

Given that the question, and the last reply, was almost three years ago, I doubt the original poster is still waiting for an answer, or is terribly worried about it anymore.


Liron Meir

Greenhorn

Posts: 2


posted 11 years ago



Yes, but if someone is looking for a solution to read word by word, this is not it.

Campbell Ritchie

Marshal

Posts: 77646


posted 11 years ago


Welcome to the Ranch

That is what I was hinting at when I mentioned Scanner. We prefer not to give the full solution and it says the following on this forum’s title page:

We’re all here to learn, so when responding to others, please focus on helping them discover their own solutions, instead of simply providing answers.

  1. October 21st, 2011, 10:28 AM


    #1

    dylanka


    Junior Member


    Reading a text file word by word

    Hi all, I’m new to Java programming. I got into it because a friend has been doing it too, but he’s been at it for a while, so he is really good. Anyway.

    So what I want to do is read a specific .txt file, then go through it and split it into words, if that makes sense.

    What I want the program to do is go through a file and identify and count how many palindromes are inside it. I have been able to do the palindrome part, but I am confused about how to read the text word for word and then test each word.

    Any help will be appreciated. I’m not in a rush; this is just a hobby that I am really enjoying.




  3. October 21st, 2011, 11:00 AM


    #2

    Re: Reading a text file word by word

    Here is the first hit after googling «java file io»: Lesson: Basic I/O (The Java™ Tutorials > Essential Classes)

    But this sounds like a job for the Scanner class. The API is your new best friend: Java Platform SE 6


  4. October 21st, 2011, 12:23 PM


    #3

    dylanka


    Junior Member


    Re: Reading a text file word by word

    OK, hi again. I ended up figuring out how to do it, but I have a new problem. When I run the program, I want it to find only single-word palindromes. At the moment it is also picking up multiple-word palindromes. So a text file containing, for example, «avid diva» reports each of those words as a palindrome. I want it to report only one-word palindromes, for example «otto». Any help is again appreciated. Here is what I currently have:

    import java.io.*;
    public class PalindromDetector {
    	public static void main(String[] args) {
     
    		try { 
     
    			FileInputStream fstream = new FileInputStream("C:/Test1.txt");
     
    			DataInputStream in = new DataInputStream(fstream);
    			BufferedReader br = new BufferedReader(new InputStreamReader(in));
    			String strLine = null;
    			while ((strLine = br.readLine()) != null)   {
     
     
     
    				String reverse = new
    				StringBuffer(strLine).reverse().
    				toString();
    				int i,j,counter=0;
     
    				String m[]=strLine.split(" ");
    				String[] word=reverse.split(" ");
     
    				System.out.println("The palindrome words are:");
    				for(i=0;i<m.length;i++) {
    					for(j=word.length-1;j>=0;j--) {
    						if(m[i].equalsIgnoreCase(word[j])) {
    							System.out.println(m[i]);
    							counter++;
    							break;
    						}
     
    					}
    				}
    				System.out.println("Number of palindromes:"+counter);
    			}
    		}
    		catch(IOException e){}
     
    	}
    }

    Dunno what I have done. Obviously something silly. I’m sure what I have done is harder than what I am supposed to have done.


  5. October 21st, 2011, 02:06 PM


    #4

    Re: Reading a text file word by word

    Instead of taking the file one line at a time, why don’t you take it one word at a time, since that’s what you really care about?
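    Taking the file one word at a time might look like the sketch below. In the posted code the whole line is reversed and every word is compared against every word of the reversed line, which is why «avid» matches the reversed «diva»; here each word is compared only with its own reverse. The names PalindromeWords, isPalindrome and countPalindromes are invented for this example, and a String replaces the file.

```java
import java.util.Scanner;

class PalindromeWords {
    /** True when the word reads the same in both directions, ignoring case. */
    static boolean isPalindrome(String word) {
        String reversed = new StringBuilder(word).reverse().toString();
        return word.equalsIgnoreCase(reversed);
    }

    /** Counts single-word palindromes among the whitespace-separated tokens. */
    static int countPalindromes(Scanner in) {
        int counter = 0;
        while (in.hasNext()) {
            String word = in.next();
            if (isPalindrome(word)) {
                System.out.println(word);
                counter++;
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        // new Scanner(new File("C:/Test1.txt")) would read the file from the post.
        try (Scanner in = new Scanner("avid diva otto level")) {
            System.out.println("Number of palindromes:" + countPalindromes(in));
        }
    }
}
```

    Note that under this definition a one-letter word also counts as a palindrome; whether that is wanted is a separate decision.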


What I’m trying to build is a counter that adds up the value of each word in ASCII code. For example: Hola (H = 45, o = 56, l = -45, a = 23. Total value = 79).

This is my counter’s code:

public class Contador {

    public static final char espacio = ' ';
    char[] pal; // Define the word array
    Paraula p = new Paraula();

    public int ContarPes() {
        String S; // Create the String to read from the keyboard
        S = new LT().llegirLinia(); // Read the whole string
        pal = S.toCharArray(); // Convert String S to an array
        int PesPal = 0; // Set the counter to 0; it holds the weight of each word

        for (int i = 0; i < pal.length; i++) {
            PesPal = PesPal + pal[i];
        }
        return PesPal;
    }
}

My problem is knowing, once I have read one word, how to move on to the next one. At the end I want to print the sum of each word on screen.

EDIT: I can only use String for input and output.
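One way to move from word to word is to split the line on spaces and sum the character codes of each piece separately. This is a sketch only: WordWeights and weight are made-up names, a fixed String stands in for the line read from the keyboard, and it assumes that String.split is allowed under the "only String" restriction.

```java
class WordWeights {
    /** Sums the character codes of one word. */
    static int weight(String word) {
        int sum = 0;
        for (int i = 0; i < word.length(); i++) {
            sum += word.charAt(i); // a char widens to its numeric code
        }
        return sum;
    }

    public static void main(String[] args) {
        String line = "Hola mundo"; // stands in for the line read from the keyboard
        // " +" treats one or more spaces as a single separator between words
        for (String word : line.trim().split(" +")) {
            System.out.println(word + " = " + weight(word));
        }
    }
}
```

The sum in the original loop runs over the whole line, spaces included; splitting first gives one total per word.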

In this article we will discuss ways and techniques to read Word documents in Java using the Apache POI library. A Word document may contain images, tables or plain text, and a standard Word file has headers and footers too. In the following examples we will parse a Word document by reading its paragraphs, runs, images and tables, along with its headers and footers. We will also look at identifying the styles associated with paragraphs, such as font size, font family and font color.

Maven Dependencies

Following is the POI Maven dependency required to read Word documents. For the latest artifacts visit here

pom.xml

	<dependencies>
		<dependency>
			<groupId>org.apache.poi</groupId>
			<artifactId>poi-ooxml</artifactId>
			<version>3.16</version>
		</dependency>
	</dependencies>

Reading Complete Text from Word Document

The class XWPFDocument has many methods defined to read and extract .docx file contents. To read all the text in a .docx Word document, wrap the document in an XWPFWordExtractor and call its getText() method. Following is an example.

TextReader.java

public class TextReader {
	
	public static void main(String[] args) {
	 try {
		   FileInputStream fis = new FileInputStream("test.docx");
		   XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
		   XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
		   System.out.println(extractor.getText());
		} catch(Exception ex) {
		    ex.printStackTrace();
		}
 }

}
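Since getText() returns the document text as one String, reading it word by word comes down to splitting that String on whitespace. The sketch below uses only the standard library; the class name ExtractedWords, the helper words and the sample text are assumptions, with the sample text standing in for the value returned by extractor.getText().

```java
import java.util.ArrayList;
import java.util.List;

class ExtractedWords {
    /** Splits text, such as the String returned by getText(), into words. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        // \s+ matches runs of spaces, tabs and line breaks
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) { // a leading separator yields one empty token
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "This is Header\nFirst paragraph of the document.\n";
        for (String word : words(text)) {
            System.out.println(word);
        }
    }
}
```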

Reading Headers and Footers of Word Document

Apache POI provides built-in methods to read the headers and footers of a Word document. Following is an example that reads and prints the header and footer of a Word document. The example .docx file is available in the source, which can be downloaded at the end of this article.

HeaderFooterReader.java

public class HeaderFooterReader {

	public static void main(String[] args) {
		
		try {
			FileInputStream fis = new FileInputStream("test.docx");
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
			XWPFHeaderFooterPolicy policy = new XWPFHeaderFooterPolicy(xdoc);

			XWPFHeader header = policy.getDefaultHeader();
			if (header != null) {
				System.out.println(header.getText());
			}

			XWPFFooter footer = policy.getDefaultFooter();
			if (footer != null) {
				System.out.println(footer.getText());
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}

	}

}

Output

This is Header

This is footer


Read Each Paragraph of a Word Document

Among the many methods defined in the XWPFDocument class, we can use getParagraphs() to read a .docx Word document paragraph by paragraph. This method returns a list of all the paragraphs (XWPFParagraph) of a Word document. XWPFParagraph in turn has many utility methods for extracting information about a paragraph, such as its text alignment and the style associated with it.

To give more control over reading the text of a Word document, each paragraph is further divided into multiple runs. A run defines a region of text with a common set of properties. Following is an example that reads paragraphs from a .docx Word document.

ParagraphReader.java

public class ParagraphReader {

	public static void main(String[] args) {
		try {
			FileInputStream fis = new FileInputStream("test.docx");
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));

			List<XWPFParagraph> paragraphList = xdoc.getParagraphs();

			for (XWPFParagraph paragraph : paragraphList) {

				System.out.println(paragraph.getText());
				System.out.println(paragraph.getAlignment());
				System.out.print(paragraph.getRuns().size());
				System.out.println(paragraph.getStyle());

				// Returns numbering format for this paragraph, eg bullet or lowerLetter.
				System.out.println(paragraph.getNumFmt());
				System.out.println(paragraph.getAlignment());

				System.out.println(paragraph.isWordWrapped());

				System.out.println("********************************************************************");
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Reading Tables from Word Document

Following is an example that reads the tables present in a Word document. It prints all the cell text row by row.

TableReader.java

public class TableReader {

	public static void main(String[] args) {
		try {
			FileInputStream fis = new FileInputStream("test.docx");
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
			Iterator<IBodyElement> bodyElementIterator = xdoc.getBodyElementsIterator();
			while (bodyElementIterator.hasNext()) {
				IBodyElement element = bodyElementIterator.next();

				if ("TABLE".equalsIgnoreCase(element.getElementType().name())) {
					List<XWPFTable> tableList = element.getBody().getTables();
					for (XWPFTable table : tableList) {
						System.out.println("Total Number of Rows of Table:" + table.getNumberOfRows());
						for (int i = 0; i < table.getRows().size(); i++) {

							for (int j = 0; j < table.getRow(i).getTableCells().size(); j++) {
								System.out.println(table.getRow(i).getCell(j).getText());
							}
						}
					}
				}
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Reading Styles from Word Document

Styles are associated with the runs of a paragraph. The XWPFRun class has many methods for identifying the styles applied to the text, for example whether it is bold, highlighted or capitalized.

StyleReader.java

public class StyleReader {

	public static void main(String[] args) {
		try {
			FileInputStream fis = new FileInputStream("test.docx");
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));

			List<XWPFParagraph> paragraphList = xdoc.getParagraphs();

			for (XWPFParagraph paragraph : paragraphList) {

				for (XWPFRun rn : paragraph.getRuns()) {

					System.out.println(rn.isBold());
					System.out.println(rn.isHighlighted());
					System.out.println(rn.isCapitalized());
					System.out.println(rn.getFontSize());
				}

				System.out.println("********************************************************************");
			}
		} catch (Exception ex) {
			ex.printStackTrace();
		}

	}

}

Reading Image from Word Document

Following is an example to read image files from a word document.

public class ImageReader {

	public static void main(String[] args) {

		try {
			FileInputStream fis = new FileInputStream("test.docx");
			XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
			List<XWPFPictureData> pic = xdoc.getAllPictures();
			if (!pic.isEmpty()) {
				System.out.print(pic.get(0).getPictureType());
				System.out.print(pic.get(0).getData());
			}

		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}

}

Conclusion

I hope this article gave you what you were looking for. If you have anything you want to add or share, please share it in the comment section below.

Download source
