Python: iterate over each word in a string

I wanted to know how to iterate through a string word by word.

string = "this is a string"
for word in string:
    print (word)

The above gives an output:

t
h
i
s

i
s

a

s
t
r
i
n
g

But I am looking for the following output:

this
is
a
string

asked Aug 6, 2015 at 1:41 by m0bi5

When you do —

for word in string:

You are not iterating through the words in the string, you are iterating through the characters in the string. To iterate through the words, you would first need to split the string into words using str.split(), and then iterate through that. Example —

my_string = "this is a string"
for word in my_string.split():
    print (word)

Please note, str.split(), without passing any arguments, splits on all whitespace (spaces, multiple spaces, tabs, newlines, etc.).
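For example, a quick illustration of the default whitespace splitting versus an explicit single-space separator:

text = "this  is\ta\nstring"   # two spaces, a tab and a newline
print(text.split())            # ['this', 'is', 'a', 'string']
print(text.split(" "))         # ['this', '', 'is\ta\nstring'] - splitting on a single space keeps empty strings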

answered Aug 6, 2015 at 1:43 by Anand S Kumar


This is one way to do it:

string = "this is a string"
ssplit = string.split()
for word in ssplit:
    print (word)

Output:

this
is
a
string

answered Aug 6, 2015 at 1:45 by Joe T. Boka

for word in string.split():
    print(word)

answered Aug 6, 2015 at 1:50 by Connor


Using nltk.

from nltk.tokenize import sent_tokenize, word_tokenize
sentences = sent_tokenize("This is a string.")
words_in_each_sentence = [word_tokenize(sentence) for sentence in sentences]

You may use TweetTokenizer for parsing casual text with emoticons and such.
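For reference, a minimal sketch of how TweetTokenizer might be used (assuming nltk is installed; the sample text is made up):

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
print(tokenizer.tokenize("This is soooo cool :-) #python @friend"))
# e.g. ['This', 'is', 'soooo', 'cool', ':-)', '#python', '@friend']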

answered Oct 16, 2019 at 21:24 by noɥʇʎԀʎzɐɹƆ

One way to do this is using a dictionary. The problem with the code in the question is that it counts each letter in the string instead of each word. To solve this, first turn the string into a list of words using the split() method, then count each word as it appears. The code below returns the number of times each word appears in a string, in the form of a dictionary.

s = input('Enter a string to see if strings are repeated: ')
d = dict()
p = s.split()
for word in p:
    if word not in d:
        d[word] = 1
    else:
        d[word] += 1
print(d)

answered Nov 17, 2021 at 2:55 by The_Chosen_One69

s = 'hi how are you'
# the identity lambda is redundant here; s.split() alone returns the same list
l = list(map(lambda x: x, s.split()))
print(l)

Output: ['hi', 'how', 'are', 'you']

answered Dec 11, 2019 at 15:56 by Nanda Thota

You can try this method also:

sentence_1 = "This is a string"
words = sentence_1.split()
for i in words:
    print(i)

answered Aug 9, 2022 at 9:31 by Samartha Chakrawarti


Given a string comprising many words separated by spaces, write a Python program to iterate over the words of the string.

Examples:

Input: str = “GeeksforGeeks is a computer science portal for Geeks” 
Output: GeeksforGeeks is a computer science portal for Geeks 

Input: str = “Geeks for Geeks” 
Output: Geeks for Geeks

Method 1: Using split(). Using the split() function, we can split the string into a list of words; this is the most generic and recommended method if one wishes to accomplish this particular task. The drawback is that it fails when the string contains punctuation marks.

Python3

test_string = "GeeksforGeeks is a computer science portal for Geeks"

print("The original string is : " + test_string)

res = test_string.split()

print("\nThe words of string are")

for i in res:
    print(i)

Output

The original string is : GeeksforGeeks is a computer science portal for Geeks

The words of string are
GeeksforGeeks
is
a
computer
science
portal
for
Geeks

Time complexity: O(n)
Auxiliary Space: O(n)
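Before moving on, a quick illustration of the drawback mentioned above: with plain split(), punctuation is not separated out, so it either becomes a token of its own or stays attached to the neighbouring word.

test_string = "GeeksforGeeks is a computer science portal for Geeks !!!"
print(test_string.split())
# ['GeeksforGeeks', 'is', 'a', 'computer', 'science', 'portal', 'for', 'Geeks', '!!!']
print("Hello, world!".split())
# ['Hello,', 'world!'] - the comma and exclamation mark remain part of the words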

Method 2: Using re.findall(). In cases where the string contains special characters and punctuation marks, as discussed above, the conventional split()-based approach can fail, so regular expressions are required. The findall() function returns a list of words extracted from the string, ignoring punctuation marks.

Python3

import re

test_string = "GeeksforGeeks is a computer science portal for Geeks !!!"

print("The original string is : " + test_string)

res = re.findall(r'\w+', test_string)

print("\nThe words of string are")

for i in res:
    print(i)

Output

The original string is : GeeksforGeeks is a computer science portal for Geeks !!!

The words of string are
GeeksforGeeks
is
a
computer
science
portal
for
Geeks

Method 3: Using for loop and string slicing

Python3

test_string = "GeeksforGeeks is a computer science portal for Geeks"

print("The original string is: " + test_string)

start = 0
for i in range(len(test_string)):
    if test_string[i] == ' ':
        print(test_string[start:i])
        start = i + 1

print(test_string[start:])

Output

The original string is: GeeksforGeeks is a computer science portal for Geeks
GeeksforGeeks
is
a
computer
science
portal
for
Geeks

This approach uses a for loop to iterate through the characters in the string and a variable to keep track of the starting index of the current word. When a space is encountered, the current word is printed and the starting index is updated to the next character. The last word in the string is printed outside the loop.

Time Complexity: O(n)
Auxiliary Space : O(n) 

Method 4: Using re.split() with a regular expression

Use re.split() with a regular expression to split the string into words.

Step-by-step approach:

  • Initialize a list to store the words extracted from the string.
  • Use re.split() with a regular expression to split the string into words.
  • Iterate over the words and append them to the list.
  • Print the list of words.

Follow the below steps to implement the above idea:

Python3

import re

test_string = "GeeksforGeeks is a computer science portal for Geeks !!!"

print("The original string is : " + test_string)

res = re.split(r'\W+', test_string)

words = []
for word in res:
    if word:
        words.append(word)

print("\nThe words of string are")

for word in words:
    print(word)

Output

The original string is : GeeksforGeeks is a computer science portal for Geeks !!!

The words of string are
GeeksforGeeks
is
a
computer
science
portal
for
Geeks

Time complexity: O(n), where n is the length of the input string.
Auxiliary space: O(n), where n is the length of the input string.

In this tutorial, you will find out different ways to iterate strings in Python. You could use a for loop, range in Python, slicing operator, and a few more methods to traverse the characters in a string.

Multiple Ways to Iterate Strings in Python

The following are various ways to iterate the chars in a Python string. Let’s first begin with the for loop method.

Using for loop to traverse a string

It is the most prominent and straightforward technique to iterate strings. Follow the below sample code:

"""
Python Program:
 Using for loop to iterate over a string in Python
"""
string_to_iterate = "Data Science"
for char in string_to_iterate:
   print(char)

The result of the above coding snippet is as follows:

D
a
t
a

S
c
i
e
n
c
e

Python range to iterate over a string

Another quite simple way to traverse the string is by using Python range function. This method lets us access string elements using the index.

Go through the sample code given below:

"""
Python Program:
 Using range() to iterate over a string in Python
"""
string_to_iterate = "Data Science"
for char_index in range(len(string_to_iterate)):
   print(string_to_iterate[char_index])

The result of the above coding snippet is as follows:

D
a
t
a

S
c
i
e
n
c
e

Slice operator to iterate strings partially

You can traverse a string as a substring by using the Python slice operator ([]). It extracts a substring from the original string and thus allows you to iterate over it partially.

The [] operator has the following syntax:

# Slicing Operator
string [starting index : ending index : step value]

To use this method, provide the starting and ending indices along with a step value and then traverse the string. Below is the example code that iterates over the first six letters of a string.

"""
Python Program:
 Using slice [] operator to iterate over a string partially
"""
string_to_iterate = "Python Data Science"
for char in string_to_iterate[0 : 6 : 1]:
   print(char)

The result of the above coding snippet is as follows:

P
y
t
h
o
n

You can take the slice operator further by using it to iterate over a string while skipping every alternate character. Check out the below example:

"""
Python Program:
 Using slice [] operator to iterate over specific parts of a string
"""
string_to_iterate = "Python_Data_Science"
for char in string_to_iterate[ :  : 2]:
   print(char)

The result of the above coding snippet is as follows:

P
t
o
_
a
a
S
i
n
e

Traverse string backward using slice operator

If you pass a negative step value and skip the starting and ending indices, you can iterate in the backward direction. Go through the given code sample.

"""
Python Program:
 Using slice [] operator to iterate string backward
"""
string_to_iterate = "Machine Learning"
for char in string_to_iterate[ :  : -1]:
   print(char)

The result of the above coding snippet is as follows:

g
n
i
n
r
a
e
L

e
n
i
h
c
a
M

Using indexing to iterate strings backward

The slice operator first generates a reversed string, and then the for loop traverses it. Instead of doing that, we can use indexing to iterate over the string backward.

"""
Python Program:
 Using indexing to iterate string backward
"""
string_to_iterate = "Machine Learning"
char_index = len(string_to_iterate) - 1
while char_index >= 0:
   print(string_to_iterate[char_index])
   char_index -= 1

The result of the above coding snippet is as follows:

g
n
i
n
r
a
e
L

e
n
i
h
c
a
M

Alternatively, we can pass a negative index value and traverse the string backward. See the below example.

"""
Python Program:
 Using -ve index to iterate string backward
"""
string_to_iterate = "Learn Python"
char_index = 1
while char_index <= len(string_to_iterate):
   print(string_to_iterate[-char_index])
   char_index += 1

The result of the above coding snippet is as follows:

n
o
h
t
y
P

n
r
a
e
L

Summary – Program to iterate strings char by char

Let’s now consolidate all examples inside the Main() function and execute from there.

"""
Program:
 Python Program to iterate strings char by char
"""
def Main():
   string_to_iterate = "Data Science"
   for char in string_to_iterate:
      print(char)

   string_to_iterate = "Data Science"
   for char_index in range(len(string_to_iterate)):
      print(string_to_iterate[char_index])
      
   string_to_iterate = "Python Data Science"
   for char in string_to_iterate[0 : 6 : 1]:
      print(char)
      
   string_to_iterate = "Python_Data_Science"
   for char in string_to_iterate[ :  : 2]:
      print(char)

   string_to_iterate = "Machine Learning"
   for char in string_to_iterate[ :  : -1]:
      print(char)

   string_to_iterate = "Machine Learning"
   char_index = len(string_to_iterate) - 1
   while char_index >= 0:
      print(string_to_iterate[char_index])
      char_index -= 1

   string_to_iterate = "Learn Python"
   char_index = 1
   while char_index <= len(string_to_iterate):
      print(string_to_iterate[-char_index])
      char_index += 1

if __name__ == "__main__":
    Main()

The result of the above coding snippet is as follows:

D
a
t
a
 
S
c
i
e
n
c
e
D
a
t
a
 
S
c
i
e
n
c
e
P
y
t
h
o
n
P
t
o
_
a
a
S
i
n
e
g
n
i
n
r
a
e
L
 
e
n
i
h
c
a
M
g
n
i
n
r
a
e
L
 
e
n
i
h
c
a
M
n
o
h
t
y
P
 
n
r
a
e
L

In this Python tutorial, you will learn how to separate each word from a sentence in Python and then calculate the number of vowels in each word.

We shall be using certain string functions in Python like split() and lower()

The approach that we shall take

  1. string.lower() to convert all the characters of the given string to their respective lower case.
  2. string.split() method to separate words from a given sentence.
  3. After we separate the words, it will be stored in a list called ‘words’.
  4. Initialise a list called vowels that will contain all the vowels present in the English alphabet.
  5. Iterate over the list words and initialise a counter that will count the number of vowels present in the word.
  6. Start a nested loop that iterates over the word in question and check whether any character present in the word is a vowel or not
  7. If a character is a vowel, increase the counter.
  8. Print the word pertaining to the current iteration and the value of the counter associated with it (which contains the number of vowels in that word).
  9. Keep on iterating till we have reached the end of the list words.

lower() function in Python

The lower function in Python is used to convert all the characters in a string to lower case.

How does the lower function in Python work?

#Initialising some strings 
sentence1 = "The Sun Rises In THE EAST" 
sentence2 = "CODING in PYTHON is fun" 
sentence3 = "CODESPEEDY is a great website" 
sentence4 = "STRINGS are FUN to work with"
#printing the original sentences
print("The original strings are:-")
print(sentence1)
print(sentence2)
print(sentence3)
print(sentence4)
#printing the words of the sentences after converting them to lower case
print("After applying lower() function:-")
print(sentence1.lower())
print(sentence2.lower())
print(sentence3.lower())
print(sentence4.lower())

Output:-

The original strings are:-
The Sun Rises In THE EAST
CODING in PYTHON is fun
CODESPEEDY is a great website
STRINGS are FUN to work with
After applying lower() function:-
the sun rises in the east
coding in python is fun
codespeedy is a great website
strings are fun to work with

We can see that the lower() function in Python has converted words like ‘PYTHON’, ‘STRINGS’ to ‘python’ and ‘strings’ respectively.

We shall use this because the vowels list that we shall initialise later contains the vowels in lower case.

split() method in Python

split() method in Python breaks up a sentence into its constituent words on the basis of a particular separator. Here we are separating on the basis of the spaces in between the words.

How does the split() method in Python work?

#Initialising some strings
sentence1 = "sun rises in the east"
sentence2 = "coding in python is fun"
sentence3 = "codespeedy is a great website"
sentence4 = "strings are fun to work with"
#using the split function
words1 = sentence1.split()
words2 = sentence2.split()
words3 = sentence3.split()
words4 = sentence4.split()
#printing the words of the sentences after splitting them
print("The words of the first sentence are::", words1)
print("The words of the second sentence are::", words2)
print("The words of the third sentence are::", words3)
print("The words of the fourth sentence are::", words4)

Let’s look at the output:-

The words of the first sentence are:: ['sun', 'rises', 'in', 'the', 'east']
The words of the second sentence are:: ['coding', 'in', 'python', 'is', 'fun']
The words of the third sentence are:: ['codespeedy', 'is', 'a', 'great', 'website']
The words of the fourth sentence are:: ['strings', 'are', 'fun', 'to', 'work', 'with']

Here, Python has this facility via the split() function where we are getting a separate list based on the placement of whitespaces in between words.
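The split() method also accepts an explicit separator, which is handy when the items are not separated by whitespace; a small illustrative example (the comma-separated string is made up for demonstration):

fruits_line = "apples,oranges,bananas"
fruits = fruits_line.split(",")
print(fruits)   # ['apples', 'oranges', 'bananas']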

Code and Output in Python

Study the code in Python given below and try to associate it with the approach mentioned above:-

s = "Python is a fun language and I really love coding in it" 
s = s.lower()
words = s.split() 
vowels = ['a','e','i','o','u'] 
for word in words: 
c = 0 
for i in range(0,len(word)): 
if word[i] in vowels: 
c+=1 
print(f"The number of vowels in the word '{word}'' is {c}")

The output for the code in Python given above is:-

The number of vowels in the word 'python' is 1
The number of vowels in the word 'is' is 1
The number of vowels in the word 'a' is 1
The number of vowels in the word 'fun' is 1
The number of vowels in the word 'language' is 4
The number of vowels in the word 'and' is 1
The number of vowels in the word 'i' is 1
The number of vowels in the word 'really' is 2
The number of vowels in the word 'love' is 2
The number of vowels in the word 'coding' is 2
The number of vowels in the word 'in' is 1
The number of vowels in the word 'it' is 1

Explanation of the Python code:-

  • Convert all the characters in sentence to lower case using lower() function in Python.
  • Split the sentence up into its constituent words. We do so using the split() function in Python, which separates all the words from the string (‘sentence’) and stores them in a list (‘words’).
  • Then initialise a list which contains all the vowels in the English alphabet [‘a’,’e’,’i’,’o’,’u’] so that we can check if the extracted character from the words of a sentence is a vowel or not.
  • Iterate over the list words and then iterate over the string of the list words. We are nesting two for loops here.
  • Initialise a counter and set the initial value to 0 in the outer loop
  • In the inner loop, we compare every character of the word with the list vowels, hence checking if the character in question is a vowel or not.
  • If the character is a vowel, we add 1 to the counter, hence counting the total number of vowels in the word that we are iterating against.
  • When the inner loop is executed, print the word and the number of vowels in it.
  • This process continues till all the elements of the list words are exhausted and we have essentially checked whether every character of every word is a vowel or not

I hope this Python tutorial was helpful!!

In this tutorial, we’ll look at how to count the frequency of each word in a string corpus in python. We’ll also compare the frequency with visualizations like bar charts.

To count the frequency of each word in a string, you’ll first have to tokenize the string into individual words. Then, you can use the collections.Counter module to count each element in the list resulting in a dictionary of word counts. The following is the syntax:

import collections
s = "the cat and the dog are fighting"
s_counts = collections.Counter(s.split(" "))

Here, s_counts is a dictionary (more precisely, an object of collections.Counter, which is a subclass of dict) storing the word: count mapping based on the frequency in the corpus. You can use it like any other dictionary. If you specifically want to convert it into a plain dictionary, use dict(s_counts).
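For example, a few common operations on the resulting counter (repeating the small corpus from above):

import collections

s_counts = collections.Counter("the cat and the dog are fighting".split(" "))
print(s_counts.most_common(2))   # [('the', 2), ('cat', 1)] - the two most frequent words
print(s_counts["the"])           # 2 - look up the count for a single word
print(dict(s_counts))            # plain dict of word: count pairs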

Let’s look at an example of extracting the frequency of each word from a string corpus in python.

Count of each word in Movie Reviews dataset

We use the IMDB movie reviews dataset which you can download here. The dataset has 50,000 movie reviews written by users. We’ll be using this dataset to see the most frequent words used by the reviewers in positive and negative reviews.

1 – Load the data

First we load the data as a pandas dataframe using the read_csv() function.

import pandas as pd

# read the csv file as a dataframe
reviews_df = pd.read_csv(r"C:\Users\piyush\Documents\Projects\movie_reviews_data\IMDB Dataset.csv")
print(reviews_df.head())

Output:

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

The dataframe has two columns – “review” storing the review of the movie and “sentiment” storing the sentiment associated with the review. Let’s examine how many samples we have for each sentiment.

print(reviews_df['sentiment'].value_counts())

Output:

positive    25000
negative    25000
Name: sentiment, dtype: int64

We have 25000 samples each for “positive” and “negative” sentiments.

2 – Cleaning the text

If we look at the entries in the “review” column, we can find that the reviews contain a number of unwanted elements or styles such as HTML tags, punctuations, inconsistent use of lower and upper case, etc. that could hinder our analysis. For example,

print(reviews_df['review'][1])

Output:

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.

You can see that in the above review, we have HTML tags, quotes, punctuations, etc. that could be cleaned. Let’s write a function to clean the text in the reviews.

import re
import string

def clean_text(text):
    """
    Function to clean the text.
    
    Parameters:
    text: the raw text as a string value that needs to be cleaned
    
    Returns:
    cleaned_text: the cleaned text as string
    """
    # convert to lower case
    cleaned_text = text.lower()
    # remove HTML tags
    html_pattern = re.compile('<.*?>')
    cleaned_text = re.sub(html_pattern, '', cleaned_text)
    # remove punctuations
    cleaned_text = cleaned_text.translate(str.maketrans('', '', string.punctuation))
    
    return cleaned_text.strip()

The above function performs the following operations on the text:

  1. Convert the text to lower case
  2. Remove HTML tags from the text using regular expressions.
  3. Remove punctuations from the text using a translation table (see the short sketch below).
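As a side note on step 3, str.maketrans('', '', string.punctuation) builds a translation table that maps every punctuation character to None, so translate() simply drops those characters; a tiny standalone sketch:

import string

table = str.maketrans('', '', string.punctuation)
print("Hello, world!!! (a test)".translate(table))   # Hello world a test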

Let’s see the above function in action.

print(clean_text(reviews_df['review'][1]))

Output:

a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done

You can see that the text is now fairly consistent and can be split into individual words. Let’s apply this function to the “review” column and create a new column of clean reviews.

reviews_df['clean_review'] = reviews_df['review'].apply(clean_text)

3 – Tokenize the text into words

You can use the string split() function to create a list of individual tokens from a string. For example,

print(clean_text(reviews_df['review'][1]).split(" "))

Output:

['a', 'wonderful', 'little', 'production', 'the', 'filming', 'technique', 'is', 'very', 'unassuming', 'very', 'oldtimebbc', 'fashion', 'and', 'gives', 'a', 'comforting', 'and', 'sometimes', 'discomforting', 'sense', 'of', 'realism', 'to', 'the', 'entire', 'piece', 'the', 'actors', 'are', 'extremely', 'well', 'chosen', 'michael', 'sheen', 'not', 'only', 'has', 'got', 'all', 'the', 'polari', 'but', 'he', 'has', 'all', 'the', 'voices', 'down', 'pat', 'too', 'you', 'can', 'truly', 'see', 'the', 'seamless', 'editing', 'guided', 'by', 'the', 'references', 'to', 'williams', 'diary', 'entries', 'not', 'only', 'is', 'it', 'well', 'worth', 'the', 'watching', 'but', 'it', 'is', 'a', 'terrificly', 'written', 'and', 'performed', 'piece', 'a', 'masterful', 'production', 'about', 'one', 'of', 'the', 'great', 'masters', 'of', 'comedy', 'and', 'his', 'life', 'the', 'realism', 'really', 'comes', 'home', 'with', 'the', 'little', 'things', 'the', 'fantasy', 'of', 'the', 'guard', 'which', 'rather', 'than', 'use', 'the', 'traditional', 'dream', 'techniques', 'remains', 'solid', 'then', 'disappears', 'it', 'plays', 'on', 'our', 'knowledge', 'and', 'our', 'senses', 'particularly', 'with', 'the', 'scenes', 'concerning', 'orton', 'and', 'halliwell', 'and', 'the', 'sets', 'particularly', 'of', 'their', 'flat', 'with', 'halliwells', 'murals', 'decorating', 'every', 'surface', 'are', 'terribly', 'well', 'done']

Let’s create a new column with a list of tokenized words for each review.

reviews_df['review_ls'] = reviews_df['clean_review'].apply(lambda x: x.split(" "))
reviews_df.head()

Output:

dataframe of reviews with additional columns for clean text and tokenized list of words

4 – Create a corpus for positive and negative reviews

Now that we have tokenized the reviews, we can create lists containing words in all the positive and negative reviews. For this, we’ll use itertools to chain together all the positive and negative reviews in single lists.

import itertools

# positive reviews
positive_reviews = reviews_df[reviews_df['sentiment']=='positive']['review_ls']
print("Total positive reviews: ", len(positive_reviews))
positive_reviews_words = list(itertools.chain(*positive_reviews))
print("Total words in positive reviews:", len(positive_reviews_words))

# negative reviews
negative_reviews = reviews_df[reviews_df['sentiment']=='negative']['review_ls']
print("Total negative reviews: ", len(negative_reviews))
negative_reviews_words = list(itertools.chain(*negative_reviews))
print("Total words in negative reviews:", len(negative_reviews_words))

Output:

Total positive reviews:  25000
Total words in positive reviews: 5721948
Total negative reviews:  25000
Total words in negative reviews: 5631466

Now we have one list each for all the words used in positive reviews and all the words used in negative reviews.

5 – Estimate the word frequency in the corpus

Let’s find the frequency of each word in the positive and the negative corpus. For this, we’ll use collections.Counter that returns an object which is essentially a dictionary with word to frequency mappings.

import collections

positive_words_frequency = collections.Counter(positive_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common positive words:", positive_words_frequency.most_common(10))

negative_words_frequency = collections.Counter(negative_reviews_words)
# top 10 most frequent words in negative reviews
print("Most common negative words:", negative_words_frequency.most_common(10))

Output:

Most common positive words: [('the', 332496), ('and', 174195), ('a', 162381), ('of', 151419), ('to', 130495), ('is', 111355), ('in', 97366), ('it', 75383), ('i', 68680), ('this', 66846)]
Most common negative words: [('the', 318041), ('a', 156823), ('and', 145139), ('of', 136641), ('to', 135780), ('is', 98688), ('in', 85745), ('this', 78581), ('i', 76770), ('it', 75840)]

You can see that we get just generic words like “the”, “a”, “and”, etc. as the most frequent words. Such words are called “stop words”; they occur frequently in a corpus but do not necessarily offer discriminative information.

Let’s remove these “stop words” and see which words occur more frequently. To remove the stop words we’ll use the nltk library which has a predefined list of stop words for multiple languages.

import nltk
nltk.download("stopwords")

The above code downloads the stopwords from nltk. We can now go ahead and create a list of English stopwords.

from nltk.corpus import stopwords

# list of english stop words
stopwords_ls = list(set(stopwords.words("english")))
print("Total English stopwords: ", len(stopwords_ls))
print(stopwords_ls[:10])

Output:

Total English stopwords:  179
['some', 'than', 'below', 'once', 'ourselves', "it's", 'these', 'been', 'more', 'which']

We get a list of 179 English stopwords. Note that some of the stopwords have punctuations. If we are to remove stopwords from our corpus, it makes sense to apply the same preprocessing to the stopwords as well that we did to our corpus text.

# cleaning the words in the stopwords list
stopwords_ls = [clean_text(word) for word in stopwords_ls]
print(stopwords_ls[:10])

Output:

['some', 'than', 'below', 'once', 'ourselves', 'its', 'these', 'been', 'more', 'which']

Now, let’s go ahead and remove these words from our positive and negative reviews corpuses using list comprehensions.

# remove stopwords
positive_reviews_words = [word for word in positive_reviews_words if word not in stopwords_ls]
print("Total words in positive reviews:", len(positive_reviews_words))
negative_reviews_words = [word for word in negative_reviews_words if word not in stopwords_ls]
print("Total words in negative reviews:", len(negative_reviews_words))

Output:

Total words in positive reviews: 3019338
Total words in negative reviews: 2944033

We can see a significant reduction in size of the corpuses post removal of the stopwords. Now let’s see the most common words in the positive and the negative corpuses.

positive_words_frequency = collections.Counter(positive_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common positive words:", positive_words_frequency.most_common(10))

negative_words_frequency = collections.Counter(negative_reviews_words)
# top 10 most frequent words in negative reviews
print("Most common negative words:", negative_words_frequency.most_common(10))

Output:

Most common positive words: [('film', 39412), ('movie', 36018), ('one', 25727), ('', 19273), ('like', 17054), ('good', 14342), ('great', 12643), ('story', 12368), ('see', 11864), ('time', 11770)]
Most common negative words: [('movie', 47480), ('film', 35040), ('one', 24632), ('like', 21768), ('', 21677), ('even', 14916), ('good', 14140), ('bad', 14065), ('would', 13633), ('really', 12218)]

You can see that words like “good” and “great” occur frequently in positive reviews while the word “bad” is frequently present in negative reviews. Also, note that a number of words occur commonly in both positive and negative reviews, for example, “movie”, “film”, etc., which is due to the nature of the text data itself, since it consists mostly of movie reviews.

6 – Visualize the word counts

We can visualize the above frequencies as charts to better show their counts. Let’s plot a horizontal bar chart of the 10 most frequent words in both the corpuses.

First, let’s create a dataframe each for the top 10 most frequent words in positive and negative corpuses.

positive_freq_words_df = pd.DataFrame(positive_words_frequency.most_common(10),
                                     columns=["Word", "Frequency"])
print(positive_freq_words_df)

Output:

    Word  Frequency
0   film      39412
1  movie      36018
2    one      25727
3             19273
4   like      17054
5   good      14342
6  great      12643
7  story      12368
8    see      11864
9   time      11770
negative_freq_words_df = pd.DataFrame(negative_words_frequency.most_common(10),
                                     columns=["Word", "Frequency"])
print(negative_freq_words_df)

Output:

     Word  Frequency
0   movie      47480
1    film      35040
2     one      24632
3    like      21768
4              21677
5    even      14916
6    good      14140
7     bad      14065
8   would      13633
9  really      12218

Horizontal bar plot of the most frequent words in the positive reviews:

import matplotlib.pyplot as plt

# set figure size
fig, ax = plt.subplots(figsize=(12, 8))
# plot horizontal bar plot
positive_freq_words_df.sort_values(by='Frequency').plot.barh(x="Word", y="Frequency", ax=ax)
# set the title
plt.title("Most Common words in positive corpus")
plt.show()

Output:

Horizontal bar plot of most frequent words in the positive corpus.

Horizontal bar plot of the most frequent words in the negative reviews:

# set figure size
fig, ax = plt.subplots(figsize=(10, 8))
# plot horizontal bar plot
negative_freq_words_df.sort_values(by='Frequency').plot.barh(x="Word", y="Frequency", ax=ax)
# set the title
plt.title("Most Common words in negative corpus")
plt.show()

Output:

Horizontal bar plot of most frequent words in the negative reviews.

Next Steps

The above was a good exploratory analysis to see the most frequent words used in the IMDB movie reviews dataset for positive and negative reviews. As a next step, you can go ahead and train your own sentiment analysis model to take in a movie review and predict whether it’s positive or negative.

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a Python (version 3.8.3) kernel and pandas version 1.0.5.


Python program to count the frequency of each word in a string :

In this python tutorial, we will learn how to count the frequency of each word in a user input string. The program will read all words, find out the number of occurrences for each word and print them out. It will also sort all the words alphabetically.

To solve this problem, we will use a dictionary. A dictionary is a mutable collection that stores data as key-value pairs. Using any key, we can access its value, and we can also modify the value for a specific key.

A Python dictionary is written using curly brackets. Each key and value are separated by a colon (:), and key-value pairs are separated by commas (,).
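For example, a small dictionary with two key-value pairs (the words here are arbitrary):

word_counts = {"hello": 2, "world": 1}
print(word_counts["hello"])    # access the value for the key 'hello' -> 2
word_counts["world"] += 1      # modify the value for an existing key
word_counts["python"] = 1      # add a new key-value pair
print(word_counts)             # {'hello': 2, 'world': 2, 'python': 1}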

We will use one dictionary to store the frequency of each word in the string. The keys will be the words of the string, and the values will be the frequency of each word. For the string “hello world hello”, it will look like below:

key - hello , value - 2
key - world , value - 1

As you can see, the word ‘hello’ appeared two times in the string. So, the value is 2 for the key ‘hello’. Similarly, for the key ‘world’, value is 1.

Also, it will print the words in alphabetical order, i.e. ‘hello’ before ‘world’.

Algorithm :

The algorithm for the above problem is like below :

  1. Ask the user to enter the string. Store it in a variable.

  2. Create one dictionary to store the frequency of each word in the string.

  3. Read the words in the string one by one.

  4. For each word, check if the dictionary has any key equal to the current word. If yes, increment the value for that key by 1. If not, add one new key-value pair with key equal to the word and value as 1.

  5. Sort all keys in the dictionary alphabetically.

  6. Finally, print out the frequency of each word to the user.

Let’s take a look at the program :

Python program :

#1
input_line = input("Enter a string : ")

#2
words_dict = {}

#3
for word in input_line.split():
    words_dict[word] = words_dict.get(word,0) + 1

#4
for key in sorted(words_dict):
    print("{} : {}".format(key, words_dict[key]))

The source code is shared on Github here.


Explanation :

The commented numbers in the above program denote the step-number below :

  1. Ask the user to enter a string. Read and store it in the input_line variable.

  2. Create one dictionary to store the key-value pair, where the key is the word and value is the frequency of that word. This is an empty dictionary. For creating an empty dictionary, we can use one empty curly braces.

  3. Start scanning the words of the string one by one. Read the current frequency value for that word from the dictionary and increment it by 1. If the word is not in the dictionary yet, use 0 as the starting value.

Here, we are splitting the string using ‘split()’ method. Python string split() method returns one list holding all the words in the string. Using the for loop, we are iterating through the list items, i.e. iterating the words of the string.

  4. Sort all keys of the dictionary alphabetically. That means sort all words contained in the dictionary alphabetically. The sorted() method is used to sort the keys of the dictionary.

Finally, print out the value of the frequency of each word.

Sample Output :

[Image: sample output showing the frequency of each word in the entered string]


As a part of text analytics, we frequently need to count words and assign weightage to them for processing in various algorithms, so in this article we will see how we can find the frequency of each word in a given sentence. We can do it with three approaches as shown below.

Using Counter

We can use Counter() from the collections module to get the frequency of the words. Here we first apply split() to generate the words from the line and then apply most_common().

Example


from collections import Counter
line_text = "Learn and practice and learn to practice"
freq = Counter(line_text.split()).most_common()
print(freq)

Running the above code gives us the following result −

[('and', 2), ('practice', 2), ('Learn', 1), ('learn', 1), ('to', 1)]

Using FreqDist()

The Natural Language Toolkit (nltk) provides the FreqDist class, which shows the number of words in the string as well as the number of distinct words. Applying most_common() gives us the frequency of each word.

Example

from nltk import FreqDist
text = "Learn and practice and learn to practice"
words = text.split()
fdist1 = FreqDist(words)
print(fdist1)
print(fdist1.most_common())

Running the above code gives us the following result −

<FreqDist with 5 samples and 7 outcomes>
[('and', 2), ('practice', 2), ('Learn', 1), ('learn', 1), ('to', 1)]

Using Dictionary

In this approach we split the line into a list of words. Then we apply count() to get the frequency of each word and zip the words with their frequency values. The final result is shown as a dictionary.

Example


text = "Learn and practice and learn to practice"
words = text.split()
wfreq = [words.count(w) for w in words]
print(dict(zip(words, wfreq)))

Running the above code gives us the following result:

{'Learn': 1, 'and': 2, 'practice': 2, 'learn': 1, 'to': 1}
