Least common word in Python

I’ve just started coding, so I’m not using dictionaries, sets, imports, or anything more advanced than for/while loops and if statements.

list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"] 
list2 = ["cry", "cry", "cry", "no", "no", "no", "me", "me", "me"] 

def codedlist(number):
    max = 0
    for k in hello:
        if first.count(number) > max:
            max = first.count(number)


asked Sep 9, 2019 at 14:06

Teamcoachafl1

You can use collections.Counter to find it with a one-liner:

from collections import Counter

list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"] 
Counter(list1).most_common()[-1]

Output:

('cry', 2)

(most_common() returns the counted elements sorted by their count in descending order, so the last element, [-1], has the least count.)

Or, slightly more involved, if there can be several minimal elements:

from collections import Counter

list1 = [1,2,3,4,4,4,4,4]
counted = Counter(list1).most_common()
least_count = min(counted, key=lambda y: y[1])[1]
list(filter(lambda x: x[1] == least_count, counted))

Output:

[(1, 1), (2, 1), (3, 1)]

answered Sep 9, 2019 at 14:12

vurmux


You can use collections.Counter to count the frequency of each string, then use min to get the minimum frequency, and finally a list comprehension to get the strings with that frequency:

from collections import Counter

def codedlist(number):
    c = Counter(number)
    m = min(c.values())
    return [s for s, i in c.items() if i == m]

print(codedlist(list1))
print(codedlist(list2))

Output:

['cry']
['cry', 'no', 'me']

answered Sep 9, 2019 at 14:11

DjaouadNM

from collections import Counter

def least_common(words):
    d = dict(Counter(words))
    min_freq = min(d.values())
    return [(k,v) for k,v in d.items() if v == min_freq]

words = ["cry", "cry", "cry", "no", "no", "no", "me", "me", "me"]

print(least_common(words))

answered Sep 9, 2019 at 14:18

Akash Pagar

A simple, algorithmic way to do this:

def codedlist(my_list):
    least = 99999999  # A very high number, larger than any possible count
    word = ''
    for element in my_list:
        repeated = my_list.count(element)
        if repeated < least:
            least = repeated  # The smallest count seen so far
            word = element    # The word that has it
    return word

It’s not very performant, though. There are better ways to do this, but I think it’s an easy one for a beginner to understand.
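If you also want every word that ties for the lowest count, the same loop-only idea extends naturally. This is just a sketch using only lists and loops, within the question's constraints; `least_common_words` is a name I've made up:

```python
def least_common_words(my_list):
    least = len(my_list) + 1  # no count can exceed the list length
    words = []
    for element in my_list:
        repeated = my_list.count(element)
        if repeated < least:
            least = repeated       # new minimum found
            words = [element]      # start the result list over
        elif repeated == least and element not in words:
            words.append(element)  # a tie: keep this word too
    return words

print(least_common_words(["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]))
# ['cry']
```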

answered Sep 9, 2019 at 14:28

Alexander Santos

If you want all the words sorted by count (ascending):

import numpy as np

list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]
list2 = ["cry", "cry", "cry", "no", "no", "no", "me", "me", "me"]

uniques_values = np.unique(list1)

final_list = []
for i in range(0,len(uniques_values)):
    final_list.append((uniques_values[i], list1.count(uniques_values[i])))

def takeSecond(elem):
    return elem[1]

final_list.sort(key=takeSecond)

print(final_list)

For list1:

[('cry', 2), ('no', 3), ('me', 4)]

For list2:

[('cry', 3), ('me', 3), ('no', 3)]

Be careful with the code: to switch to a different list you have to edit it in two places (the np.unique call and the .count call).

Some useful explanation:

  • numpy.unique gives you non-repeated values

  • def takeSecond(elem), returning elem[1], is a key function that lets you sort an array of pairs by the [1] column (the second value).

It can be useful for displaying the values, or for getting all items sorted by this criterion.

Hope it helps.
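As a side note, np.unique can also return the counts directly via return_counts=True, which avoids the repeated list.count calls; a sketch using the same list1:

```python
import numpy as np

list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]

# return_counts=True gives the count of every unique value in one call
values, counts = np.unique(list1, return_counts=True)

# pair each value with its count and sort ascending by count
pairs = sorted(zip(values, counts), key=lambda p: p[1])
print(pairs)
```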

answered Sep 9, 2019 at 14:33

Nicolás Rodríguez

Finding the minimum is often similar to finding the maximum. You count the number of occurrences of an element, and if this count is smaller than the current counter (the occurrence count of the least common element so far), you replace the counter.

This is a crude solution that uses a lot of memory and takes a lot of time to run. You will understand more of lists (and their manipulation) if you try to shorten the run time and memory usage. I hope this helps!

list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]
list2 = ["cry", "cry", "cry", "no", "no", "no", "me", "me", "me"]

def codedlist(l):
    min = False   # Our counter (the smallest count seen so far)
    indices = []  # Records the positions of the least common elements
    for i in range(0, len(l)):
        count = 0
        for x in l:  # You can possibly shorten the run time here
            if x == l[i]:
                count += 1
        if not min:  # i.e. this is the first element
            min = count
            indices = [i]
        elif min > count:  # This element is the least common so far
            min = count    # Replace the counter
            indices = [i]  # This is your only index
        elif min == count:  # It ties with the least common count
            indices.append(i)  # Add it to our index list

    tempList = []
    # You can possibly shorten the run time below
    for ind in indices:
        tempList.append(l[ind])
    rList = []
    for x in tempList:  # Remove duplicates in the list
        if x not in rList:
            rList.append(x)
    return rList

print(codedlist(list1))
print(codedlist(list2))

Output

['cry']
['cry', 'no', 'me']

answered Sep 9, 2019 at 14:35

Harsha's user avatar

HarshaHarsha

3531 silver badge15 bronze badges

Probably the simplest and fastest approach to retrieve the least common item in a collection:

min(list1, key=list1.count)

In action:

>>> data = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]
>>> min(data, key=data.count)
'cry'

Tested the speed against the collections.Counter approach and it’s much faster.

P.S: The same can be done with max to find the most common item.
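For instance, a quick sketch of that:

```python
data = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]

# max with the same key function finds the most common item instead
print(max(data, key=data.count))  # me (4 occurrences)
```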

Edit

To get multiple least common items, you can extend this approach with a comprehension:

>>> lc = data.count(min(data, key=data.count))
>>> {i for i in data if data.count(i) == lc}
{'no', 'me', 'cry'}

answered Sep 9, 2019 at 14:43

Jab

Basically you want to go through your list and at each element ask yourself:

"Have I seen this element before?"

If the answer is yes, you add 1 to the count of that element; if the answer is no, you add it to the dictionary of seen values. Finally, we sort the items by value and pick the first word, as it has the smallest count. Let's implement it:

import operator

words = ['blah', 'blah', 'car']
seen_dictionary = {}
for w in words:
    if w in seen_dictionary:
        seen_dictionary[w] += 1
    else:
        seen_dictionary[w] = 1

# items() yields (word, count) pairs; sorting by count ascending
# puts the least common word first
final_word = sorted(seen_dictionary.items(), key=operator.itemgetter(1))[0][0]

answered Sep 10, 2019 at 11:31

Matej Novosad

def codedlist(words):
    counts = {}
    for item in words:
        counts[item] = words.count(item)
    most_common_number = max(counts.values())
    most_common = []
    for k, v in counts.items():
        if most_common_number == v:
            most_common.append(k)
    return most_common

list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"] 
list2 = ["cry", "cry", "cry", "no", "no", "no", "me", "me", "me"] 

print(codedlist(list1))

answered Sep 9, 2019 at 14:43

Vineet

I’d read a set amount of the file at a time, split it into characters, and then split on whitespace. This is better than splitting on each newline, as the file may be a single line.

To do the former in Python 3 is fairly simple:

def read_chunks(file, chunk_size):
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        yield from chunk

This has $O(\text{chunk\_size})$ memory usage, which is $O(1)$ as it’s a constant. It also correctly ends the iterator when the file ends.

After this, you want to split out the words. Since str.split is used without any arguments, we should replicate just that method of splitting. We can use a fairly simple algorithm:

from string import whitespace

def split_whitespace(it):
    chunk = []
    for char in it:
        if char not in whitespace:
            chunk.append(char)
        elif chunk:
            yield tuple(chunk)
            chunk = []
    if chunk:
        yield tuple(chunk)

This has $O(k)$ memory, where $k$ is the size of the largest word. What we’d expect of a splitting function.

Finally, we’d change from tuples back to strings using ''.join, and then use collections.Counter. We’d also split reading the words and finding the most common into two separate functions.

And so for an $O(k)$ memory usage version of your code, I’d use:

import sys
from collections import Counter
from string import whitespace


def read_chunks(file, chunk_size):
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        yield from chunk


def split_whitespace(it):
    chunk = []
    for char in it:
        if char not in whitespace:
            chunk.append(char)
        elif chunk:
            yield tuple(chunk)
            chunk = []
    if chunk:
        yield tuple(chunk)


def read_words(path, chunk_size=1024):
    with open(path) as f:
        chars = read_chunks(f, chunk_size)
        tuple_words = split_whitespace(chars)
        yield from map(''.join, tuple_words)


def most_common_words(words, top=10):
    return dict(Counter(words).most_common(top))


if __name__ == '__main__':
    words = read_words(sys.argv[1])
    top_five_words = most_common_words(words, 5)
    print(top_five_words)

The challenge

Write a function that, given a string of text (possibly with punctuation and line-breaks), returns an array of the top-3 most occurring words, in descending order of the number of occurrences.

Assumptions:

  • A word is a string of letters (A to Z) optionally containing one or more apostrophes (') in ASCII. (No need to handle fancy punctuation.)
  • Matches should be case-insensitive, and the words in the result should be lowercased.
  • Ties may be broken arbitrarily.
  • If a text contains fewer than three unique words, return the top-2 or top-1 words, or an empty array if the text contains no words.

Examples:

top_3_words("In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.")
# => ["a", "of", "on"]

top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")
# => ["e", "ddd", "aa"]

top_3_words("  //wont won't won't")
# => ["won't", "wont"]

Bonus points:

  1. Avoid creating an array whose memory footprint is roughly as big as the input text.
  2. Avoid sorting the entire array of unique words.

Test cases

from random import choice, randint, sample, shuffle, choices
import re
from collections import Counter


def check(s, this=None):                                            # this: only for debugging purpose
    returned_result = top_3_words(s) if this is None else this
    fs = Counter(w for w in re.findall(r"[a-zA-Z']+", s.lower()) if w != "'" * len(w))
    exp,expected_frequencies = map(list,zip(*fs.most_common(3))) if fs else ([],[])
    
    msg = ''
    wrong_words = [w for w in returned_result if not fs[w]]
    actual_freq = [fs[w] for w in returned_result]
    
    if wrong_words:
        msg = 'Incorrect match: words not present in the string. Your output: {}. One possible valid answer: {}'.format(returned_result, exp)
    elif len(set(returned_result)) != len(returned_result):
        msg = 'The result should not contain copies of the same word. Your output: {}. One possible output: {}'.format(returned_result, exp)
    elif actual_freq!=expected_frequencies:
        msg = "Incorrect frequencies: {} should be {}. Your output: {}. One possible output: {}".format(actual_freq, expected_frequencies, returned_result, exp)
    
    Test.expect(not msg, msg)



@test.describe("Fixed tests")
def fixed_tests():

    TESTS = (
    "a a a  b  c c  d d d d  e e e e e",
    "e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e",
    "  //wont won't won't ",
    "  , e   .. ",
    "  ...  ",
    "  '  ",
    "  '''  ",
    """In a village of La Mancha, the name of which I have no desire to cao
    mind, there lived not long since one of those gentlemen that keep a lance
    in the lance-rack, an old buckler, a lean hack, and a greyhound for
    coursing. An olla of rather more beef than mutton, a salad on most
    nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
    on Sundays, made away with three-quarters of his income.""",
    "a a a  b  c c X",
    "a a c b b",
    )
    for s in TESTS: check(s)
    
@test.describe("Random tests")
def random_tests():
    
    def gen_word():
        return "".join(choice("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'") for _ in range(randint(3, 10)))
    
    def gen_string():
        words = []
        nums = choices(range(1, 31), k=20)
        for _ in range(randint(0, 20)):
            words += [gen_word()] * nums.pop()
        shuffle(words)
        s = ""
        while words:
            s += words.pop() + "".join(choice("-,.?!_:;/ ") for _ in range(randint(1, 5)))
        return s
    
    @test.it("Tests")
    def it_1():
        for _ in range(100): check(gen_string())
            

The solution using Python

Option 1:

# use the Counter module
from collections import Counter
# use the regex module
import re

def top_3_words(text):
    # count the input, pass through a regex and lowercase it
    c = Counter(re.findall(r"[a-z']+", re.sub(r" '+ ", " ", text.lower())))
    # return the `most common` 3 items
    return [w for w,_ in c.most_common(3)]

Option 2:

def top_3_words(text):
    # loop through each character in the string
    for c in text:
        # if it's not alphanumeric or an apostrophe
        if not (c.isalpha() or c=="'"):
            # replace with a space
            text = text.replace(c,' ')
    # create some `list` variables
    words,counts,out = [],[],[]

    # loop through the words in the text
    for word in list(filter(None,text.lower().split())):
        # if in all, then continue
        if all([not c.isalpha() for c in word]):
            continue
        # if the word is in the words list
        if word in words:
            # increment the count
            counts[words.index(word)] += 1
        else:
            # otherwise create a new entry
            words.append(word); counts.append(0)

    # loop while bigger than 0 and less than 3
    while len(words)>0 and len(out)<3:
        # append the counts
        out.append(words.pop(counts.index(max(counts))).lower())
        counts.remove(max(counts))
    # return the counts
    return out

Option 3:

def top_3_words(text):
    wrds = {}
    for p in r'!"#$%&()*+,./:;<=>?@[]^_`{|}~-':
        text = text.replace(p, ' ')
    for w in text.lower().split():
        if w.replace("'", '') != '':
            wrds[w] = wrds.get(w, 0) + 1
    return [y[0] for y in sorted(wrds.items(), key=lambda x: x[1], reverse=True)[:3]]
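As a quick sanity check, Option 1 (repeated here so the snippet is self-contained) reproduces the challenge's examples:

```python
from collections import Counter
import re

def top_3_words(text):
    # drop standalone apostrophe runs, then pull out word tokens
    c = Counter(re.findall(r"[a-z']+", re.sub(r" '+ ", " ", text.lower())))
    return [w for w, _ in c.most_common(3)]

print(top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e"))
# ['e', 'ddd', 'aa']
print(top_3_words("  //wont won't won't"))
# ["won't", 'wont']
```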

Given a list of strings, write a Python program to get the word with the most occurrences across all the strings.

Example:

Input : test_list = ["gfg is best for geeks", "geeks love gfg", "gfg is best"]
Output : gfg
Explanation : gfg occurs 3 times, the most across the strings in total.

Input : test_list = ["geeks love gfg", "geeks are best"]
Output : geeks
Explanation : geeks occurs 2 times, the most across the strings in total.

Method #1 : Using loop + max() + split() + defaultdict()

In this, we perform the task of getting each word using split(), and increase its frequency by memoizing it with defaultdict(). At last, max() is used with a key parameter to get the string with the maximum frequency.

Python3

from collections import defaultdict

test_list = ["gfg is best for geeks", "geeks love gfg", "gfg is best"]
print("The original list is : " + str(test_list))

temp = defaultdict(int)
for sub in test_list:
    for wrd in sub.split():
        temp[wrd] += 1

res = max(temp, key=temp.get)
print("Word with maximum frequency : " + str(res))

Output

The original list is : ['gfg is best for geeks', 'geeks love gfg', 'gfg is best']
Word with maximum frequency : gfg

Time Complexity: O(n*k), where n is the number of strings and k is the average number of words per string
Auxiliary Space: O(n*k)

Method #2 : Using list comprehension + mode()

In this, we get all the words using list comprehension and get maximum frequency using mode().

Python3

from statistics import mode

test_list = ["gfg is best for geeks", "geeks love gfg", "gfg is best"]
print("The original list is : " + str(test_list))

temp = [wrd for sub in test_list for wrd in sub.split()]
res = mode(temp)
print("Word with maximum frequency : " + str(res))

Output

The original list is : ['gfg is best for geeks', 'geeks love gfg', 'gfg is best']
Word with maximum frequency : gfg

Method #3: Using list() and Counter()

  • Append all words to an empty list and calculate the frequency of each word using Counter().
  • Find the max count and print that key.

Below is the implementation:

Python3

from collections import Counter

def mostFrequentWord(words):
    lis = []
    for i in words:
        for j in i.split():
            lis.append(j)

    freq = Counter(lis)
    best = 0
    word = None
    for i in freq:
        if freq[i] > best:
            best = freq[i]
            word = i
    # return only after checking every word, not inside the loop
    return word

words = ["gfg is best for geeks", "geeks love gfg", "gfg is best"]
print("The original list is : " + str(words))
print("Word with maximum frequency : " + mostFrequentWord(words))

Output

The original list is : ['gfg is best for geeks', 'geeks love gfg', 'gfg is best']
Word with maximum frequency : gfg

The time and space complexity for all the methods are similar:

Time Complexity: O(n * k), where n is the number of strings and k is the average number of words per string

Space Complexity: O(n * k)

Method #4: Using Counter() and reduce()
Here is an approach to solve the problem using the most_common() function of the collections module’s Counter class and the reduce() function from the functools module:

Python3

from collections import Counter
from functools import reduce

def most_frequent_word(test_list):
    all_words = reduce(lambda a, b: a + b, [sub.split() for sub in test_list])
    word_counts = Counter(all_words)
    return word_counts.most_common(1)[0][0]

test_list = ["gfg is best for geeks", "geeks love gfg", "gfg is best"]
print("The original list is: ", test_list)
print("Word with most frequency: ", most_frequent_word(test_list))

Output

The original list is:  ['gfg is best for geeks', 'geeks love gfg', 'gfg is best']
Word with most frequency:  gfg

Explanation:

  • We use reduce() to concatenate the lists of words from each string in test_list.
  • We then create a Counter object from the list of all words to count the frequency of each word.
  • Finally, we use most_common() to get the word with the highest frequency and return it.

Time complexity: O(n * k), where n is the number of strings in test_list and k is the average number of words in each string.

Auxiliary Space: O(n * k), since we are storing the words in a list before creating a Counter object.

In this tutorial, you’ll learn how to use the Python Counter class from the collections module to count items. The Counter class provides an incredibly pythonic method to count items in lists, tuples, strings, and more. Because counting items is a common task in programming, being able to do this easily and elegantly is a useful skill for any Pythonista.

The Counter class provides a subclass to the Python dictionary, adding in many useful ways to easily count items in another object. For example, you can easily return the number of items, the most common item, and even undertake arithmetic on different Counter items.

By the end of this tutorial, you’ll have learned:

  • How to use the Counter class to count items in Python
  • How to get the most and least common items in a counter object
  • How to add and subtract different Counter objects
  • How to update Counter objects in Python

Understanding Python’s Collection Counter Class

The Python Counter class is an integral part of the collections module. The class provides incredibly intuitive and Pythonic methods to count items in an iterable, such as lists, tuples, or strings. This allows you to count the frequency of items within that iterable, including finding the most common item.

Let’s start by creating an empty Counter object. We first need to import the class from the collections module. Following that, we can instantiate the object:

# Creating an Empty Counter Object
from collections import Counter

counter = Counter()

Now that we have our first Counter object created, let’s explore some of the properties of the object. For example, we can check its type by using the type() function. We can also verify that the object is a subclass of the Python dictionary.

# Checking Attributes of the Python Counter Class
from collections import Counter

counter = Counter()

print('Type of counter is: ',type(counter))
print('Counter is a subclass of a dictionary: ', issubclass(Counter, dict))

# Returns:
# Type of counter is:  <class 'collections.Counter'>
# Counter is a subclass of a dictionary:  True

Now that you have an understanding of the Python Counter class, let’s get started with creating our first Counter object!

Creating a Counter Object in Python

Let’s create our first Python Counter object. We can pass in a string and the Counter object will return the counts of all the letters in that string.

The class takes only a single parameter, the item we want to count. Let’s see how we can use it:

# Creating Our First Counter
from collections import Counter

a_string = 'hello! welcome to datagy'
counter = Counter(a_string)

print(counter)

# Returns:
# Counter({'e': 3, 'l': 3, 'o': 3, ' ': 3, 't': 2, 'a': 2, 'h': 1, '!': 1, 'w': 1, 'c': 1, 'm': 1, 'd': 1, 'g': 1, 'y': 1})

By printing out our counter, we can see that it returns a dictionary-like object, with items sorted by frequency in descending order. In this case, we can see that the letter 'e' appears three times in our string.

Accessing Counter Values in Python

Because the Counter object returns a subclass of a dictionary, we can use dictionary methods to access the counts of an item in that dictionary. Let’s see how we can access the number of times the letter 'a' appears in our string:

# Accessing Counts in a Counter Object
from collections import Counter

a_string = 'hello! welcome to datagy'
counter = Counter(a_string)

print(counter['a'])

# Returns: 2

We can see that the letter 'a' exists twice in our string. We can even access the counts of items that don’t exist in our object.

# Counting Items that Don't Exist
from collections import Counter

a_string = 'hello! welcome to datagy'
counter = Counter(a_string)

print(counter['z'])

# Returns: 0

In a normal Python dictionary, this would raise a KeyError. However, the Counter class has been designed to prevent this by overriding the default behavior.
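To see the difference, here is a quick sketch contrasting a plain dict with a Counter:

```python
from collections import Counter

plain = {'a': 2}
counter = Counter({'a': 2})

# a plain dict raises KeyError on a missing key
try:
    plain['z']
except KeyError:
    print("plain dict raised KeyError")

# a Counter reports zero instead, without adding the key
print(counter['z'])    # 0
print('z' in counter)  # False: the lookup did not create an entry
```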

Finding the Most Common Item in a Python Counter

The Counter class makes it easy to find the most common item in a given object. This can be done by applying the .most_common() method onto the object. Let’s see how we can find the most common item in our object:

# Finding the Most Common Item
from collections import Counter

a_string = 'hello! welcome to datagy'
counter = Counter(a_string)

print(counter.most_common())

# Returns: [('e', 3), ('l', 3), ('o', 3), (' ', 3), ('t', 2), ('a', 2), ('h', 1), ('!', 1), ('w', 1), ('c', 1), ('m', 1), ('d', 1), ('g', 1), ('y', 1)]

We can see that this returns a list of tuples that’s been ordered by placing the most common items first. Because of this, we can access the most common item by accessing the first index:

# Accessing the Most Common Item
from collections import Counter

a_string = 'hello! welcome to datagy'
counter = Counter(a_string)

print(counter.most_common()[0])

# Returns: ('e', 3)

Finding the Least Common Item in a Python Counter

Similarly, we can access the least common item by getting the last index:

# Accessing the Least Common Item
from collections import Counter

a_string = 'hello! welcome to datagy'
counter = Counter(a_string)

print(counter.most_common()[-1])

# Returns: ('y', 1)
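There is no least_common() method, but the collections documentation suggests slicing most_common() from the end to get the n least common items:

```python
from collections import Counter

a_string = 'hello! welcome to datagy'
counter = Counter(a_string)

# the three least common items, least common first
print(counter.most_common()[:-4:-1])
# [('y', 1), ('g', 1), ('d', 1)]
```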

Finding n Most Common Items in a Python Counter

The method also allows you to pass in an integer, returning just that number of items. Say we wanted to get the three most common items; you could write:

# Getting n Number of Most Common Items
from collections import Counter

a_string = 'hello! welcome to datagy'
counter = Counter(a_string)

print(counter.most_common(3))

# Returns: [('e', 3), ('l', 3), ('o', 3)]

Updating Counter Values in Python

One of the great things about the Python Counter is that values can also be updated. This can be done using the .update() method. The method accepts another iterable, which will update the values in place.

Let’s see how we can first count items using the collection’s Counter class and then pass in another iterable to update our counts.

# Updating Counter Values in Python
from collections import Counter

a_list = [1,2,3,1,2,3,1,1,2,1]
counter = Counter(a_list)
print(counter)

counter.update([1,2,3,4])
print(counter)

# Returns:
# Counter({1: 5, 2: 3, 3: 2})
# Counter({1: 6, 2: 4, 3: 3, 4: 1})

We can see that the values were updated in place, with the values of the new item.
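Counter also provides .subtract(), the decrementing counterpart of .update(); unlike the - operator, it keeps zero and negative counts. A quick sketch:

```python
from collections import Counter

counter = Counter([1, 2, 3, 1, 2, 3, 1, 1, 2, 1])
print(counter)  # Counter({1: 5, 2: 3, 3: 2})

# subtract() decrements counts in place
counter.subtract([1, 1, 3])
print(counter)  # Counter({1: 3, 2: 3, 3: 1})
```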

Deleting Counter Values in Python

It’s also very easy to delete an item from a Counter object. This can be useful when you either want to reset a value or simple find a way to remove an item from being counted.

You can delete an item from a Counter object by using the del keyword. Let’s load a Counter object and then delete a value:

# Deleting an Item from a Counter Object
from collections import Counter

a_list = [1,2,3,1,2,3,1,1,2,1]
counter = Counter(a_list)
print(counter)

del counter[1]
print(counter)

# Returns:
# Counter({1: 5, 2: 3, 3: 2})
# Counter({2: 3, 3: 2})

Arithmetic Operations on Counter Objects in Python

It’s equally easy to apply arithmetic operations like addition and subtraction on Counter objects. This allows you to combine Counter objects or find the difference between two items.

This can be done using the + and the - operators respectively. Let’s take a look at addition first:

# Adding 2 Counter Objects Together
from collections import Counter

counter1 = Counter([1,1,2,3,4])
counter2 = Counter([1,2,3,4])

print(counter1 + counter2)

# Returns:
# Counter({1: 3, 2: 2, 3: 2, 4: 2})

Now let’s subtract the two counters:

# Subtracting 2 Counter Objects
from collections import Counter

counter1 = Counter([1,1,2,3,4])
counter2 = Counter([1,2,3,4])

print(counter1 - counter2)

# Returns:
# Counter({1: 1})

Combining Counter Objects in Python

We can also combine Counter objects using the & and | operators. These serve very different purposes. Let’s break them down a bit:

  • & will return the common positive minimum values
  • | will return the positive maximum values

Let’s take a look at the & operator first:

# Finding Common Minimum Elements
from collections import Counter

counter1 = Counter([1,1,2,2,2,3,3,4])
counter2 = Counter([1,2,3,4,5])

print(counter1 & counter2)

# Returns:
# Counter({1: 1, 2: 1, 3: 1, 4: 1})

Now let’s take a look at the maximums between the two Counter objects:

# Finding Maximum Elements
from collections import Counter

counter1 = Counter([1,1,2,2,2,3,3,4])
counter2 = Counter([1,2,3,4,5])

print(counter1 | counter2)

# Returns:
# Counter({2: 3, 1: 2, 3: 2, 4: 1, 5: 1})

Finding the Most Common Word in a Python String

Before closing out the tutorial, let’s take a look at a practical example. We can use Python’s Counter class to find the most common word in a string. Let’s load the Zen of Python and find the most common word in that string.

Before we pass the string into the Counter class, we need to split it. We can use the .split() method to split at any white-space character, including newlines. Then we can apply the .most_common() method and access the first item’s value by accessing the [0][0] item:

# Finding the Most Frequent Word in a String
from collections import Counter

text = """
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
"""

counter = Counter(text.split())
print(counter.most_common()[0][0])

# Returns: is

Conclusion

In this post, you learned how to use the Python collection’s Counter class. You started off by learning how the class can be used to create frequencies of an iterable object. You then learned how to find the counts of a particular item and how to find the most and least frequent item. You then learned how to update counts, as well as perform arithmetic on these count items.

Additional Resources

To learn more about related topics, check out the tutorials below:

  • Python Defaultdict: Overview and Examples
  • Python: Add Key:Value Pair to Dictionary
  • Python Merge Dictionaries – Combine Dictionaries (7 Ways)
  • Python: Sort a Dictionary by Values
  • Official Documentation: Python collections Counter

In this tutorial, we’ll look at how to count the frequency of each word in a string corpus in python. We’ll also compare the frequency with visualizations like bar charts.

To count the frequency of each word in a string, you’ll first have to tokenize the string into individual words. Then, you can use the collections.Counter module to count each element in the list resulting in a dictionary of word counts. The following is the syntax:

import collections
s = "the cat and the dog are fighting"
s_counts = collections.Counter(s.split(" "))

Here, s_counts is a dictionary (more precisely, an object of collections.Counter, which is a subclass of dict) storing the word: count mapping based on frequency in the corpus. You can use it for all dictionary-like operations. But if you specifically want to convert it into a plain dictionary, use dict(s_counts).
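For example, a quick sketch of the conversion:

```python
import collections

s = "the cat and the dog are fighting"
s_counts = collections.Counter(s.split(" "))

# Counter already behaves like a dict, but dict() gives a plain mapping
plain = dict(s_counts)
print(plain["the"])  # 2
print(type(plain))   # <class 'dict'>
```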

Let’s look at an example of extracting the frequency of each word from a string corpus in python.

Count of each word in Movie Reviews dataset

We use the IMDB movie reviews dataset, which you can download here. The dataset has 50000 movie reviews written by users. We’ll be using this dataset to see the most frequent words used by reviewers in positive and negative reviews.

1 – Load the data

First we load the data as a pandas dataframe using the read_csv() function.

import pandas as pd

# read the csv file as a dataframe
reviews_df = pd.read_csv(r"C:\Users\piyush\Documents\Projects\movie_reviews_data\IMDB Dataset.csv")
print(reviews_df.head())

Output:

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

The dataframe has two columns – “review” storing the review of the movie and “sentiment” storing the sentiment associated with the review. Let’s examine how many samples we have for each sentiment.

print(reviews_df['sentiment'].value_counts())

Output:

positive    25000
negative    25000
Name: sentiment, dtype: int64

We have 25000 samples each for “positive” and “negative” sentiments.

2 – Cleaning the text

If we look at the entries in the “review” column, we can find that the reviews contain a number of unwanted elements such as HTML tags, punctuation, and inconsistent use of lower and upper case that could hinder our analysis. For example,

print(reviews_df['review'][1])

Output:

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.

You can see that in the above review, we have HTML tags, quotes, punctuations, etc. that could be cleaned. Let’s write a function to clean the text in the reviews.

import re
import string

def clean_text(text):
    """
    Function to clean the text.
    
    Parameters:
    text: the raw text as a string value that needs to be cleaned
    
    Returns:
    cleaned_text: the cleaned text as string
    """
    # convert to lower case
    cleaned_text = text.lower()
    # remove HTML tags
    html_pattern = re.compile('<.*?>')
    cleaned_text = re.sub(html_pattern, '', cleaned_text)
    # remove punctuations
    cleaned_text = cleaned_text.translate(str.maketrans('', '', string.punctuation))
    
    return cleaned_text.strip()

The above function performs the following operations on the text:

  1. Convert the text to lower case.
  2. Remove HTML tags from the text using regular expressions.
  3. Remove punctuation from the text using a translation table.
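To see the translation-table step in isolation, here is a minimal sketch (using only the standard library) of how str.maketrans with three arguments deletes every punctuation character. Note that it leaves the letters of an HTML tag behind, which is why the function removes tags with a regex before this step:

```python
import string

# a translation table whose third argument lists the characters to delete
table = str.maketrans('', '', string.punctuation)

# the "<br />" tag loses its angle brackets and slash but keeps "br"
cleaned = "Hello, world! <br />".lower().translate(table).strip()
print(cleaned)  # → hello world br
```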

Let’s see the above function in action.

print(clean_text(reviews_df['review'][1]))

Output:

a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done

You can see that the text is now fairly consistent and can be split into individual words. Let’s apply this function to the “review” column and create a new column of clean reviews.

reviews_df['clean_review'] = reviews_df['review'].apply(clean_text)

3 – Tokenize the text into words

You can use the string split() function to create a list of individual tokens from a string. For example,

print(clean_text(reviews_df['review'][1]).split(" "))

Output:

['a', 'wonderful', 'little', 'production', 'the', 'filming', 'technique', 'is', 'very', 'unassuming', 'very', 'oldtimebbc', 'fashion', 'and', 'gives', 'a', 'comforting', 'and', 'sometimes', 'discomforting', 'sense', 'of', 'realism', 'to', 'the', 'entire', 'piece', 'the', 'actors', 'are', 'extremely', 'well', 'chosen', 'michael', 'sheen', 'not', 'only', 'has', 'got', 'all', 'the', 'polari', 'but', 'he', 'has', 'all', 'the', 'voices', 'down', 'pat', 'too', 'you', 'can', 'truly', 'see', 'the', 'seamless', 'editing', 'guided', 'by', 'the', 'references', 'to', 'williams', 'diary', 'entries', 'not', 'only', 'is', 'it', 'well', 'worth', 'the', 'watching', 'but', 'it', 'is', 'a', 'terrificly', 'written', 'and', 'performed', 'piece', 'a', 'masterful', 'production', 'about', 'one', 'of', 'the', 'great', 'masters', 'of', 'comedy', 'and', 'his', 'life', 'the', 'realism', 'really', 'comes', 'home', 'with', 'the', 'little', 'things', 'the', 'fantasy', 'of', 'the', 'guard', 'which', 'rather', 'than', 'use', 'the', 'traditional', 'dream', 'techniques', 'remains', 'solid', 'then', 'disappears', 'it', 'plays', 'on', 'our', 'knowledge', 'and', 'our', 'senses', 'particularly', 'with', 'the', 'scenes', 'concerning', 'orton', 'and', 'halliwell', 'and', 'the', 'sets', 'particularly', 'of', 'their', 'flat', 'with', 'halliwells', 'murals', 'decorating', 'every', 'surface', 'are', 'terribly', 'well', 'done']

Let’s create a new column with a list of tokenized words for each review.

reviews_df['review_ls'] = reviews_df['clean_review'].apply(lambda x: x.split(" "))
reviews_df.head()

Output:

dataframe of reviews with additional columns for clean text and tokenized list of words

4 – Create a corpus for positive and negative reviews

Now that we have tokenized the reviews, we can create lists containing words in all the positive and negative reviews. For this, we’ll use itertools to chain together all the positive and negative reviews in single lists.
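itertools.chain simply concatenates the inner lists into one stream; a tiny standalone sketch:

```python
import itertools

# chain(*lists) unpacks the outer list and yields each inner element in order
lists = [["a", "b"], ["c"], ["d", "e"]]
flat = list(itertools.chain(*lists))
print(flat)  # → ['a', 'b', 'c', 'd', 'e']
```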

import itertools

# positive reviews
positive_reviews = reviews_df[reviews_df['sentiment']=='positive']['review_ls']
print("Total positive reviews: ", len(positive_reviews))
positive_reviews_words = list(itertools.chain(*positive_reviews))
print("Total words in positive reviews:", len(positive_reviews_words))

# negative reviews
negative_reviews = reviews_df[reviews_df['sentiment']=='negative']['review_ls']
print("Total negative reviews: ", len(negative_reviews))
negative_reviews_words = list(itertools.chain(*negative_reviews))
print("Total words in negative reviews:", len(negative_reviews_words))

Output:

Total positive reviews:  25000
Total words in positive reviews: 5721948
Total negative reviews:  25000
Total words in negative reviews: 5631466

Now we have one list each for all the words used in positive reviews and all the words used in negative reviews.

5 – Estimate the word frequency in the corpus

Let’s find the frequency of each word in the positive and the negative corpus. For this, we’ll use collections.Counter that returns an object which is essentially a dictionary with word to frequency mappings.

import collections

positive_words_frequency = collections.Counter(positive_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common positive words:", positive_words_frequency.most_common(10))

negative_words_frequency = collections.Counter(negative_reviews_words)
# top 10 most frequent words in negative reviews
print("Most common negative words:", negative_words_frequency.most_common(10))

Output:

Most common positive words: [('the', 332496), ('and', 174195), ('a', 162381), ('of', 151419), ('to', 130495), ('is', 111355), ('in', 97366), ('it', 75383), ('i', 68680), ('this', 66846)]
Most common negative words: [('the', 318041), ('a', 156823), ('and', 145139), ('of', 136641), ('to', 135780), ('is', 98688), ('in', 85745), ('this', 78581), ('i', 76770), ('it', 75840)]

You can see that we get just generic words like “the”, “a”, “and”, etc. as the most frequent words. Such words are called “stop words”; they occur frequently in a corpus but do not necessarily offer discriminative information.

Let’s remove these “stop words” and see which words occur more frequently. To remove the stop words we’ll use the nltk library which has a predefined list of stop words for multiple languages.

import nltk
nltk.download("stopwords")

The above code downloads the stopwords from nltk. We can now go ahead and create a list of English stopwords.

from nltk.corpus import stopwords

# list of english stop words
stopwords_ls = list(set(stopwords.words("english")))
print("Total English stopwords: ", len(stopwords_ls))
print(stopwords_ls[:10])

Output:

Total English stopwords:  179
['some', 'than', 'below', 'once', 'ourselves', "it's", 'these', 'been', 'more', 'which']

We get a list of 179 English stopwords. Note that some of the stopwords contain punctuation. If we are to remove stopwords from our corpus, it makes sense to apply the same preprocessing to the stopwords that we applied to our corpus text.

# cleaning the words in the stopwords list
stopwords_ls = [clean_text(word) for word in stopwords_ls]
print(stopwords_ls[:10])

Output:

['some', 'than', 'below', 'once', 'ourselves', 'its', 'these', 'been', 'more', 'which']

Now, let’s go ahead and remove these words from our positive and negative reviews corpuses using list comprehensions.

# remove stopwords
positive_reviews_words = [word for word in positive_reviews_words if word not in stopwords_ls]
print("Total words in positive reviews:", len(positive_reviews_words))
negative_reviews_words = [word for word in negative_reviews_words if word not in stopwords_ls]
print("Total words in negative reviews:", len(negative_reviews_words))

Output:

Total words in positive reviews: 3019338
Total words in negative reviews: 2944033
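One practical aside: testing membership against a Python list scans the whole list for every word, so with millions of words it is much faster to convert the stopword list to a set first. A minimal sketch with small stand-in lists (the tutorial’s variables would work the same way):

```python
# small stand-ins for stopwords_ls and the corpus word lists used above
stopwords = ["the", "a", "and", "of"]
words = ["the", "film", "a", "great", "story"]

stopwords_set = set(stopwords)  # O(1) average-time membership tests
filtered = [w for w in words if w not in stopwords_set]
print(filtered)  # → ['film', 'great', 'story']
```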

We can see a significant reduction in the size of the corpuses after removing the stopwords. Now let’s see the most common words in the positive and the negative corpuses.

positive_words_frequency = collections.Counter(positive_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common positive words:", positive_words_frequency.most_common(10))

negative_words_frequency = collections.Counter(negative_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common negative words:", negative_words_frequency.most_common(10))

Output:

Most common positive words: [('film', 39412), ('movie', 36018), ('one', 25727), ('', 19273), ('like', 17054), ('good', 14342), ('great', 12643), ('story', 12368), ('see', 11864), ('time', 11770)]
Most common negative words: [('movie', 47480), ('film', 35040), ('one', 24632), ('like', 21768), ('', 21677), ('even', 14916), ('good', 14140), ('bad', 14065), ('would', 13633), ('really', 12218)]

You can see that words like “good” and “great” occur frequently in positive reviews while the word “bad” is frequently present in negative reviews. Also note the empty string '' among the top entries: splitting on a single space produces empty tokens wherever consecutive spaces occur in the cleaned text. Finally, a number of words such as “movie” and “film” occur commonly in both positive and negative reviews, which is simply due to the nature of the data itself since it is all movie reviews.
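Since many words are shared across both corpora, one way to surface sentiment-specific words is Counter subtraction, which keeps only the positive differences. This is a sketch with small stand-in counters, not part of the tutorial’s pipeline:

```python
from collections import Counter

# stand-ins for positive_words_frequency and negative_words_frequency
pos = Counter({"film": 5, "great": 4, "movie": 3})
neg = Counter({"film": 5, "bad": 4, "movie": 6})

# words relatively more frequent in the positive corpus;
# zero and negative differences are dropped by Counter subtraction
print((pos - neg).most_common())  # → [('great', 4)]
```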

6 – Visualize the word counts

We can visualize the above frequencies as charts to better show their counts. Let’s plot a horizontal bar chart of the 10 most frequent words in both the corpuses.

First, let’s create a dataframe each for the top 10 most frequent words in positive and negative corpuses.

positive_freq_words_df = pd.DataFrame(positive_words_frequency.most_common(10),
                                     columns=["Word", "Frequency"])
print(positive_freq_words_df)

Output:

    Word  Frequency
0   film      39412
1  movie      36018
2    one      25727
3             19273
4   like      17054
5   good      14342
6  great      12643
7  story      12368
8    see      11864
9   time      11770

negative_freq_words_df = pd.DataFrame(negative_words_frequency.most_common(10),
                                     columns=["Word", "Frequency"])
print(negative_freq_words_df)

Output:

     Word  Frequency
0   movie      47480
1    film      35040
2     one      24632
3    like      21768
4              21677
5    even      14916
6    good      14140
7     bad      14065
8   would      13633
9  really      12218

Horizontal bar plot of the most frequent words in the positive reviews:

import matplotlib.pyplot as plt

# set figure size
fig, ax = plt.subplots(figsize=(12, 8))
# plot horizontal bar plot
positive_freq_words_df.sort_values(by='Frequency').plot.barh(x="Word", y="Frequency", ax=ax)
# set the title
plt.title("Most Common words in positive corpus")
plt.show()

Output:

Horizontal bar plot of most frequent words in the positive corpus.

Horizontal bar plot of the most frequent words in the negative reviews:

# set figure size
fig, ax = plt.subplots(figsize=(10, 8))
# plot horizontal bar plot
negative_freq_words_df.sort_values(by='Frequency').plot.barh(x="Word", y="Frequency", ax=ax)
# set the title
plt.title("Most Common words in negative corpus")
plt.show()

Output:

Horizontal bar plot of most frequent words in the negative reviews.

Next Steps

The above was a good exploratory analysis to see the most frequent words used in the IMDB movie reviews dataset for positive and negative reviews. As a next step, you can go ahead and train your own sentiment analysis model to take in a movie review and predict whether it’s positive or negative.

With this, we come to the end of this tutorial. The code examples and results presented in this tutorial have been implemented in a Jupyter Notebook with a Python 3.8.3 kernel and pandas version 1.0.5.


  • Piyush Raj

    Piyush is a data professional passionate about using data to understand things better and make informed decisions. He has experience working as a Data Scientist in the consulting domain and holds an engineering degree from IIT Roorkee. His hobbies include watching cricket, reading, and working on side projects.

