I’ve just started coding, so I’m not using dictionaries, sets, imports, or anything more advanced than for/while loops and if statements. How do I find the least common element(s) in a list? Here is my attempt so far:
list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]
list2 = ["cry", "cry", "cry", "no", "no", "no", "me", "me", "me"]
def codedlist(items):
    least = 0
    word = ""
    for k in items:
        if least == 0 or items.count(k) < least:
            least = items.count(k)
            word = k
    return word
asked Sep 9, 2019 at 14:06
You can use collections.Counter to find it with a one-liner:
from collections import Counter
list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]
Counter(list1).most_common()[-1]
Output:
('cry', 2)
(most_common() returns the counted elements as a list sorted by their count in descending order, so the last element [-1] is the one with the lowest count.)
Or, a bit more involved, if there can be several elements with the minimal count:
from collections import Counter
list1 = [1,2,3,4,4,4,4,4]
counted = Counter(list1).most_common()
least_count = min(counted, key=lambda y: y[1])[1]
list(filter(lambda x: x[1] == least_count, counted))
Output:
[(1, 1), (2, 1), (3, 1)]
answered Sep 9, 2019 at 14:12
You can use collections.Counter to count the frequency of each string, then min to get the minimum frequency, and finally a list comprehension to collect the strings that have that minimum frequency:
from collections import Counter
def codedlist(number):
c = Counter(number)
m = min(c.values())
return [s for s, i in c.items() if i == m]
print(codedlist(list1))
print(codedlist(list2))
Output:
['cry']
['cry', 'no', 'me']
answered Sep 9, 2019 at 14:11
from collections import Counter
def least_common(words):
d = dict(Counter(words))
min_freq = min(d.values())
return [(k,v) for k,v in d.items() if v == min_freq]
words = ["cry", "cry", "cry", "no", "no", "no", "me", "me", "me"]
print(least_common(words))
answered Sep 9, 2019 at 14:18
A simple, algorithmic way to do this:
def codedlist(my_list):
least = 99999999 # A very high number
word = ''
for element in my_list:
repeated = my_list.count(element)
if repeated < least:
least = repeated # This is just a counter
word = element # This is the word
return word
It’s not very performant, though. There are better ways to do this (one small improvement is sketched below), but I think it’s an easy way for a beginner to understand.
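Along those lines, one small improvement (a sketch under the same beginner constraints: only lists, loops, and if statements) is to skip elements you have already counted, so my_list.count() runs once per distinct word rather than once per element:
def codedlist(my_list):
    least = len(my_list) + 1  # larger than any possible count
    word = ''
    counted = []  # words we have already counted
    for element in my_list:
        if element not in counted:
            counted.append(element)
            repeated = my_list.count(element)
            if repeated < least:
                least = repeated
                word = element
    return word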
answered Sep 9, 2019 at 14:28
If you want all the words sorted by their count, in ascending order:
import numpy as np
list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]
list2 = ["cry", "cry", "cry", "no", "no", "no", "me", "me", "me"]
unique_values = np.unique(list1)
final_list = []
for i in range(0, len(unique_values)):
    final_list.append((unique_values[i], list1.count(unique_values[i])))

def takeSecond(elem):
    return elem[1]

final_list.sort(key=takeSecond)
print(final_list)
For list1:
[('cry', 2), ('no', 3), ('me', 4)]
For list2:
[('cry', 3), ('me', 3), ('no', 3)]
Be careful with this code: to change the input list you have to edit it in two places (a function version is sketched below).
Some useful explanation:
- numpy.unique gives you the non-repeated values.
- takeSecond(elem), which returns elem[1], is a function that lets you sort a list of pairs by the [1] column (the second value). It can be useful for displaying values or getting all items sorted by this criterion.
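To avoid editing the list in two places, the same logic can be wrapped in a function that takes the list as a parameter (a sketch, not part of the original answer):
import numpy as np

def sorted_counts(data):
    # count each unique value, then sort the (value, count) pairs by count
    pairs = [(str(value), data.count(value)) for value in np.unique(data)]
    pairs.sort(key=lambda pair: pair[1])
    return pairs

print(sorted_counts(list1))  # [('cry', 2), ('no', 3), ('me', 4)]
print(sorted_counts(list2))  # [('cry', 3), ('me', 3), ('no', 3)]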
Hope it helps.
answered Sep 9, 2019 at 14:33
Finding the minimum is much like finding the maximum: you count the number of occurrences of each element, and if this count is smaller than the running counter (the occurrence count of the least common element so far), you replace the counter. This is a crude solution that uses a lot of memory and takes a lot of time to run. You will understand more about lists (and their manipulation) if you try to shorten the run time and memory usage. I hope this helps!
list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]
list2 = ["cry", "cry", "cry", "no", "no", "no", "me", "me", "me"]
def codedlist(l):
min = False #This is our counter
indices = [] #This records the positions of the counts
for i in range(0,len(l)):
count = 0
for x in l: #You can possibly shorten the run time here
if(x == l[i]):
count += 1
if not min: #Also can be read as: If this is the first element.
min = count
indices = [i]
elif min > count: #If this element is the least common
min = count #Replace the counter
indices = [i] # This is your only index
elif min == count: #If this is equally least common (other elements share the same count)
indices.append(i) #Add it to our indices counter
tempList = []
#You can possibly shorten the run time below
for ind in indices:
tempList.append(l[ind])
rList = []
for x in tempList: #Remove duplicates in the list
if x not in rList:
rList.append(x)
return rList
print(codedlist(list1))
print(codedlist(list2))
Output:
['cry']
['cry', 'no', 'me']
answered Sep 9, 2019 at 14:35
Probably the simplest and fastest approach to retrieve the least common item in a collection:
min(list1, key=list1.count)
In action:
>>> data = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]
>>> min(data, key=data.count)
'cry'
I tested the speed against the collections.Counter approach and it was much faster on this small input. (Note that list.count makes this approach quadratic overall, so Counter should win on large inputs.)
P.S.: The same can be done with max to find the most common item:
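For the sample data above, that looks like:
>>> max(data, key=data.count)
'me'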
Edit
To get multiple least common items you can extend this approach using a comprehension.
>>> lc = data.count(min(data, key=data.count))
>>> {i for i in data if data.count(i) == lc}
{'no', 'me', 'cry'}
answered Sep 9, 2019 at 14:43
Basically you want to go through your list and at each element ask yourself: "Have I seen this element before?" If the answer is yes, you add 1 to the count of that element; if the answer is no, you add it to the dictionary of seen values. Finally, we sort the items by value and pick the first word, as that one has the smallest count. Let's implement it:
import operator

words = ['blah', 'blah', 'car']
seen_dictionary = {}
for w in words:
    if w in seen_dictionary:
        seen_dictionary[w] += 1
    else:
        seen_dictionary[w] = 1

# .items() gives (word, count) tuples; sorting by the second element of each
# tuple puts the word with the smallest count first.
final_word = sorted(seen_dictionary.items(), key=operator.itemgetter(1))[0][0]
answered Sep 10, 2019 at 11:31
def codedlist(items):
    counts = {}
    for item in items:
        counts[item] = items.count(item)
    least_common_number = min(counts.values())
    least_common = []
    for k, v in counts.items():
        if least_common_number == v:
            least_common.append(k)
    return least_common
list1 = ["cry", "me", "me", "no", "me", "no", "no", "cry", "me"]
list2 = ["cry", "cry", "cry", "no", "no", "no", "me", "me", "me"]
print(codedlist(list1))
answered Sep 9, 2019 at 14:43
I’d read a fixed amount of the file at a time, split it into characters, and then split on whitespace. This is better than splitting on each newline, as the file may be all one line.
To do the former in Python 3 is fairly simple:
def read_chunks(file, chunk_size):
while True:
chunk = file.read(chunk_size)
if not chunk:
break
yield from chunk
This has $O(\text{chunk\_size})$ memory usage, which is $O(1)$ as it’s a constant. It also correctly ends the iterator when the file ends.
After this, you want to split the words up. Since we’re using str.split
without any arguments, we should write just that method of splitting. We can use a fairly simple algorithm:
from string import whitespace
def split_whitespace(it):
chunk = []
for char in it:
if char not in whitespace:
chunk.append(char)
elif chunk:
yield tuple(chunk)
chunk = []
if chunk:
yield tuple(chunk)
This has $O(k)$ memory, where $k$ is the size of the largest word. What we’d expect of a splitting function.
Finally, we’d change from tuples to strings using ''.join, and then use collections.Counter. We’d also split reading the words and finding the most common ones into two different functions.
And so for an $O(k)$ memory usage version of your code, I’d use:
import sys
from collections import Counter
from string import whitespace
def read_chunks(file, chunk_size):
while True:
chunk = file.read(chunk_size)
if not chunk:
break
yield from chunk
def split_whitespace(it):
chunk = []
for char in it:
if char not in whitespace:
chunk.append(char)
elif chunk:
yield tuple(chunk)
chunk = []
if chunk:
yield tuple(chunk)
def read_words(path, chunk_size=1024):
with open(path) as f:
chars = read_chunks(f, chunk_size)
tuple_words = split_whitespace(chars)
yield from map(''.join, tuple_words)
def most_common_words(words, top=10):
return dict(Counter(words).most_common(top))
if __name__ == '__main__':
words = read_words(sys.argv[1])
top_five_words = most_common_words(words, 5)
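As a quick usage sketch of the two helpers (the file name and its contents here are hypothetical):
# Suppose words.txt contains: a b b a a
words = read_words('words.txt')
print(most_common_words(words, 2))
# {'a': 3, 'b': 2}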
The challenge
Write a function that, given a string of text (possibly with punctuation and line-breaks), returns an array of the top-3 most occurring words, in descending order of the number of occurrences.
Assumptions:
- A word is a string of letters (A to Z) optionally containing one or more apostrophes (‘) in ASCII. (No need to handle fancy punctuation.)
- Matches should be case-insensitive, and the words in the result should be lowercased.
- Ties may be broken arbitrarily.
- If a text contains fewer than three unique words, then either the top-2 or top-1 words should be returned, or an empty array if a text contains no words.
Examples:
top_3_words("In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.")
# => ["a", "of", "on"]
top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")
# => ["e", "ddd", "aa"]
top_3_words(" //wont won't won't")
# => ["won't", "wont"]
Bonus points:
- Avoid creating an array whose memory footprint is roughly as big as the input text.
- Avoid sorting the entire array of unique words.
Test cases
from random import choice, randint, sample, shuffle, choices
import re
from collections import Counter
def check(s, this=None): # this: only for debugging purpose
returned_result = top_3_words(s) if this is None else this
fs = Counter(w for w in re.findall(r"[a-zA-Z']+", s.lower()) if w != "'" * len(w))
exp,expected_frequencies = map(list,zip(*fs.most_common(3))) if fs else ([],[])
msg = ''
wrong_words = [w for w in returned_result if not fs[w]]
actual_freq = [fs[w] for w in returned_result]
if wrong_words:
msg = 'Incorrect match: words not present in the string. Your output: {}. One possible valid answer: {}'.format(returned_result, exp)
elif len(set(returned_result)) != len(returned_result):
msg = 'The result should not contain copies of the same word. Your output: {}. One possible output: {}'.format(returned_result, exp)
elif actual_freq!=expected_frequencies:
msg = "Incorrect frequencies: {} should be {}. Your output: {}. One possible output: {}".format(actual_freq, expected_frequencies, returned_result, exp)
Test.expect(not msg, msg)
@test.describe("Fixed tests")
def fixed_tests():
TESTS = (
"a a a b c c d d d d e e e e e",
"e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e",
" //wont won't won't ",
" , e .. ",
" ... ",
" ' ",
" ''' ",
"""In a village of La Mancha, the name of which I have no desire to cao
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.""",
"a a a b c c X",
"a a c b b",
)
for s in TESTS: check(s)
@test.describe("Random tests")
def random_tests():
def gen_word():
return "".join(choice("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'") for _ in range(randint(3, 10)))
def gen_string():
words = []
nums = choices(range(1, 31), k=20)
for _ in range(randint(0, 20)):
words += [gen_word()] * nums.pop()
shuffle(words)
s = ""
while words:
s += words.pop() + "".join(choice("-,.?!_:;/ ") for _ in range(randint(1, 5)))
return s
@test.it("Tests")
def it_1():
for _ in range(100): check(gen_string())
The solution using Python
Option 1:
# use the Counter class from the collections module
from collections import Counter
# use the regex module
import re
def top_3_words(text):
# lowercase the text, drop apostrophe-only words, then count the regex matches
c = Counter(re.findall(r"[a-z']+", re.sub(r" '+ ", " ", text.lower())))
# return the `most common` 3 items
return [w for w,_ in c.most_common(3)]
Option 2:
def top_3_words(text):
# loop through each character in the string
for c in text:
# if it's not a letter or an apostrophe
if not (c.isalpha() or c=="'"):
# replace with a space
text = text.replace(c,' ')
# create some `list` variables
words,counts,out = [],[],[]
# loop through the words in the text
for word in list(filter(None,text.lower().split())):
# if the word contains no letters at all, skip it
if all([not c.isalpha() for c in word]):
continue
# if the word is in the words list
if word in words:
# increment the count
counts[words.index(word)] += 1
else:
# otherwise create a new entry
words.append(word); counts.append(0)
# loop while words remain and we have fewer than 3 results
while len(words)>0 and len(out)<3:
# pop the word with the highest count into the output
out.append(words.pop(counts.index(max(counts))).lower())
counts.remove(max(counts))
# return the top words
return out
Option 3:
def top_3_words(text):
wrds = {}
for p in r'!"#$%&()*+,./:;<=>?@[]^_`{|}~-':
text = text.replace(p, ' ')
for w in text.lower().split():
if w.replace("'", '') != '':
wrds[w] = wrds.get(w, 0) + 1
return [y[0] for y in sorted(wrds.items(), key=lambda x: x[1], reverse=True)[:3]]
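On the bonus points: Option 3 sorts every unique word just to keep three of them. One way to avoid the full sort (a sketch, not one of the original options) is heapq.nlargest, which only maintains a heap of size 3; in CPython, Counter.most_common(n) already uses heapq.nlargest internally when n is given.
import heapq
import re
from collections import Counter

def top_3_words(text):
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    # drop apostrophe-only matches such as "'''"
    counts = {w: c for w, c in counts.items() if w.strip("'")}
    return heapq.nlargest(3, counts, key=counts.get)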
Given Strings List, write a Python program to get word with most number of occurrences.
Example:
Input : test_list = ["gfg is best for geeks", "geeks love gfg", "gfg is best"]
Output : gfg
Explanation : gfg occurs 3 times, most among the strings in total.

Input : test_list = ["geeks love gfg", "geeks are best"]
Output : geeks
Explanation : geeks occurs 2 times, most among the strings in total.
Method #1 : Using loop + max() + split() + defaultdict()
In this, we get each word using split() and increase its frequency by memoizing counts in a defaultdict(). At the end, max() with a key parameter is used to get the string with the maximum frequency.
Python3
from collections import defaultdict

test_list = ["gfg is best for geeks", "geeks love gfg", "gfg is best"]

print("The original list is : " + str(test_list))

# count the frequency of each word across all strings
temp = defaultdict(int)
for sub in test_list:
    for wrd in sub.split():
        temp[wrd] += 1

# word with the highest count
res = max(temp, key=temp.get)

print("Word with maximum frequency : " + str(res))
Output
The original list is : ['gfg is best for geeks', 'geeks love gfg', 'gfg is best']
Word with maximum frequency : gfg
Time Complexity: O(n*n)
Auxiliary Space: O(n)
Method #2 : Using list comprehension + mode()
In this, we get all the words using a list comprehension and find the most frequent one using mode().
Python3
from statistics import mode

test_list = ["gfg is best for geeks", "geeks love gfg", "gfg is best"]

print("The original list is : " + str(test_list))

# flatten the strings into a single list of words
temp = [wrd for sub in test_list for wrd in sub.split()]

# mode() returns the most frequent element
res = mode(temp)

print("Word with maximum frequency : " + str(res))
Output
The original list is : ['gfg is best for geeks', 'geeks love gfg', 'gfg is best']
Word with maximum frequency : gfg
Method #3: Using list() and Counter()
- Append all words to empty list and calculate frequency of all words using Counter() function.
- Find max count and print that key.
Below is the implementation:
Python3
from collections import Counter

def mostFrequentWord(words):
    # collect every word from every string
    lis = []
    for i in words:
        for j in i.split():
            lis.append(j)
    freq = Counter(lis)
    # scan the counts for the maximum
    max_count = 0
    for i in freq:
        if freq[i] > max_count:
            max_count = freq[i]
            word = i
    return word

words = ["gfg is best for geeks", "geeks love gfg", "gfg is best"]

print("The original list is : " + str(words))
print("Word with maximum frequency : " + mostFrequentWord(words))
Output
The original list is : ['gfg is best for geeks', 'geeks love gfg', 'gfg is best']
Word with maximum frequency : gfg
The time and space complexity for all the methods are the same:
Time Complexity: O(n^2)
Space Complexity: O(n)
Method #4: Using Counter() and reduce()
Here is an approach to solve the problem using the most_common() function of the collections module’s Counter class and the reduce() function from the functools module:
Python3
from collections import Counter
from functools import reduce

def most_frequent_word(test_list):
    # concatenate the per-string word lists into one list
    all_words = reduce(lambda a, b: a + b, [sub.split() for sub in test_list])
    word_counts = Counter(all_words)
    return word_counts.most_common(1)[0][0]

test_list = ["gfg is best for geeks", "geeks love gfg", "gfg is best"]

print("The original list is: ", test_list)
print("Word with most frequency: ", most_frequent_word(test_list))
Output
The original list is: ['gfg is best for geeks', 'geeks love gfg', 'gfg is best']
Word with most frequency: gfg
Explanation:
We use the reduce() function to concatenate the list of all words from each string in the test_list.
We then create a Counter object from the list of all words to get a count of the frequency of each word.
Finally, we use the most_common() function to get the word with the highest frequency and return it.
Time complexity: O(n * k), where n is the number of strings in the test_list and k is the average number of words in each string.
Auxiliary Space: O(n * k), since we are storing the words in a list before creating a Counter object.
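As an aside, reduce with the + operator copies the accumulated list at every step, which is what drives the cost up. itertools.chain.from_iterable flattens the word lists lazily without those copies; a minimal sketch of that variant (not part of the original methods):
from collections import Counter
from itertools import chain

def most_frequent_word(test_list):
    # lazily flatten the per-string word lists, then count
    all_words = chain.from_iterable(sub.split() for sub in test_list)
    return Counter(all_words).most_common(1)[0][0]

print(most_frequent_word(["gfg is best for geeks", "geeks love gfg", "gfg is best"]))
# gfg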
In this tutorial, you’ll learn how to use the Python Counter class from the collections module to count items. The Counter class provides an incredibly pythonic method to count items in lists, tuples, strings, and more. Because counting items is a common task in programming, being able to do this easily and elegantly is a useful skill for any Pythonista.
The Counter
class provides a subclass to the Python dictionary, adding in many useful ways to easily count items in another object. For example, you can easily return the number of items, the most common item, and even undertake arithmetic on different Counter
items.
By the end of this tutorial, you’ll have learned:
- How to use the Counter class to count items in Python
- How to get the most and least common items in a Counter object
- How to add and subtract different Counter objects
- How to update Counter objects in Python
Understanding Python’s Collection Counter Class
The Python Counter
class is an integral part of the collections
module. The class provides incredibly intuitive and Pythonic methods to count items in an iterable, such as lists, tuples, or strings. This allows you to count the frequency of items within that iterable, including finding the most common item.
Let’s start by creating an empty Counter
object. We first need to import the class from the collections
module. Following that, we can instantiate the object:
# Creating an Empty Counter Object
from collections import Counter
counter = Counter()
Now that we have our first Counter
object created, let’s explore some of the properties of the object. For example, we can check its type by using the type()
function. We can also verify that the object is a subclass of the Python dictionary.
# Checking Attributes of the Python Counter Class
from collections import Counter
counter = Counter()
print('Type of counter is: ',type(counter))
print('Counter is a subclass of a dictionary: ', issubclass(Counter, dict))
# Returns:
# Type of counter is: <class 'collections.Counter'>
# Counter is a subclass of a dictionary: True
Now that you have an understanding of the Python Counter
class, let’s get started with creating our first Counter object!
Creating a Counter Object in Python
Let’s create our first Python Counter object. We can pass in a string and the Counter
object will return the counts of all the letters in that string.
The class takes only a single parameter, the item we want to count. Let’s see how we can use it:
# Creating Our First Counter
from collections import Counter
a_string = 'hello! welcome to datagy'
counter = Counter(a_string)
print(counter)
# Returns:
# Counter({'e': 3, 'l': 3, 'o': 3, ' ': 3, 't': 2, 'a': 2, 'h': 1, '!': 1, 'w': 1, 'c': 1, 'm': 1, 'd': 1, 'g': 1, 'y': 1})
By printing out our counter, we’re able to see that it returns a dictionary-like object whose items are ordered from most to least frequent. In this case, we can see that the letter 'e'
exists three times in our string.
Accessing Counter Values in Python
Because the Counter object returns a subclass of a dictionary, we can use dictionary methods to access the counts of an item in that dictionary. Let’s see how we can access the number of times the letter 'a'
appears in our string:
# Accessing Counts in a Counter Object
from collections import Counter
a_string = 'hello! welcome to datagy'
counter = Counter(a_string)
print(counter['a'])
# Returns: 2
We can see that the letter 'a'
exists twice in our string. We can even access the counts of items that don’t exist in our object.
# Counting Items that Don't Exist
from collections import Counter
a_string = 'hello! welcome to datagy'
counter = Counter(a_string)
print(counter['z'])
# Returns: 0
In a normal Python dictionary, this would raise a KeyError
. However, the Counter
class has been designed to prevent this by overriding the default behavior.
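A quick side-by-side sketch of the difference:
# Missing Keys: dict vs. Counter
from collections import Counter

plain = dict(Counter('hello'))
# plain['z']  # would raise a KeyError

counter = Counter('hello')
print(counter['z'])
# Returns: 0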
Finding the Most Common Item in a Python Counter
The Counter
class makes it easy to find the most common item in a given object. This can be done by applying the .most_common()
method onto the object. Let’s see how we can find the most common item in our object:
# Finding the Most Common Item
from collections import Counter
a_string = 'hello! welcome to datagy'
counter = Counter(a_string)
print(counter.most_common())
# Returns: [('e', 3), ('l', 3), ('o', 3), (' ', 3), ('t', 2), ('a', 2), ('h', 1), ('!', 1), ('w', 1), ('c', 1), ('m', 1), ('d', 1), ('g', 1), ('y', 1)]
We can see that this returns a list of tuples that’s been ordered by placing the most common items first. Because of this, we can access the most common item by accessing the first index:
# Accessing the Most Common Item
from collections import Counter
a_string = 'hello! welcome to datagy'
counter = Counter(a_string)
print(counter.most_common()[0])
# Returns: ('e', 3)
Finding the Least Common Item in a Python Counter
Similarly, we can access the least common item by getting the last index:
# Accessing the Least Common Item
from collections import Counter
a_string = 'hello! welcome to datagy'
counter = Counter(a_string)
print(counter.most_common()[-1])
# Returns: ('y', 1)
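If you need the n least common items rather than just one, the collections documentation suggests a reversed slice of most_common():
# Getting n Number of Least Common Items
from collections import Counter

a_string = 'hello! welcome to datagy'
counter = Counter(a_string)

n = 3
print(counter.most_common()[:-n-1:-1])
# Returns: [('y', 1), ('g', 1), ('d', 1)]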
Finding n Most Common Items in a Python Counter
The method also allows you to pass in an integer to return just that number of items. Say we wanted to get the three most common items; you could write:
# Getting n Number of Most Common Items
from collections import Counter
a_string = 'hello! welcome to datagy'
counter = Counter(a_string)
print(counter.most_common(3))
# Returns: [('e', 3), ('l', 3), ('o', 3)]
Updating Counter Values in Python
One of the great things about the Python Counter
is that values can also be updated. This can be done using the .update()
method. The method accepts another iterable, which will update the values in place.
Let’s see how we can first count items using the collection’s Counter
class and then pass in another iterable to update our counts.
# Updating Counter Values in Python
from collections import Counter
a_list = [1,2,3,1,2,3,1,1,2,1]
counter = Counter(a_list)
print(counter)
counter.update([1,2,3,4])
print(counter)
# Returns:
# Counter({1: 5, 2: 3, 3: 2})
# Counter({1: 6, 2: 4, 3: 3, 4: 1})
We can see that the values were updated in place, with the values of the new item.
Deleting Counter Values in Python
It’s also very easy to delete an item from a Counter object. This can be useful when you either want to reset a value or simply want to remove an item from being counted.
You can delete an item from a Counter object by using the del
keyword. Let’s load a Counter object and then delete a value:
# Deleting an Item from a Counter Object
from collections import Counter
a_list = [1,2,3,1,2,3,1,1,2,1]
counter = Counter(a_list)
print(counter)
del counter[1]
print(counter)
# Returns:
# Counter({1: 5, 2: 3, 3: 2})
# Counter({2: 3, 3: 2})
Arithmetic Operations on Counter Objects in Python
It’s equally easy to apply arithmetic operations like addition and subtraction on Counter objects. This allows you to combine Counter objects or find the difference between two items.
This can be done using the +
and the -
operators respectively. Let’s take a look at addition first:
# Adding 2 Counter Objects Together
from collections import Counter
counter1 = Counter([1,1,2,3,4])
counter2 = Counter([1,2,3,4])
print(counter1 + counter2)
# Returns:
# Counter({1: 3, 2: 2, 3: 2, 4: 2})
Now let’s subtract the two counters:
# Subtracting 2 Counter Objects
from collections import Counter
counter1 = Counter([1,1,2,3,4])
counter2 = Counter([1,2,3,4])
print(counter1 - counter2)
# Returns:
# Counter({1: 1})
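Note that the - operator keeps only positive counts. If you want to keep zero and negative results as well, Counter also provides a .subtract() method, which updates the counts in place:
# Subtracting In Place, Keeping Zero and Negative Counts
from collections import Counter

counter1 = Counter([1,1,2,3,4])
counter2 = Counter([1,2,3,4])

counter1.subtract(counter2)
print(counter1)
# Returns:
# Counter({1: 1, 2: 0, 3: 0, 4: 0})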
Combining Counter Objects in Python
We can also combine Counter objects using the &
and |
operators. These serve very different purposes. Let’s break them down a bit:
- & will return the common positive minimum values
- | will return the positive maximum values
Let’s take a look at the &
operator first:
# Finding Common Minimum Elements
from collections import Counter
counter1 = Counter([1,1,2,2,2,3,3,4])
counter2 = Counter([1,2,3,4,5])
print(counter1 & counter2)
# Returns:
# Counter({1: 1, 2: 1, 3: 1, 4: 1})
Now let’s take a look at the maximums between the two Counter objects:
# Finding Maximum Elements
from collections import Counter
counter1 = Counter([1,1,2,2,2,3,3,4])
counter2 = Counter([1,2,3,4,5])
print(counter1 | counter2)
# Returns:
# Counter({2: 3, 1: 2, 3: 2, 4: 1, 5: 1})
Finding the Most Common Word in a Python String
Before closing out the tutorial, let’s take a look at a practical example. We can use Python’s Counter
class to find the most common word in a string. Let’s load the Zen of Python and find the most common word in that string.
Before we pass the string into the Counter
class, we need to split it. We can use the .split()
method to split at any white-space character, including newlines. Then we can apply the .most_common()
method and access the first item’s value by accessing the [0][0]
item:
# Finding the Most Frequent Word in a String
from collections import Counter
text = """
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
"""
counter = Counter(text.split())
print(counter.most_common()[0][0])
# Returns: is
Conclusion
In this post, you learned how to use the Python collection’s Counter
class. You started off by learning how the class can be used to create frequencies of an iterable object. You then learned how to find the counts of a particular item and how to find the most and least frequent item. You then learned how to update counts, as well as perform arithmetic on these count items.
Additional Resources
To learn more about related topics, check out the tutorials below:
- Python Defaultdict: Overview and Examples
- Python: Add Key:Value Pair to Dictionary
- Python Merge Dictionaries – Combine Dictionaries (7 Ways)
- Python: Sort a Dictionary by Values
- Official Documentation: Python collections Counter
In this tutorial, we’ll look at how to count the frequency of each word in a string corpus in python. We’ll also compare the frequency with visualizations like bar charts.
To count the frequency of each word in a string, you’ll first have to tokenize the string into individual words. Then, you can use the collections.Counter module to count each element in the list resulting in a dictionary of word counts. The following is the syntax:
import collections

s = "the cat and the dog are fighting"
s_counts = collections.Counter(s.split(" "))
Here, s_counts is a dictionary (more precisely, a collections.Counter object, which is a subclass of dict) storing the word: count mapping based on the frequencies in the corpus. You can use it like any dictionary. But if you specifically want to convert it into a plain dictionary, use dict(s_counts), as shown below.
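For the string above, that conversion looks like this:
import collections

s = "the cat and the dog are fighting"
s_counts = collections.Counter(s.split(" "))

print(dict(s_counts))
# {'the': 2, 'cat': 1, 'and': 1, 'dog': 1, 'are': 1, 'fighting': 1}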
Let’s look at an example of extracting the frequency of each word from a string corpus in python.
Count of each word in Movie Reviews dataset
We use the IMDB movie reviews dataset, which you can download here. The dataset has 50,000 movie reviews written by users. We’ll be using this dataset to see the most frequent words used by the reviewers in positive and negative reviews.
1 – Load the data
First we load the data as a pandas dataframe using the read_csv() function.
import pandas as pd

# read the csv file as a dataframe
reviews_df = pd.read_csv(r"C:\Users\piyush\Documents\Projects\movie_reviews_data\IMDB Dataset.csv")
print(reviews_df.head())
Output:
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
The dataframe has two columns: "review", storing the review text, and "sentiment", storing the sentiment associated with the review. Let’s examine how many samples we have for each sentiment.
print(reviews_df['sentiment'].value_counts())
Output:
positive    25000
negative    25000
Name: sentiment, dtype: int64
We have 25000 samples each for “positive” and “negative” sentiments.
2 – Cleaning the text
If we look at the entries in the “review” column, we can find that the reviews contain a number of unwanted elements or styles such as HTML tags, punctuations, inconsistent use of lower and upper case, etc. that could hinder our analysis. For example,
print(reviews_df['review'][1])
Output:
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.
You can see that in the above review, we have HTML tags, quotes, punctuations, etc. that could be cleaned. Let’s write a function to clean the text in the reviews.
import re
import string

def clean_text(text):
    """
    Function to clean the text.
    Parameters:
        text: the raw text as a string value that needs to be cleaned
    Returns:
        cleaned_text: the cleaned text as string
    """
    # convert to lower case
    cleaned_text = text.lower()
    # remove HTML tags
    html_pattern = re.compile('<.*?>')
    cleaned_text = re.sub(html_pattern, '', cleaned_text)
    # remove punctuations
    cleaned_text = cleaned_text.translate(str.maketrans('', '', string.punctuation))
    return cleaned_text.strip()
The above function performs the following operations on the text:
- Convert the text to lower case
- Remove HTML tags from the text using regular expressions.
- Remove punctuations from the text using a translation table.
Let’s see the above function in action.
print(clean_text(reviews_df['review'][1]))
Output:
a wonderful little production the filming technique is very unassuming very oldtimebbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great masters of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwells murals decorating every surface are terribly well done
You can see that the text is now consistent enough to be split into individual words. Let’s apply this function to the "review" column and create a new column of clean reviews.
reviews_df['clean_review'] = reviews_df['review'].apply(clean_text)
3 – Tokenize the text into words
You can use the string split() function to create a list of individual tokens from a string. For example,
print(clean_text(reviews_df['review'][1]).split(" "))
Output:
['a', 'wonderful', 'little', 'production', 'the', 'filming', 'technique', 'is', 'very', 'unassuming', 'very', 'oldtimebbc', 'fashion', 'and', 'gives', 'a', 'comforting', 'and', 'sometimes', 'discomforting', 'sense', 'of', 'realism', 'to', 'the', 'entire', 'piece', 'the', 'actors', 'are', 'extremely', 'well', 'chosen', 'michael', 'sheen', 'not', 'only', 'has', 'got', 'all', 'the', 'polari', 'but', 'he', 'has', 'all', 'the', 'voices', 'down', 'pat', 'too', 'you', 'can', 'truly', 'see', 'the', 'seamless', 'editing', 'guided', 'by', 'the', 'references', 'to', 'williams', 'diary', 'entries', 'not', 'only', 'is', 'it', 'well', 'worth', 'the', 'watching', 'but', 'it', 'is', 'a', 'terrificly', 'written', 'and', 'performed', 'piece', 'a', 'masterful', 'production', 'about', 'one', 'of', 'the', 'great', 'masters', 'of', 'comedy', 'and', 'his', 'life', 'the', 'realism', 'really', 'comes', 'home', 'with', 'the', 'little', 'things', 'the', 'fantasy', 'of', 'the', 'guard', 'which', 'rather', 'than', 'use', 'the', 'traditional', 'dream', 'techniques', 'remains', 'solid', 'then', 'disappears', 'it', 'plays', 'on', 'our', 'knowledge', 'and', 'our', 'senses', 'particularly', 'with', 'the', 'scenes', 'concerning', 'orton', 'and', 'halliwell', 'and', 'the', 'sets', 'particularly', 'of', 'their', 'flat', 'with', 'halliwells', 'murals', 'decorating', 'every', 'surface', 'are', 'terribly', 'well', 'done']
Let’s create a new column with a list of tokenized words for each review.
reviews_df['review_ls'] = reviews_df['clean_review'].apply(lambda x: x.split(" "))
reviews_df.head()
Output: (dataframe head showing the new review_ls column with the list of tokenized words for each review)
4 – Create a corpus for positive and negative reviews
Now that we have tokenized the reviews, we can create lists containing words in all the positive and negative reviews. For this, we’ll use itertools to chain together all the positive and negative reviews in single lists.
import itertools

# positive reviews
positive_reviews = reviews_df[reviews_df['sentiment']=='positive']['review_ls']
print("Total positive reviews: ", len(positive_reviews))
positive_reviews_words = list(itertools.chain(*positive_reviews))
print("Total words in positive reviews:", len(positive_reviews_words))

# negative reviews
negative_reviews = reviews_df[reviews_df['sentiment']=='negative']['review_ls']
print("Total negative reviews: ", len(negative_reviews))
negative_reviews_words = list(itertools.chain(*negative_reviews))
print("Total words in negative reviews:", len(negative_reviews_words))
Output:
Total positive reviews:  25000
Total words in positive reviews: 5721948
Total negative reviews:  25000
Total words in negative reviews: 5631466
Now we have one list each for all the words used in positive reviews and all the words used in negative reviews.
5 – Estimate the word frequency in the corpus
Let’s find the frequency of each word in the positive and the negative corpus. For this, we’ll use collections.Counter
that returns an object which is essentially a dictionary with word to frequency mappings.
import collections

positive_words_frequency = collections.Counter(positive_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common positive words:", positive_words_frequency.most_common(10))

negative_words_frequency = collections.Counter(negative_reviews_words)
# top 10 most frequent words in negative reviews
print("Most common negative words:", negative_words_frequency.most_common(10))
Output:
Most common positive words: [('the', 332496), ('and', 174195), ('a', 162381), ('of', 151419), ('to', 130495), ('is', 111355), ('in', 97366), ('it', 75383), ('i', 68680), ('this', 66846)]
Most common negative words: [('the', 318041), ('a', 156823), ('and', 145139), ('of', 136641), ('to', 135780), ('is', 98688), ('in', 85745), ('this', 78581), ('i', 76770), ('it', 75840)]
You can see that we get just generic words like "the", "a", "and", etc. as the most frequent words. Such words are called "stop words"; they occur frequently in a corpus but do not necessarily offer discriminative information.
Let’s remove these “stop words” and see which words occur more frequently. To remove the stop words we’ll use the nltk
library which has a predefined list of stop words for multiple languages.
import nltk
nltk.download("stopwords")
The above code downloads the stopwords from nltk. We can now go ahead and create a list of English stopwords.
from nltk.corpus import stopwords

# list of english stop words
stopwords_ls = list(set(stopwords.words("english")))
print("Total English stopwords: ", len(stopwords_ls))
print(stopwords_ls[:10])
Output:
Total English stopwords:  179
['some', 'than', 'below', 'once', 'ourselves', "it's", 'these', 'been', 'more', 'which']
We get a list of 179 English stopwords. Note that some of the stopwords contain punctuation. If we are to remove stopwords from our corpus, it makes sense to apply the same preprocessing to the stopwords that we applied to our corpus text.
# cleaning the words in the stopwords list
stopwords_ls = [clean_text(word) for word in stopwords_ls]
print(stopwords_ls[:10])
Output:
['some', 'than', 'below', 'once', 'ourselves', 'its', 'these', 'been', 'more', 'which']
Now, let’s go ahead and remove these words from our positive and negative reviews corpuses using list comprehensions.
# remove stopwords
positive_reviews_words = [word for word in positive_reviews_words if word not in stopwords_ls]
print("Total words in positive reviews:", len(positive_reviews_words))

negative_reviews_words = [word for word in negative_reviews_words if word not in stopwords_ls]
print("Total words in negative reviews:", len(negative_reviews_words))
Output:
Total words in positive reviews: 3019338
Total words in negative reviews: 2944033
We can see a significant reduction in the size of the corpuses after removing the stopwords. Now let’s see the most common words in the positive and the negative corpuses.
positive_words_frequency = collections.Counter(positive_reviews_words)
# top 10 most frequent words in positive reviews
print("Most common positive words:", positive_words_frequency.most_common(10))

negative_words_frequency = collections.Counter(negative_reviews_words)
# top 10 most frequent words in negative reviews
print("Most common negative words:", negative_words_frequency.most_common(10))
Output:
Most common positive words: [('film', 39412), ('movie', 36018), ('one', 25727), ('', 19273), ('like', 17054), ('good', 14342), ('great', 12643), ('story', 12368), ('see', 11864), ('time', 11770)]
Most common negative words: [('movie', 47480), ('film', 35040), ('one', 24632), ('like', 21768), ('', 21677), ('even', 14916), ('good', 14140), ('bad', 14065), ('would', 13633), ('really', 12218)]
You can see that words like "good" and "great" occur frequently in positive reviews, while the word "bad" is frequently present in negative reviews. Also note that a number of words, for example "movie" and "film", occur commonly in both positive and negative reviews, which is due to the nature of the text data itself, since it consists mostly of movie reviews.
6 – Visualize the word counts
We can visualize the above frequencies as charts to better show their counts. Let’s plot a horizontal bar chart of the 10 most frequent words in both the corpuses.
First, let’s create a dataframe each for the top 10 most frequent words in positive and negative corpuses.
positive_freq_words_df = pd.DataFrame(positive_words_frequency.most_common(10), columns=["Word", "Frequency"])
print(positive_freq_words_df)
Output:
    Word  Frequency
0   film      39412
1  movie      36018
2    one      25727
3             19273
4   like      17054
5   good      14342
6  great      12643
7  story      12368
8    see      11864
9   time      11770
negative_freq_words_df = pd.DataFrame(negative_words_frequency.most_common(10), columns=["Word", "Frequency"])
print(negative_freq_words_df)
Output:
     Word  Frequency
0   movie      47480
1    film      35040
2     one      24632
3    like      21768
4              21677
5    even      14916
6    good      14140
7     bad      14065
8   would      13633
9  really      12218
Horizontal bar plot of the most frequent words in the positive reviews:
import matplotlib.pyplot as plt

# set figure size
fig, ax = plt.subplots(figsize=(12, 8))
# plot horizontal bar plot
positive_freq_words_df.sort_values(by='Frequency').plot.barh(x="Word", y="Frequency", ax=ax)
# set the title
plt.title("Most Common words in positive corpus")
plt.show()
Output: (horizontal bar chart of the 10 most common words in the positive corpus)
Horizontal bar plot of the most frequent words in the negative reviews:
# set figure size
fig, ax = plt.subplots(figsize=(10, 8))
# plot horizontal bar plot
negative_freq_words_df.sort_values(by='Frequency').plot.barh(x="Word", y="Frequency", ax=ax)
# set the title
plt.title("Most Common words in negative corpus")
plt.show()
Output: (horizontal bar chart of the 10 most common words in the negative corpus)
Next Steps
The above was a good exploratory analysis to see the most frequent words used in the IMDB movie reviews dataset for positive and negative reviews. As a next step, you can go ahead and train your own sentiment analysis model to take in a movie review and predict whether it’s positive or negative.
With this, we come to the end of this tutorial. The code examples and results presented here were implemented in a Jupyter Notebook with a Python 3.8.3 kernel, using pandas 1.0.5.