From Wikipedia, the free encyclopedia
Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis is one that was conducted against the Oxford English Corpus (OEC), a massive corpus of English-language texts.
In total, the texts in the Oxford English Corpus contain more than 2 billion words.[1] The OEC includes a wide variety of writing samples, such as literary works, novels, academic journals, newspapers, magazines, Hansard’s Parliamentary Debates, blogs, chat logs, and emails.[2]
Another English corpus that has been used to study word frequency is the Brown Corpus, which was compiled by researchers at Brown University in the 1960s. The researchers published their analysis of the Brown Corpus in 1967. Their findings were similar, but not identical, to the findings of the OEC analysis.
According to The Reading Teacher’s Book of Lists, the first 25 words in the OEC make up about one-third of all printed material in English, and the first 100 words make up about half of all written English.[3] According to a study cited by Robert McCrum in The Story of English, all of the first hundred of the most common words in English are of Old English origin,[4] except for «people», ultimately from Latin «populus», and «because», in part from Latin «causa».
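As a rough illustration of what such coverage figures mean, the sketch below computes the share of all word tokens in a text that is accounted for by its n most frequent words. The function name `coverage`, the naive tokenizer, and the sample sentence are assumptions made for illustration; they are not taken from the cited studies.

```python
import re
from collections import Counter

def coverage(text, n):
    """Return the fraction of word tokens covered by the n most frequent words."""
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Sum the counts of the n most frequent distinct words.
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(tokens)

sample = "the cat and the dog and the bird"
# "the" occurs 3 times and "and" twice, out of 8 tokens in total.
print(coverage(sample, 2))
```

On a large corpus the same calculation, applied with n = 25 or n = 100, is the kind of measurement behind statements such as "the first 100 words make up about half of all written English".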
Some lists of common words distinguish between word forms, while others rank all forms of a word as a single lexeme (represented by its lemma, the form of the word as it would appear in a dictionary). For example, the lexeme be (as in to be) comprises all its conjugations (is, was, am, are, were, etc.), and contractions of those conjugations.[5] The top 100 lemmas listed below account for 50% of all the words in the Oxford English Corpus.[1]
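The difference between counting word forms and counting lemmas can be sketched as follows. The mapping `LEMMA_OF` is a tiny, hand-written assumption covering only a few forms of "be" and "have"; a real analysis would rely on a full lemmatizer and would also handle contractions.

```python
from collections import Counter

# Hypothetical, hand-written mapping from word forms to lemmas.
LEMMA_OF = {
    "is": "be", "was": "be", "am": "be", "are": "be", "were": "be",
    "has": "have", "had": "have",
}

def count_forms(tokens):
    """Count every distinct word form separately."""
    return Counter(tokens)

def count_lemmas(tokens):
    """Count word forms grouped under their lemma (unknown forms map to themselves)."""
    return Counter(LEMMA_OF.get(t, t) for t in tokens)

tokens = "i am here you are here she was there".split()
print(count_forms(tokens).most_common(1))   # the single most frequent word form
print(count_lemmas(tokens).most_common(1))  # the single most frequent lemma
```

Counted by form, "am", "are", and "was" each appear once; counted by lemma, they combine into three occurrences of "be", which changes the ranking.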
100 most common words
A list of 100 words that occur most frequently in written English is given below, based on an analysis of the Oxford English Corpus (a collection of texts in the English language, comprising over 2 billion words).[1] A part of speech is provided for most of the words, but part-of-speech categories vary between analyses, and not all possibilities are listed. For example, «I» may be a pronoun or a Roman numeral; «to» may be a preposition or an infinitive marker; «time» may be a noun or a verb. Also, a single spelling can represent more than one root word. For example, «singer» may be a form of either «sing» or «singe». Different corpora may treat such cases differently.
The number of distinct senses that are listed in Wiktionary is shown in the polysemy column. For example, «out» can refer to an escape, a removal from play in baseball, or any of 36 other concepts. On average, each word in the list has 15.38 senses. The sense count does not include the use of terms in phrasal verbs such as «put out» (as in «inconvenienced») and other multiword expressions such as the interjection «get out!», where the word «out» does not have an individual meaning.[6] As an example, «out» occurs in at least 560 phrasal verbs[7] and appears in nearly 1700 multiword expressions.[8]
The table also includes frequencies from other corpora. Besides differences in usage, lemmatisation may differ from corpus to corpus, for example by splitting the prepositional use of «to» from its use as a particle. In addition, the Corpus of Contemporary American English (COCA) list uses dispersion as well as frequency to calculate rank.
Word | Parts of speech | OEC rank | COCA rank[9] | Dolch level | Polysemy |
---|---|---|---|---|---|
the | Article | 1 | 1 | Pre-primer | 12 |
be | Verb | 2 | 2 | Primer | 21 |
to | Preposition | 3 | 7, 9 | Pre-primer | 17 |
of | Preposition | 4 | 4 | Grade 1 | 12 |
and | Conjunction | 5 | 3 | Pre-primer | 16 |
a | Article | 6 | 5 | Pre-primer | 20 |
in | Preposition | 7 | 6, 128, 3038 | Pre-primer | 23 |
that | Conjunction et al. | 8 | 12, 27, 903 | Primer | 17 |
have | Verb | 9 | 8 | Primer | 25 |
I | Pronoun | 10 | 11 | Pre-primer | 7 |
it | Pronoun | 11 | 10 | Pre-primer | 18 |
for | Preposition | 12 | 13, 2339 | Pre-primer | 19 |
not | Adverb et al. | 13 | 28, 2929 | Pre-primer | 5 |
on | Preposition | 14 | 17, 155 | Primer | 43 |
with | Preposition | 15 | 16 | Primer | 11 |
he | Pronoun | 16 | 15 | Primer | 7 |
as | Adverb, conjunction, et al. | 17 | 33, 49, 129 | Grade 1 | 17 |
you | Pronoun | 18 | 14 | Pre-primer | 9 |
do | Verb, noun | 19 | 18 | Primer | 38 |
at | Preposition | 20 | 22 | Primer | 14 |
this | Determiner, adverb, noun | 21 | 20, 4665 | Primer | 9 |
but | Preposition, adverb, conjunction | 22 | 23, 1715 | Primer | 17 |
his | Possessive pronoun | 23 | 25, 1887 | Grade 1 | 6 |
by | Preposition | 24 | 30, 1190 | Grade 1 | 19 |
from | Preposition | 25 | 26 | Grade 1 | 4 |
they | Pronoun | 26 | 21 | Primer | 6 |
we | Pronoun | 27 | 24 | Pre-primer | 6 |
say | Verb et al. | 28 | 19 | Primer | 17 |
her | Possessive pronoun | 29, 106 | 42 | Grade 1 | 3 |
she | Pronoun | 30 | 31 | Primer | 7 |
or | Conjunction | 31 | 32 | Grade 2 | 11 |
an | Article | 32 | (a) | Grade 1 | 6 |
will | Verb, noun | 33 | 48, 1506 | Primer | 16 |
my | Possessive pronoun | 34 | 44 | Pre-primer | 5 |
one | Noun, adjective, et al. | 35 | 51, 104, 839 | Pre-primer | 24 |
all | Adjective | 36 | 43, 222 | Primer | 15 |
would | Verb | 37 | 41 | Grade 2 | 13 |
there | Adverb, pronoun, et al. | 38 | 53, 116 | Primer | 14 |
their | Possessive pronoun | 39 | 36 | Grade 2 | 2 |
what | Pronoun, adverb, et al. | 40 | 34 | Primer | 19 |
so | Conjunction, adverb, et al. | 41 | 55, 196 | Primer | 18 |
up | Adverb, preposition, et al. | 42 | 50, 456 | Pre-primer | 50 |
out | Preposition | 43 | 64, 149 | Primer | 38 |
if | Conjunction | 44 | 40 | Grade 3 | 9 |
about | Preposition, adverb, et al. | 45 | 46, 179 | Grade 3 | 18 |
who | Pronoun, noun | 46 | 38 | Primer | 5 |
get | Verb | 47 | 39 | Primer | 37 |
which | Pronoun | 48 | 58 | Grade 2 | 7 |
go | Verb, noun | 49 | 35 | Pre-primer | 54 |
me | Pronoun | 50 | 61 | Pre-primer | 10 |
when | Adverb | 51 | 57, 136 | Grade 1 | 11 |
make | Verb, noun | 52 | 45 | Grade 2 [as «made»] | 48 |
can | Verb, noun | 53 | 37, 2973 | Pre-primer | 18 |
like | Preposition, verb | 54 | 74, 208, 1123, 1684, 2702 | Primer | 26 |
time | Noun | 55 | 52 | Dolch list of 95 nouns | 14 |
no | Determiner, adverb | 56 | 93, 699, 916, 1111, 4555 | Primer | 10 |
just | Adjective | 57 | 66, 1823 | | 14 |
him | Pronoun | 58 | 68 | | 5 |
know | Verb, noun | 59 | 47 | | 13 |
take | Verb, noun | 60 | 63 | | 66 |
people | Noun | 61 | 62 | | 9 |
into | Preposition | 62 | 65 | | 10 |
year | Noun | 63 | 54 | | 7 |
your | Possessive pronoun | 64 | 69 | | 4 |
good | Adjective | 65 | 110, 2280 | | 32 |
some | Determiner, pronoun | 66 | 60 | | 10 |
could | Verb | 67 | 71 | | 6 |
them | Pronoun | 68 | 59 | | 3 |
see | Verb | 69 | 67 | | 25 |
other | Adjective, pronoun | 70 | 75, 715, 2355 | | 12 |
than | Conjunction, preposition | 71 | 73, 712 | | 4 |
then | Adverb | 72 | 77 | | 10 |
now | Preposition | 73 | 72, 1906 | | 13 |
look | Verb | 74 | 85, 604 | | 17 |
only | Adverb | 75 | 101, 329 | | 11 |
come | Verb | 76 | 70 | | 20 |
its | Possessive pronoun | 77 | 78 | | 2 |
over | Preposition | 78 | 124, 182 | | 19 |
think | Verb | 79 | 56 | | 10 |
also | Adverb | 80 | 87 | | 2 |
back | Noun, adverb | 81 | 108, 323, 1877 | | 36 |
after | Preposition | 82 | 120, 260 | | 14 |
use | Verb, noun | 83 | 92, 429 | | 17 |
two | Noun | 84 | 80 | | 6 |
how | Adverb | 85 | 76 | | 11 |
our | Possessive pronoun | 86 | 79 | | 3 |
work | Verb, noun | 87 | 117, 199 | | 28 |
first | Adjective | 88 | 86, 2064 | | 10 |
well | Adverb | 89 | 100, 644 | | 30 |
way | Noun, adverb | 90 | 84, 4090 | | 16 |
even | Adjective | 91 | 107, 484 | | 23 |
new | Adjective et al. | 92 | 88 | | 18 |
want | Verb | 93 | 83 | | 10 |
because | Conjunction | 94 | 89, 509 | | 7 |
any | Pronoun | 95 | 109, 4720 | | 4 |
these | Pronoun | 96 | 82 | | 2 |
give | Verb | 97 | 98 | | 19 |
day | Noun | 98 | 90 | | 9 |
most | Adverb | 99 | 144, 187 | | 12 |
us | Pronoun | 100 | 113 | | 6 |
Parts of speech
The following is a very similar list, subdivided by part of speech.[1] The list labeled «Others» includes pronouns, possessives, articles, modal verbs, adverbs, and conjunctions.
Rank | Nouns | Verbs | Adjectives | Prepositions | Others |
---|---|---|---|---|---|
1 | time | be | good | to | the |
2 | person | have | new | of | and |
3 | year | do | first | in | a |
4 | way | say | last | for | that |
5 | day | get | long | on | I |
6 | thing | make | great | with | it |
7 | man | go | little | at | not |
8 | world | know | own | by | he |
9 | life | take | other | from | as |
10 | hand | see | old | up | you |
11 | part | come | right | about | this |
12 | child | think | big | into | but |
13 | eye | look | high | over | his |
14 | woman | want | different | after | they |
15 | place | give | small | | her |
16 | work | use | large | | she |
17 | week | find | next | | or |
18 | case | tell | early | | an |
19 | point | ask | young | | will |
20 | government | work | important | | my |
21 | company | seem | few | | one |
22 | number | feel | public | | all |
23 | group | try | bad | | would |
24 | problem | leave | same | | there |
25 | fact | call | able | | their |
See also
- Basic English
- Frequency analysis, the study of the frequency of letters or groups of letters
- Letter frequencies
- Oxford English Corpus
- Swadesh list, a compilation of basic concepts for the purpose of historical-comparative linguistics
- Zipf’s law, a theory stating that the frequency of any word is inversely proportional to its rank in a frequency table
Word lists
- Dolch Word List, a list of frequently used English words
- General Service List
- Word lists by frequency
References
- ^ a b c d «The Oxford English Corpus: Facts about the language». OxfordDictionaries.com. Oxford University Press. What is the commonest word?. Archived from the original on December 26, 2011. Retrieved June 22, 2011.
- ^ «The Oxford English Corpus». AskOxford.com. Archived from the original on May 4, 2006. Retrieved June 22, 2006.
- ^ The First 100 Most Commonly Used English Words Archived 2013-06-16 at the Wayback Machine.
- ^ Bill Bryson, The Mother Tongue: English and How It Got That Way, Harper Perennial, 2001, page 58
- ^ Benjamin Zimmer. June 22, 2006. Time after time after time…. Language Log. Retrieved June 22, 2006.
- ^ Benjamin, Martin (2019). «Polysemy in top 100 Oxford English Corpus words within Wiktionary». Teach You Backwards. Retrieved December 28, 2019.
- ^ Garcia-Vega, M (2010). «Teasing out the meaning of "out"». 29th International Conference on Lexis and Grammar.
- ^ «out — English-French Dictionary». www.wordreference.com. Retrieved November 22, 2022.
- ^ «Word frequency: based on 450 million word COCA corpus». www.wordfrequency.info. Retrieved April 11, 2018.
What is a popular word finder?
With this online tool, you can find the most common words in any text. The program runs through all the words in the text and prints the count of their occurrences in the output. The information about the most popular words often gives clues about the topic, language, and purpose of the text. For example, if the most common words are «disco», «music», and «dance», then it is most likely a text about dancing. If the most common words are «the», «a», and «is», then the text is most likely written in English, and if the most common words are «di», «che», and «la», then the text is most likely in Italian. Even the information about a single word can tell a lot about the text. For example, if the most popular word in the text is «charity», then most likely the purpose of the text is to help those who need it.
In addition to printing single-word statistics, this tool can also analyze the frequency of multi-word phrases in the text. You can choose to analyze combinations of two, three, or more words, and the tool will display the distribution of n-word groups. For example, if the input text is «Owls hoot in the dark.» then the program will generate four word pairs (also called word bigrams): «Owls hoot», «hoot in», «in the», and «the dark». If the word group size is 3 (called word trigrams), then there will be three word triplets: «Owls hoot in», «hoot in the», and «in the dark».
The «Stop at Sentence Boundary» option controls whether word groups may span neighboring sentences. For example, if the input text is «Long cat is red. Short cat is black.», then with this option on, the bigrams would be «Long cat», «cat is», «is red», «Short cat», «cat is», «is black». But with this option off, the full stop is ignored and the bigrams would be «Long cat», «cat is», «is red», «red Short», «Short cat», «cat is», «is black».
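The n-gram behaviour described above can be sketched in a few lines of Python. This is not the tool's actual implementation: the function name `word_ngrams`, the parameter `stop_at_sentence_boundary` (named after the option), and the simplification that sentences end at «.», «!», or «?» are all assumptions.

```python
import re

def word_ngrams(text, n=2, stop_at_sentence_boundary=True):
    """Return word n-grams, optionally without crossing sentence boundaries."""
    if stop_at_sentence_boundary:
        # Simplified sentence splitting (an assumption): break at . ! ?
        chunks = re.split(r"[.!?]+", text)
    else:
        # Treat the whole text as one stream of words.
        chunks = [text]
    ngrams = []
    for chunk in chunks:
        words = re.findall(r"[A-Za-z']+", chunk)
        # Slide a window of n words over the chunk.
        for i in range(len(words) - n + 1):
            ngrams.append(" ".join(words[i:i + n]))
    return ngrams

print(word_ngrams("Owls hoot in the dark.", n=2))
print(word_ngrams("Long cat is red. Short cat is black.", n=2,
                  stop_at_sentence_boundary=False))
```

With the option off, the second call also produces the pair «red Short», which joins the end of one sentence to the start of the next.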
In this example, the words maintained their sentence case but by enabling the «Ignore Word Case» option, you can analyze all words in lowercase. You can also exclude or replace punctuation marks in the text before analysis. For example, if a word is wrapped in parentheses «(owl)» then you can remove the parentheses by entering them in the «Punctuation to Delete» option. If a word contains internal punctuation, such as hyphenation in the word «full-scale», then you can replace the hyphen «-» with a space and analyze this word as two separate words «full» and «scale». In addition to the total number of words in the text, you can also display their usage percentage and print a fractional representation of each word’s number of uses relative to the total number of words in the text. Additionally, you can sort the output words alphabetically or by the usage counts. Textabulous!
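A minimal sketch of these preprocessing options and the percentage output might look like the following. The function `word_stats` and its parameters `ignore_case`, `delete`, and `replace_with_space` are hypothetical names chosen to mirror the options described above, not the tool's real interface.

```python
from collections import Counter

def word_stats(text, ignore_case=True, delete="", replace_with_space=""):
    """Return (word, count, percentage) rows sorted by descending count."""
    # Remove the characters listed in `delete` entirely.
    for ch in delete:
        text = text.replace(ch, "")
    # Turn the characters listed in `replace_with_space` into word separators.
    for ch in replace_with_space:
        text = text.replace(ch, " ")
    if ignore_case:
        text = text.lower()
    words = text.split()
    total = len(words)
    counts = Counter(words)
    # An empty text yields an empty list, so no division by zero occurs.
    return [(w, c, 100.0 * c / total) for w, c in counts.most_common()]

for row in word_stats("(Owl) owl full-scale", delete="()", replace_with_space="-"):
    print(row)
```

Here the parentheses are deleted, the hyphen splits «full-scale» into two words, and «Owl» and «owl» are counted together because case is ignored.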
The challenge
Write a function that, given a string of text (possibly with punctuation and line-breaks), returns an array of the top-3 most occurring words, in descending order of the number of occurrences.
Assumptions:
- A word is a string of letters (A to Z) optionally containing one or more apostrophes (') in ASCII. (No need to handle fancy punctuation.)
- Matches should be case-insensitive, and the words in the result should be lowercased.
- Ties may be broken arbitrarily.
- If a text contains fewer than three unique words, then either the top-2 or top-1 words should be returned, or an empty array if a text contains no words.
Examples:
top_3_words("In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.")
# => ["a", "of", "on"]
top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")
# => ["e", "ddd", "aa"]
top_3_words(" //wont won't won't")
# => ["won't", "wont"]
Bonus points:
- Avoid creating an array whose memory footprint is roughly as big as the input text.
- Avoid sorting the entire array of unique words.
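One possible way to approach the bonus points is sketched below; it is an illustration under stated assumptions rather than a reference solution. The name `top_3_words_lean` is hypothetical. The sketch scans the text lazily with `re.finditer`, so no list of all tokens is built, and it selects the three largest counts with `heapq.nlargest` instead of sorting every unique word.

```python
import heapq
import re
from collections import Counter

# A word is a run of ASCII letters and apostrophes.
WORD = re.compile(r"[A-Za-z']+")

def top_3_words_lean(text):
    counts = Counter()
    # finditer yields matches one at a time instead of building a token list.
    for match in WORD.finditer(text):
        word = match.group().lower()
        # Skip tokens that consist only of apostrophes.
        if word.strip("'"):
            counts[word] += 1
    # nlargest keeps only three candidates at a time; no full sort is needed.
    top = heapq.nlargest(3, counts.items(), key=lambda item: item[1])
    return [word for word, _ in top]

print(top_3_words_lean("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e"))
```

The dictionary of counts still grows with the number of unique words, which is usually much smaller than the text itself.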
Test cases
from random import choice, randint, sample, shuffle, choices
import re
from collections import Counter
import codewars_test as test  # Codewars test framework

def check(s, this=None):  # this: only for debugging purposes
    returned_result = top_3_words(s) if this is None else this
    # reference count: runs of letters/apostrophes, excluding apostrophe-only tokens
    fs = Counter(w for w in re.findall(r"[a-zA-Z']+", s.lower()) if w != "'" * len(w))
    exp, expected_frequencies = map(list, zip(*fs.most_common(3))) if fs else ([], [])
    msg = ''
    wrong_words = [w for w in returned_result if not fs[w]]
    actual_freq = [fs[w] for w in returned_result]
    if wrong_words:
        msg = 'Incorrect match: words not present in the string. Your output: {}. One possible valid answer: {}'.format(returned_result, exp)
    elif len(set(returned_result)) != len(returned_result):
        msg = 'The result should not contain copies of the same word. Your output: {}. One possible output: {}'.format(returned_result, exp)
    elif actual_freq != expected_frequencies:
        msg = "Incorrect frequencies: {} should be {}. Your output: {}. One possible output: {}".format(actual_freq, expected_frequencies, returned_result, exp)
    test.expect(not msg, msg)
@test.describe("Fixed tests")
def fixed_tests():
    TESTS = (
        "a a a b c c d d d d e e e e e",
        "e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e",
        " //wont won't won't ",
        " , e .. ",
        " ... ",
        " ' ",
        " ''' ",
        """In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.""",
        "a a a b c c X",
        "a a c b b",
    )
    for s in TESTS:
        check(s)
@test.describe("Random tests")
def random_tests():
    def gen_word():
        return "".join(choice("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'") for _ in range(randint(3, 10)))

    def gen_string():
        words = []
        nums = choices(range(1, 31), k=20)
        for _ in range(randint(0, 20)):
            words += [gen_word()] * nums.pop()
        shuffle(words)
        s = ""
        while words:
            s += words.pop() + "".join(choice("-,.?!_:;/ ") for _ in range(randint(1, 5)))
        return s

    @test.it("Tests")
    def it_1():
        for _ in range(100):
            check(gen_string())
The solution using Python
Option 1:
# Counter tallies how often each item occurs
from collections import Counter
# re provides regular expressions
import re

def top_3_words(text):
    # lowercase the text, extract runs of letters and apostrophes,
    # and drop tokens that consist only of apostrophes
    words = (w for w in re.findall(r"[a-z']+", text.lower()) if w.strip("'"))
    # return the 3 most common words, most frequent first
    return [w for w, _ in Counter(words).most_common(3)]
Option 2:
def top_3_words(text):
    # loop through each character in the string
    for c in text:
        # if it's not a letter or an apostrophe
        if not (c.isalpha() or c == "'"):
            # replace it with a space
            text = text.replace(c, ' ')
    # parallel lists of words and their counts, plus the result list
    words, counts, out = [], [], []
    # loop through the whitespace-separated words in the lowercased text
    for word in text.lower().split():
        # skip tokens that contain no letters (apostrophes only)
        if all(not c.isalpha() for c in word):
            continue
        # if the word has been seen before
        if word in words:
            # increment its count
            counts[words.index(word)] += 1
        else:
            # otherwise create a new entry with a count of 1
            words.append(word)
            counts.append(1)
    # take up to 3 words, most frequent first
    while len(words) > 0 and len(out) < 3:
        # move the word with the highest count to the result
        i = counts.index(max(counts))
        out.append(words.pop(i))
        counts.pop(i)
    # return the words
    return out
Option 3:
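def top_3_words(text):
    wrds = {}
    # replace every ASCII punctuation character except the apostrophe with a space
    for p in r'!"#$%&()*+,./:;<=>?@[\]^_`{|}~-':
        text = text.replace(p, ' ')
    for w in text.lower().split():
        # ignore tokens made up solely of apostrophes
        if w.replace("'", '') != '':
            wrds[w] = wrds.get(w, 0) + 1
    # sort by count, highest first, and keep the first three words
    return [y[0] for y in sorted(wrds.items(), key=lambda x: x[1], reverse=True)[:3]]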
WordCounter analyzes your text and tells you the most common words and phrases.
This tool helps you count words, bigrams, and trigrams in plain text. This is often the first step in quantitative text analysis.