Most used word in text

From Wikipedia, the free encyclopedia

Studies that estimate and rank the most common words in English examine texts written in English. Perhaps the most comprehensive such analysis is one that was conducted against the Oxford English Corpus (OEC), a massive text corpus that is written in the English language.

In total, the texts in the Oxford English Corpus contain more than 2 billion words.^[1] The OEC includes a wide variety of writing samples, such as literary works, novels, academic journals, newspapers, magazines, Hansard’s Parliamentary Debates, blogs, chat logs, and emails.^[2]

Another English corpus that has been used to study word frequency is the Brown Corpus, which was compiled by researchers at Brown University in the 1960s. The researchers published their analysis of the Brown Corpus in 1967. Their findings were similar, but not identical, to the findings of the OEC analysis.

According to The Reading Teacher’s Book of Lists, the first 25 words in the OEC make up about one-third of all printed material in English, and the first 100 words make up about half of all written English.^[3] According to a study cited by Robert McCrum in The Story of English, all of the first hundred of the most common words in English are of Old English origin,^[4] except for «people», ultimately from Latin «populus», and «because», in part from Latin «causa».
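
The coverage idea above (how much of a text the top N words account for) can be checked on any sample. The following is a minimal sketch, not the method used by the cited studies: the function name `coverage_of_top_words` and the simple letters-and-apostrophes tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def coverage_of_top_words(text, n):
    """Return the fraction of all word tokens covered by the n most frequent words."""
    # Assumed tokenizer: runs of ASCII letters and apostrophes, lowercased.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(words)


sample = "the cat and the dog and the bird"
# "the" (3) and "and" (2) cover 5 of the 8 tokens.
print(coverage_of_top_words(sample, 2))
```

On a large, varied corpus the same calculation with n = 25 or n = 100 would be expected to approach the one-third and one-half figures quoted above, though the exact values depend on the corpus and the tokenizer.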

Some lists of common words distinguish between word forms, while others rank all forms of a word as a single lexeme (the form of the word as it would appear in a dictionary). For example, the lexeme be (as in to be) comprises all its conjugations (is, was, am, are, were, etc.), and contractions of those conjugations.^[5] These top 100 lemmas listed below account for 50% of all the words in the Oxford English Corpus.^[1]
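
The difference between counting word forms and counting lemmas can be sketched in a few lines. The mapping `LEMMA_OF` below is a tiny hand-written table used only for illustration; a real analysis would rely on a full lemmatizer, and the OEC's own lemmatisation rules are not reproduced here.

```python
from collections import Counter

# Hypothetical, hand-written mapping from word forms to lemmas (illustration only).
LEMMA_OF = {
    "is": "be", "was": "be", "am": "be", "are": "be", "were": "be",
    "has": "have", "had": "have",
}


def count_forms_and_lemmas(tokens):
    """Count tokens twice: once by surface form, once grouped under their lemma."""
    forms = Counter(tokens)
    # Forms missing from the table are treated as their own lemma.
    lemmas = Counter(LEMMA_OF.get(token, token) for token in tokens)
    return forms, lemmas


tokens = "i am here and you are here and she was here".split()
forms, lemmas = count_forms_and_lemmas(tokens)
print(forms["am"], forms["are"], forms["was"])  # each form counted separately
print(lemmas["be"])                             # all three grouped under "be"
```

Grouping by lemma is why «be» can rank second in the list even though no single form of it would necessarily rank that high on its own.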

100 most common words

A list of 100 words that occur most frequently in written English is given below, based on an analysis of the Oxford English Corpus (a collection of texts in the English language, comprising over 2 billion words).^[1] A part of speech is provided for most of the words, but part-of-speech categories vary between analyses, and not all possibilities are listed. For example, «I» may be a pronoun or a Roman numeral; «to» may be a preposition or an infinitive marker; «time» may be a noun or a verb. Also, a single spelling can represent more than one root word. For example, «singer» may be a form of either «sing» or «singe». Different corpora may treat such differences differently.

The number of distinct senses that are listed in Wiktionary is shown in the polysemy column. For example, «out» can refer to an escape, a removal from play in baseball, or any of 36 other concepts. On average, each word in the list has 15.38 senses. The sense count does not include the use of terms in phrasal verbs such as «put out» (as in «inconvenienced») and other multiword expressions such as the interjection «get out!», where the word «out» does not have an individual meaning.^[6] As an example, «out» occurs in at least 560 phrasal verbs^[7] and appears in nearly 1700 multiword expressions.^[8]

The table also includes frequencies from other corpora. Note that as well as usage differences, lemmatisation may differ from corpus to corpus – for example splitting the prepositional use of «to» from the use as a particle. Also the Corpus of Contemporary American English (COCA) list includes dispersion as well as frequency to calculate rank.

Word	Parts of speech	OEC rank	COCA rank^[9]	Dolch level	Polysemy
the	Article	1	1	Pre-primer	12
be	Verb	2	2	Primer	21
to	Preposition	3	7, 9	Pre-primer	17
of	Preposition	4	4	Grade 1	12
and	Conjunction	5	3	Pre-primer	16
a	Article	6	5	Pre-primer	20
in	Preposition	7	6, 128, 3038	Pre-primer	23
that	Conjunction et al.	8	12, 27, 903	Primer	17
have	Verb	9	8	Primer	25
I	Pronoun	10	11	Pre-primer	7
it	Pronoun	11	10	Pre-primer	18
for	Preposition	12	13, 2339	Pre-primer	19
not	Adverb et al.	13	28, 2929	Pre-primer	5
on	Preposition	14	17, 155	Primer	43
with	Preposition	15	16	Primer	11
he	Pronoun	16	15	Primer	7
as	Adverb, conjunction, et al.	17	33, 49, 129	Grade 1	17
you	Pronoun	18	14	Pre-primer	9
do	Verb, noun	19	18	Primer	38
at	Preposition	20	22	Primer	14
this	Determiner, adverb, noun	21	20, 4665	Primer	9
but	Preposition, adverb, conjunction	22	23, 1715	Primer	17
his	Possessive pronoun	23	25, 1887	Grade 1	6
by	Preposition	24	30, 1190	Grade 1	19
from	Preposition	25	26	Grade 1	4
they	Pronoun	26	21	Primer	6
we	Pronoun	27	24	Pre-primer	6
say	Verb et al.	28	19	Primer	17
her	Possessive pronoun	29, 106	42	Grade 1	3
she	Pronoun	30	31	Primer	7
or	Conjunction	31	32	Grade 2	11
an	Article	32	(a)	Grade 1	6
will	Verb, noun	33	48, 1506	Primer	16
my	Possessive pronoun	34	44	Pre-primer	5
one	Noun, adjective, et al.	35	51, 104, 839	Pre-primer	24
all	Adjective	36	43, 222	Primer	15
would	Verb	37	41	Grade 2	13
there	Adverb, pronoun, et al.	38	53, 116	Primer	14
their	Possessive pronoun	39	36	Grade 2	2
what	Pronoun, adverb, et al.	40	34	Primer	19
so	Conjunction, adverb, et al.	41	55, 196	Primer	18
up	Adverb, preposition, et al.	42	50, 456	Pre-primer	50
out	Preposition	43	64, 149	Primer	38
if	Conjunction	44	40	Grade 3	9
about	Preposition, adverb, et al.	45	46, 179	Grade 3	18
who	Pronoun, noun	46	38	Primer	5
get	Verb	47	39	Primer	37
which	Pronoun	48	58	Grade 2	7
go	Verb, noun	49	35	Pre-primer	54
me	Pronoun	50	61	Pre-primer	10
when	Adverb	51	57, 136	Grade 1	11
make	Verb, noun	52	45	Grade 2 [as «made»]	48
can	Verb, noun	53	37, 2973	Pre-primer	18
like	Preposition, verb	54	74, 208, 1123, 1684, 2702	Primer	26
time	Noun	55	52	Dolch list of 95 nouns	14
no	Determiner, adverb	56	93, 699, 916, 1111, 4555	Primer	10
just	Adjective	57	66, 1823		14
him	Pronoun	58	68		5
know	Verb, noun	59	47		13
take	Verb, noun	60	63		66
people	Noun	61	62		9
into	Preposition	62	65		10
year	Noun	63	54		7
your	Possessive pronoun	64	69		4
good	Adjective	65	110, 2280		32
some	Determiner, pronoun	66	60		10
could	Verb	67	71		6
them	Pronoun	68	59		3
see	Verb	69	67		25
other	Adjective, pronoun	70	75, 715, 2355		12
than	Conjunction, preposition	71	73, 712		4
then	Adverb	72	77		10
now	Preposition	73	72, 1906		13
look	Verb	74	85, 604		17
only	Adverb	75	101, 329		11
come	Verb	76	70		20
its	Possessive pronoun	77	78		2
over	Preposition	78	124, 182		19
think	Verb	79	56		10
also	Adverb	80	87		2
back	Noun, adverb	81	108, 323, 1877		36
after	Preposition	82	120, 260		14
use	Verb, noun	83	92, 429		17
two	Noun	84	80		6
how	Adverb	85	76		11
our	Possessive pronoun	86	79		3
work	Verb, noun	87	117, 199		28
first	Adjective	88	86, 2064		10
well	Adverb	89	100, 644		30
way	Noun, adverb	90	84, 4090		16
even	Adjective	91	107, 484		23
new	Adjective et al.	92	88		18
want	Verb	93	83		10
because	Conjunction	94	89, 509		7
any	Pronoun	95	109, 4720		4
these	Pronoun	96	82		2
give	Verb	97	98		19
day	Noun	98	90		9
most	Adverb	99	144, 187		12
us	Pronoun	100	113		6

Parts of speech

The following is a very similar list, subdivided by part of speech.^[1] The list labeled «Others» includes pronouns, possessives, articles, modal verbs, adverbs, and conjunctions.

Rank	Nouns	Verbs	Adjectives	Prepositions	Others
1	time	be	good	to	the
2	person	have	new	of	and
3	year	do	first	in	a
4	way	say	last	for	that
5	day	get	long	on	I
6	thing	make	great	with	it
7	man	go	little	at	not
8	world	know	own	by	he
9	life	take	other	from	as
10	hand	see	old	up	you
11	part	come	right	about	this
12	child	think	big	into	but
13	eye	look	high	over	his
14	woman	want	different	after	they
15	place	give	small		her
16	work	use	large		she
17	week	find	next		or
18	case	tell	early		an
19	point	ask	young		will
20	government	work	important		my
21	company	seem	few		one
22	number	feel	public		all
23	group	try	bad		would
24	problem	leave	same		there
25	fact	call	able		their

References

^ ^a ^b ^c ^d «The Oxford English Corpus: Facts about the language». OxfordDictionaries.com. Oxford University Press. What is the commonest word?. Archived from the original on December 26, 2011. Retrieved June 22, 2011.
^ «The Oxford English Corpus». AskOxford.com. Archived from the original on May 4, 2006. Retrieved June 22, 2006.
^ The First 100 Most Commonly Used English Words Archived 2013-06-16 at the Wayback Machine.
^ Bill Bryson, The Mother Tongue: English and How It Got That Way, Harper Perennial, 2001, page 58
^ Benjamin Zimmer. June 22, 2006. Time after time after time…. Language Log. Retrieved June 22, 2006.
^ Benjamin, Martin (2019). «Polysemy in top 100 Oxford English Corpus words within Wiktionary». Teach You Backwards. Retrieved December 28, 2019.
^ Garcia-Vega, M (2010). «Teasing out the meaning of "out"». 29th International Conference on Lexis and Grammar.
^ «out — English-French Dictionary». www.wordreference.com. Retrieved November 22, 2022.
^ «Word frequency: based on 450 million word COCA corpus». www.wordfrequency.info. Retrieved April 11, 2018.

What is a popular word finder?

With this online tool, you can find the most common words in any text. The program runs through all the words in the text and prints the number of times each one occurs. The most popular words often give clues about the topic, language, and purpose of the text. For example, if the most common words are «disco», «music», and «dance», then the text is most likely about dancing. If the most common words are «the», «a», and «is», then the text is most likely written in English, and if the most common words are «di», «che», and «la», then the text is most likely in Italian. Even a single word can tell a lot about the text. For example, if the most popular word in the text is «charity», then the purpose of the text is most likely to help those in need.

In addition to printing single-word statistics, this tool can also analyze the frequency of multi-word phrases in the text. You can choose to analyze combinations of two, three, or more words, and the tool will display the distribution of n-word groups. For example, if the input text is «Owls hoot in the dark.» then the program will generate four word pairs (also called word bigrams): «Owls hoot», «hoot in», «in the», and «the dark». If the word group size is 3 (word trigrams), then there will be three word triplets: «Owls hoot in», «hoot in the», and «in the dark».

The «Stop at Sentence Boundary» option controls whether words from neighboring sentences are joined into one stream to form groups. For example, if the input text is «Long cat is red. Short cat is black.», then with this option on, the bigrams would be «Long cat», «cat is», «is red», «Short cat», «cat is», «is black». With this option off, the full stop is ignored and the bigrams would be «Long cat», «cat is», «is red», «red Short», «Short cat», «cat is», «is black».

In this example, the words kept their original case, but by enabling the «Ignore Word Case» option, you can analyze all words in lowercase. You can also exclude or replace punctuation marks in the text before analysis. For example, if a word is wrapped in parentheses, as in «(owl)», then you can remove the parentheses by entering them in the «Punctuation to Delete» option. If a word contains internal punctuation, such as the hyphen in «full-scale», then you can replace the hyphen «-» with a space and analyze this word as two separate words, «full» and «scale».

In addition to the total number of words in the text, you can also display their usage percentage and print a fractional representation of each word’s number of uses relative to the total number of words in the text. Additionally, you can sort the output words alphabetically or by usage count. Textabulous!
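
The n-gram behaviour described above can be approximated with a short function. This is a sketch under stated assumptions, not the tool's actual code: the function name `word_ngrams`, its parameter names, and the choice to treat «.», «!» and «?» as sentence boundaries are all hypothetical.

```python
import re
from collections import Counter


def word_ngrams(text, n=2, stop_at_sentence_boundary=True, ignore_case=False):
    """Count n-word groups, optionally without crossing sentence boundaries."""
    if ignore_case:
        text = text.lower()
    # Assumption: a sentence ends at '.', '!' or '?'. With the option off,
    # the whole text is handled as one stream of words.
    chunks = re.split(r"[.!?]+", text) if stop_at_sentence_boundary else [text]
    counts = Counter()
    for chunk in chunks:
        # Words are runs of letters and apostrophes; other punctuation is dropped.
        words = re.findall(r"[A-Za-z']+", chunk)
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


text = "Long cat is red. Short cat is black."
with_boundary = word_ngrams(text, n=2, stop_at_sentence_boundary=True)
without_boundary = word_ngrams(text, n=2, stop_at_sentence_boundary=False)
print("red Short" in with_boundary)     # False: the pair spans two sentences
print("red Short" in without_boundary)  # True: the full stop is ignored
```

Dividing each count by `sum(counts.values())` would give the usage percentage mentioned above.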

The challenge

Write a function that, given a string of text (possibly with punctuation and line-breaks), returns an array of the top-3 most occurring words, in descending order of the number of occurrences.

Assumptions:

A word is a string of letters (A to Z) optionally containing one or more apostrophes (') in ASCII. (No need to handle fancy punctuation.)
Matches should be case-insensitive, and the words in the result should be lowercased.
Ties may be broken arbitrarily.
If a text contains fewer than three unique words, then either the top-2 or top-1 words should be returned, or an empty array if a text contains no words.

Examples:

top_3_words("In a village of La Mancha, the name of which I have no desire to call to
mind, there lived not long since one of those gentlemen that keep a lance
in the lance-rack, an old buckler, a lean hack, and a greyhound for
coursing. An olla of rather more beef than mutton, a salad on most
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
on Sundays, made away with three-quarters of his income.")
# => ["a", "of", "on"]

top_3_words("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e")
# => ["e", "ddd", "aa"]

top_3_words("  //wont won't won't")
# => ["won't", "wont"]

Bonus points:

Avoid creating an array whose memory footprint is roughly as big as the input text.
Avoid sorting the entire array of unique words.
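
One way to approach both bonus points is sketched below. It is an illustration rather than a reference solution: the name `top_3_words_streaming` is hypothetical. `re.finditer` yields matches one at a time, so no list of all tokens is built, and `heapq.nlargest` keeps only three candidates instead of sorting every unique word.

```python
import heapq
import re
from collections import Counter


def top_3_words_streaming(text):
    """Return up to three most frequent words without a full token list or full sort."""
    counts = Counter()
    # finditer is lazy: each match is processed and discarded in turn.
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group().lower()
        # Skip tokens made only of apostrophes; they are not words.
        if word.strip("'"):
            counts[word] += 1
    # nlargest tracks just three candidates while scanning the unique words.
    return heapq.nlargest(3, counts, key=counts.get)


print(top_3_words_streaming("e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e"))
print(top_3_words_streaming("  //wont won't won't"))
```

The dictionary of counts still grows with the number of unique words, which is usually much smaller than the text itself.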

Test cases

from random import choice, randint, sample, shuffle, choices
import re
from collections import Counter


def check(s, this=None):                                            # this: only for debugging purpose
    returned_result = top_3_words(s) if this is None else this
    fs = Counter(w for w in re.findall(r"[a-zA-Z']+", s.lower()) if w != "'" * len(w))
    exp,expected_frequencies = map(list,zip(*fs.most_common(3))) if fs else ([],[])
    
    msg = ''
    wrong_words = [w for w in returned_result if not fs[w]]
    actual_freq = [fs[w] for w in returned_result]
    
    if wrong_words:
        msg = 'Incorrect match: words not present in the string. Your output: {}. One possible valid answer: {}'.format(returned_result, exp)
    elif len(set(returned_result)) != len(returned_result):
        msg = 'The result should not contain copies of the same word. Your output: {}. One possible output: {}'.format(returned_result, exp)
    elif actual_freq!=expected_frequencies:
        msg = "Incorrect frequencies: {} should be {}. Your output: {}. One possible output: {}".format(actual_freq, expected_frequencies, returned_result, exp)
    
    Test.expect(not msg, msg)



@test.describe("Fixed tests")
def fixed_tests():

    TESTS = (
    "a a a  b  c c  d d d d  e e e e e",
    "e e e e DDD ddd DdD: ddd ddd aa aA Aa, bb cc cC e e e",
    "  //wont won't won't ",
    "  , e   .. ",
    "  ...  ",
    "  '  ",
    "  '''  ",
    """In a village of La Mancha, the name of which I have no desire to call to
    mind, there lived not long since one of those gentlemen that keep a lance
    in the lance-rack, an old buckler, a lean hack, and a greyhound for
    coursing. An olla of rather more beef than mutton, a salad on most
    nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra
    on Sundays, made away with three-quarters of his income.""",
    "a a a  b  c c X",
    "a a c b b",
    )
    for s in TESTS: check(s)
    
@test.describe("Random tests")
def random_tests():
    
    def gen_word():
        return "".join(choice("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'") for _ in range(randint(3, 10)))
    
    def gen_string():
        words = []
        nums = choices(range(1, 31), k=20)
        for _ in range(randint(0, 20)):
            words += [gen_word()] * nums.pop()
        shuffle(words)
        s = ""
        while words:
            s += words.pop() + "".join(choice("-,.?!_:;/ ") for _ in range(randint(1, 5)))
        return s
    
    @test.it("Tests")
    def it_1():
        for _ in range(100): check(gen_string())

The solution using Python

Option 1:

# use the Counter module
from collections import Counter
# use the regex module
import re

def top_3_words(text):
    # lowercase the text and extract runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    # count only tokens that contain at least one letter
    c = Counter(w for w in tokens if w.strip("'"))
    # return the `most common` 3 items
    return [w for w,_ in c.most_common(3)]

Option 2:

def top_3_words(text):
    # loop through each character in the string
    for c in text:
        # if it's neither a letter nor an apostrophe
        if not (c.isalpha() or c=="'"):
            # replace it with a space
            text = text.replace(c,' ')
    # parallel lists of words and their counts, plus the result list
    words,counts,out = [],[],[]

    # loop through the lowercased words in the text
    for word in text.lower().split():
        # skip tokens that contain no letters (apostrophes only)
        if all([not c.isalpha() for c in word]):
            continue
        # if the word has been seen before
        if word in words:
            # increment its count
            counts[words.index(word)] += 1
        else:
            # otherwise create a new entry with a count of 1
            words.append(word); counts.append(1)

    # take up to three words, most frequent first
    while len(words)>0 and len(out)<3:
        # move the word with the highest remaining count to the result
        out.append(words.pop(counts.index(max(counts))))
        counts.remove(max(counts))
    # return the top words
    return out

Option 3:

def top_3_words(text):
    wrds = {}
    for p in r'!"#$%&()*+,./:;<=>?@[\]^_`{|}~-':
        text = text.replace(p, ' ')
    for w in text.lower().split():
        if w.replace("'", '') != '':
            wrds[w] = wrds.get(w, 0) + 1
    return [y[0] for y in sorted(wrds.items(), key=lambda x: x[1], reverse=True)[:3]]

WordCounter analyzes your text and tells you the most common words and phrases.

This tool helps you count words, bigrams, and trigrams in plain text. This is often the first step in quantitative text analysis.
