Determine if a word is English

I want to check in a Python program if a word is in the English dictionary.

I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.

def is_english_word(word):
    pass # how do I implement is_english_word?

is_english_word(token.lower())

In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> english word). How would I achieve that?

asked Sep 24, 2010 at 16:01 by Barthelemy

For (much) more power and flexibility, use a dedicated spellchecking library like PyEnchant. There’s a tutorial, or you could just dive straight in:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
>>>

PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.

There appears to be a pluralisation library called inflect, but I’ve no idea whether it’s any good.
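If you do end up trying inflect for the singular-form check from the question, a minimal sketch might look like this (based on its engine/singular_noun API; treat the exact output as unverified):

>>> import inflect
>>> p = inflect.engine()
>>> p.singular_noun("properties")  # returns the singular form, or False if already singular
'property'
>>> p.singular_noun("property")
False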

answered Sep 24, 2010 at 16:26 by Katriel

This won't work well with WordNet, because WordNet does not contain all English words.
Another possibility based on NLTK, without enchant, is NLTK's words corpus:

>>> from nltk.corpus import words
>>> "would" in words.words()
True
>>> "could" in words.words()
True
>>> "should" in words.words()
True
>>> "I" in words.words()
True
>>> "you" in words.words()
True

answered Jan 28, 2014 at 8:38 by Sadık

Using NLTK:

from nltk.corpus import wordnet

if not wordnet.synsets(word_to_test):
    print("Not an English word")
else:
    print("English word")

You should refer to this article if you have trouble installing wordnet or want to try other approaches.

answered Mar 18, 2011 at 11:29 by Susheel Javadi

Use a set to store the word list, because looking words up in a set is much faster:

with open("english_words.txt") as word_file:
    english_words = set(word.strip().lower() for word in word_file)

def is_english_word(word):
    return word.lower() in english_words

print(is_english_word("ham"))  # should be True if you have a good english_words.txt

To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I’d just include the plurals in the word list to begin with.
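If you do want to fall back to the singular form before the lookup (the properties -> property case from the question), NLTK's WordNet morphy function is one option. A rough sketch, reusing the english_words set built above:

from nltk.corpus import wordnet

def is_english_word_or_plural(word):
    word = word.lower()
    if word in english_words:
        return True
    # morphy tries to reduce a word to a base form known to WordNet,
    # e.g. "properties" -> "property"; it returns None if it can't.
    base = wordnet.morphy(word)
    return base is not None and base in english_words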

As to where to find English word lists, I found several just by Googling "English word list". Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English word lists if you specifically want one of those dialects.

answered Sep 24, 2010 at 16:12 by kindall

For All Linux/Unix Users

If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There are also more specific american-english and british-english files. These contain all of the words in that specific language. You can access this from any programming language, which is why I thought you might want to know about it.

Now, for Python users specifically, the code below should load every single word into the list words:

import re

with open("/usr/share/dict/words", "r") as file:
    words = re.sub(r"[^\w]", " ", file.read()).split()

def is_word(word):
    return word.lower() in words

is_word("tarts")            # returns True
is_word("jwiefjiojrfiorj")  # returns False

Hope this helps!

answered Apr 28, 2020 at 12:09 by James Ashwood

For a faster NLTK-based solution you could hash the set of words to avoid a linear search.

from nltk.corpus import words as nltk_words

# create this dictionary once, outside the function,
# because you only need to do it once
dictionary = dict.fromkeys(nltk_words.words(), None)

def is_english_word(word):
    try:
        x = dictionary[word]
        return True
    except KeyError:
        return False

answered Jun 27, 2016 at 19:58 by Eb Abadi

I find that there are three package-based solutions to this problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant couldn't be installed easily on win64 with py3. Wordnet doesn't work very well because its corpus isn't complete. So for me, I chose the solution answered by @Sadık, and used set(words.words()) to speed it up.

First:

pip3 install nltk
python3

import nltk
nltk.download('words')

Then:

from nltk.corpus import words
setofwords = set(words.words())

print("hello" in setofwords)
True

answered Feb 3, 2019 at 3:53 by Young Yang

With PyEnchant's enchant.checker.SpellChecker:

from enchant.checker import SpellChecker

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return len(errors) <= 4 and len(quote.split()) >= 3

print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
print(is_in_english("“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”"))

> False
> True

answered May 4, 2017 at 14:16 by grizmin

For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, just use the urllib module to issue a GET request and return the results in JSON format, then parse them with Python's json module. If it's not an English word, you'll get no results.

As another idea, you could query Wiktionary’s API.
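A minimal sketch of the Wiktionary idea using only the standard library (the MediaWiki query API parameters below are my assumption of the usual form; check the API docs before relying on them):

import json
import urllib.parse
import urllib.request

def in_wiktionary(word):
    # Ask the English Wiktionary whether a page with this title exists.
    url = ("https://en.wiktionary.org/w/api.php?action=query&format=json&titles="
           + urllib.parse.quote(word))
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    # Missing pages are reported under the pseudo page id "-1".
    return "-1" not in data["query"]["pages"]

print(in_wiktionary("hello"))  # expected True

Note that a Wiktionary page existing is not quite the same as the word being English; many entries are for other languages.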

answered Sep 24, 2010 at 17:28 by burkestar

Use nltk.corpus instead of enchant. Enchant gives ambiguous results: for example, it returns True for both benchmark and bench-mark, when it should return False for bench-mark.

answered Apr 10, 2021 at 11:51 by Anand Kuamr

Download this txt file https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt

then create a set out of it using the following Python code snippet, which loads about 370k alphabetic English words

>>> with open("/PATH/TO/words_alpha.txt") as f:
...     words = set(f.read().split('\n'))
>>> len(words)
370106

From here onwards, you can check for existence in constant time using

>>> word_to_check = 'baboon'
>>> word_to_check in words
True

Note that this set might not be comprehensive, but it still gets the job done; you should do your own quality checks to make sure it works for your use case as well.

answered May 23, 2022 at 18:19 by Ayush


Here I introduce several ways to identify whether a word consists of English alphabet characters or not.

1. Using the isalpha method

In Python, string objects have a method called isalpha:

word = "Hello"
if word.isalpha():
    print("It is an alphabet")
    
word = "123"
if word.isalpha():
    print("It is an alphabet")
else:
    print("It is not an alphabet")

However, this approach has a minor problem: for example, if you pass a Korean word, isalpha still reports it as alphabetic. (Of course, for non-Korean speakers, this might not be a problem 😅)

To avoid this behavior, call the encode method before calling isalpha; bytes.isalpha only treats ASCII letters as alphabetic.

word = "한글"
if word.encode().isalpha():
    print("It is an alphabet")
else:
    print("It is not an alphabet")

2. Using a regular expression

I think this is a universal approach, regardless of programming language.

import re

word = "hello"
# the pattern must cover the whole word, not just the first character
reg = re.compile(r'[a-zA-Z]+')

if reg.fullmatch(word):
    print("It is an alphabet")
else:
    print("It is not an alphabet")

word = "123"
if reg.fullmatch(word):
    print("It is an alphabet")
else:
    print("It is not an alphabet")

3. Using comparison operators

It depends on the precondition; however, let's assume the goal is to test whether a word is made of English letters.

Therefore, we can apply comparison operators to its first character.

word = "hello"

if 'a' <= word[0] <= 'z' or 'A' <= word[0] <= 'Z':
    print("It is an alphabet")
else:
    print("It is not an alphabet")

Note that we have to consider both upper and lower case. Also, we shouldn't compare the entire word, because string comparison is lexicographic and would behave differently depending on the word's length.

We can also simplify this code using the string's lower or upper method.

word = "hello"

if 'a' <= word[0].lower() <= 'z':
    print("It is an alphabet")
else:
    print("It is not an alphabet")

4. Using the lower and upper methods

This is my favorite approach. Since English letters have distinct lower and upper cases, unlike other characters (numbers or Korean), we can leverage this characteristic to identify the word.

word = "hello"
if word.upper() != word.lower():
    print("It is an alphabet")
else:
    print("It is not an alphabet")

Happy coding!

During NLP and text analytics, several varieties of features can be extracted from a document of words to use for predictive modeling. These include the following.

ngrams

Take a random sample of words from words.txt. For each word in sample, extract every possible bi-gram of letters. For example, the word strength consists of these bi-grams: {st, tr, re, en, ng, gt, th}. Group by bi-gram and compute the frequency of each bi-gram in your corpus. Now do the same thing for tri-grams, … all the way up to n-grams. At this point you have a rough idea of the frequency distribution of how Roman letters combine to create English words.

ngram + word boundaries

To do a proper analysis you should probably create tags to indicate n-grams at the start and end of a word, (dog -> {^d, do, og, g^}) — this would allow you to capture phonological/orthographic constraints that might otherwise be missed (e.g., the sequence ng can never occur at the beginning of a native English word, thus the sequence ^ng is not permissible — one of the reasons why Vietnamese names like Nguyễn are hard to pronounce for English speakers).
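As a rough Python illustration of those boundary-tagged letter n-grams (a sketch of the idea, not the R pipeline shown further below):

from collections import Counter

def char_ngrams(word, n=2):
    # Tag the start and end of the word so boundary constraints are captured.
    tagged = "^" + word + "^"
    return [tagged[i:i + n] for i in range(len(tagged) - n + 1)]

print(char_ngrams("dog"))  # ['^d', 'do', 'og', 'g^']

# Frequency of bi-grams over a tiny corpus:
freqs = Counter(g for w in ["dog", "log", "bog", "frog"] for g in char_ngrams(w))
print(freqs.most_common(3))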

Call this collection of grams the word_set. If you reverse sort by frequency, your most frequent grams will be at the top of the list — these will reflect the most common sequences across English words. Below I show some (ugly) code using package {ngram} to extract the letter ngrams from words then compute the gram frequencies:

#' Return orthographic n-grams for word
#' @param w character vector of length 1
#' @param n integer type of n-gram
#' @return character vector
#' 
getGrams <- function(w, n = 2) {
  require(ngram)
  w <- gsub("(^[A-Za-z])", "^\\1", w)   # tag word start
  w <- gsub("([A-Za-z]$)", "\\1^", w)   # tag word end

  # for ngram processing we must add a space after each letter
  ww <- gsub("([A-Za-z^'])", "\\1 ", w)
  w <- gsub("[ ]$", "", ww)

  ng <- ngram(w, n = n)
  grams <- get.ngrams(ng)
  out_grams <- sapply(grams, function(gram) gsub(" ", "", gram))  # remove spaces
  return(out_grams)
}

words <- list("dog", "log", "bog", "frog")
res <- sapply(words, FUN = getGrams)
grams <- unlist(as.vector(res))
table(grams)

## ^b ^d ^f ^l bo do fr g^ lo og ro 
##  1  1  1  1  1  1  1  4  1  4  1 

Your program will just take an incoming sequence of characters as input, break it into grams as previously discussed, and compare them to your list of top grams. Obviously, you will have to reduce your top-n picks to fit the program size requirement.

consonants & vowels

Another possible feature or approach would be to look at consonant-vowel sequences. Basically, convert all words into consonant-vowel strings (e.g., pancake -> CVCCVCV) and follow the same strategy previously discussed. This program could probably be much smaller, but it would suffer in accuracy because it abstracts phones into higher-order units.
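A quick Python sketch of that consonant-vowel encoding (treating y as a consonant and ignoring anything that isn't a letter, which are simplifying assumptions):

def cv_pattern(word):
    vowels = set("aeiou")
    # Map each letter to V (vowel) or C (consonant).
    return "".join("V" if ch in vowels else "C" for ch in word.lower() if ch.isalpha())

print(cv_pattern("pancake"))  # CVCCVCV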

nchar

Another useful feature would be string length, since the likelihood of a legitimate English word decreases as the number of characters increases.

library(dplyr)
library(ggplot2)

file_name <- "words.txt"
df <- read.csv(file_name, header = FALSE, stringsAsFactors = FALSE)
names(df) <- c("word")

df$nchar <- sapply(df$word, nchar)
grouped <- dplyr::group_by(df, nchar)
res <- dplyr::summarize(grouped, count = n())
qplot(res$nchar, res$count, geom="path", 
      xlab = "Number of characters", 
      ylab = "Frequency", 
      main = "Distribution of English word lengths in words.txt",
      col=I("red"))

[Figure: distribution of English word lengths in words.txt]

Error Analysis

The type of errors produced by this kind of classifier should be nonsense words: words that look like they should be English words but aren't (e.g., ghjrtg would be correctly rejected (a true negative), but barkle would be incorrectly classified as an English word (a false positive)).

Interestingly, zyzzyvas would be incorrectly rejected (false negative), because zyzzyvas is a real English word (at least according to words.txt), but its gram sequences are extremely rare and thus not likely to contribute much discriminatory power.

Solution 1

There is a library called langdetect. It is ported from Google’s language-detection available here:

https://pypi.python.org/pypi/langdetect

It supports 55 languages out of the box.
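A minimal usage sketch (langdetect exposes detect and detect_langs; the detected codes below are what I would expect, not verified output):

from langdetect import detect, detect_langs

print(detect("this is some text written in English"))  # 'en'
print(detect("Ce n'est pas en anglais"))               # 'fr'
print(detect_langs("Ce n'est pas en anglais"))         # e.g. [fr:0.99...]

Note that langdetect can be non-deterministic on very short or ambiguous texts, so single words are not its strong suit.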

Solution 2

You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.

TL;DR:

  • CLD-2 is pretty good and extremely fast
  • lang-detect is a tiny bit better, but much slower
  • langid is good, but CLD-2 and lang-detect are much better
  • NLTK’s Textcat is neither efficient nor effective.

You can install lidtk and classify languages:

$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"                  
fra

Solution 3

A pretrained fastText model worked best for my similar needs

I arrived at your question with a very similar need. I appreciated Martin Thoma’s answer. However, I found the most help from Rabash’s answer part 7 HERE.

After experimenting to find what worked best for my needs, which was making sure 60,000+ text files were in English, I found that fasttext was an excellent tool.

With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.

import fasttext


class English_Check:
    def __init__(self):
        # No need to train a model to detect languages. A very good
        #    pretrained model already exists. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
            # fasttext doesn't like newline characters, but it can take
            #    an array of lines from a file. The two list comprehensions
            #    below just clean up the lines in fla.
            fla = [line.rstrip('\n').strip(' ') for line in fla]
            fla = [line for line in fla if len(line) > 0]

            for line in fla:  # Language-predict each line of the file
                language_tuple = self.model.predict(line)
                # The next two lines simply get the top language prediction
                #    string AND the confidence value for that prediction.
                prediction = language_tuple[0][0].replace('__label__', '')
                value = language_tuple[1][0]

                # Each top language prediction for the lines in the file
                #    becomes a unique key for the this_D dictionary.
                #    Every time that language is found, add the confidence
                #    score to the running tally for that language.
                if prediction not in this_D:
                    this_D[prediction] = 0
                this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # Calculate the relative confidence of the max confidence against all
        #    confidence scores, then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D
                   if self.this_D[key] == max_value][0]

        # We only want to know whether this is English or not.
        return max_key == 'en'

Below is the application / instantiation and use of the above class for my needs.

file_list = []  # some tool to get my specific list of files to check for English

en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)

Solution 4

This is what I used some time ago.
It accepts texts of at least min_text_length words with no more than max_error_count unrecognized words.
Of course, you can play with those settings, but for my use case (website scraping) they worked pretty well.

from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return len(errors) <= max_error_count and len(quote.split()) >= min_text_length

print(is_in_english('“中文”'))
print(is_in_english("“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”"))

> False
> True

Solution 5

Use the enchant library

import enchant

dictionary = enchant.Dict("en_US") #also available are en_GB, fr_FR, etc

dictionary.check("Hello") # prints True
dictionary.check("Helo") #prints False

This example is taken directly from their website

Comments

  • I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true:

    [ "this is some text written in English", 
      "this is some more text written in English", 
      "Ce n'est pas en anglais" ] 
    

    For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. However, is there a good way to do this? I have been Googling, but cannot find anything specific that will let me recognize whether strings are in English or not. Is this functionality not offered in either NLTK or scikit-learn? EDIT: I've seen questions like this and this, but both are for individual words, not a whole document. Would I have to loop through every word in a sentence to check whether the whole sentence is in English?

    I'm using Python, so Python libraries would be preferable, but I can switch languages if needed; I just thought Python would be best for this.


Tags:

Python

Nltk

Wordnet
