Count unique words in Word

This calculator counts the number of unique words in a text (the total number of words minus all repetitions). It also counts the repeated words and can remove all repetitions from the text.
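The three operations described above can be sketched in a few lines of Python. This is only an illustration, not the calculator's actual implementation: the helper name `summarize_words`, the `\w+` tokenization, and the reading of "repeated words" as every occurrence after the first are all assumptions.

```python
import re

def summarize_words(text, case_sensitive=False):
    """Return (unique_count, repeated_count, deduplicated_words).

    Hypothetical helper: a word is any run of letters, digits or underscores.
    """
    words = re.findall(r"\w+", text)
    seen = set()
    kept = []
    for word in words:
        key = word if case_sensitive else word.lower()
        if key not in seen:          # first occurrence: keep it
            seen.add(key)
            kept.append(word)
    unique_count = len(kept)
    repeated_count = len(words) - unique_count   # occurrences after the first
    return unique_count, repeated_count, kept

print(summarize_words("the cat and the dog and the bird"))
# (5, 3, ['the', 'cat', 'and', 'dog', 'bird'])
```

The deduplicated list keeps the first occurrence of each word and preserves the original order.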

PLANETCALC, Unique words count

Anton, 2021-09-30 12:26:11

I want to count unique words in a text, but I want to make sure that words followed by special characters aren’t treated differently, and that the evaluation is case-insensitive.

Take this example:

text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
print(len(set(w.lower() for w in text.split())))

The result would be 16, but I expect it to return 14. The problem is that ‘boy.’ and ‘boy’ are evaluated differently, because of the punctuation.

eandersson

asked Apr 16, 2013 at 23:17

import re
print(len(re.findall(r'\w+', text)))

Using a regular expression makes this simple. Lowercase the text first so that the comparison is case-insensitive, then pass the matches to set to discard duplicates.

print(len(set(re.findall(r'\w+', text.lower()))))
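For reuse, the same idea can be wrapped in a small function. This is a minimal sketch; the name `unique_word_count` is made up for illustration. Note that `\w` also matches digits and underscores, and that an apostrophe splits a word, so "don't" would count as "don" and "t".

```python
import re

def unique_word_count(text):
    """Count distinct words, ignoring case and surrounding punctuation."""
    return len(set(re.findall(r"\w+", text.lower())))

text = ("There is one handsome boy. The boy has now grown up. "
        "He is no longer a boy now.")
print(unique_word_count(text))  # 14
```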

answered Apr 16, 2013 at 23:20

eandersson

You can use a regex here:

In [65]: text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."

In [66]: import re

In [68]: set(m.group(0).lower() for m in re.finditer(r"\w+", text))

Out[68]: 
set(['grown',
     'boy',
     'he',
     'now',
     'longer',
     'no',
     'is',
     'there',
     'up',
     'one',
     'a',
     'the',
     'has',
     'handsome'])

answered Apr 16, 2013 at 23:18

Ashwini Chaudhary

I think you have the right idea in using Python's built-in set type. It works if you first strip the punctuation (such as ‘.’) with replace, and lowercase the text so that the comparison is case-insensitive:

text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
punc_char = ",.?!'\""
# Remove each punctuation character from the text.
for char in punc_char:
    text = text.replace(char, '')
words = set(text.lower().split())
print(len(words))

That should work for you. If you need to filter out other punctuation marks, just add them to punc_char.
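As an alternative to listing punctuation characters by hand, the standard library's `string.punctuation` constant can be combined with `str.translate`. This is a sketch for Python 3, not part of the answer above, and the helper name `strip_punctuation` is hypothetical. It only covers ASCII punctuation, so curly quotes would be left in place, and "don't" becomes "dont".

```python
import string

def strip_punctuation(text):
    """Remove every ASCII punctuation character from text."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

text = ("There is one handsome boy. The boy has now grown up. "
        "He is no longer a boy now.")
words = set(strip_punctuation(text).lower().split())
print(len(words))  # 14
```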

Abraham J.

answered Apr 17, 2013 at 1:01

user2288672

First, you need to get a list of words. You can use a regex as eandersson suggested:

import re
words = re.findall(r'\w+', text)

Now you want the number of unique entries. There are a couple of ways to do this. One is to iterate through the words list and use a dictionary to keep track of how many times you have seen each word:

cwords = {}
for word in words:
    try:
        cwords[word] += 1
    except KeyError:
        cwords[word] = 1

Finally, you can get the number of unique words with:

len(cwords)
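The dictionary-counting loop above is what `collections.Counter` from the standard library does in one call. The sketch below assumes the sample text from the question and, unlike the snippet above, lowercases the text first so that "The" and "the" are counted together.

```python
import re
from collections import Counter

text = ("There is one handsome boy. The boy has now grown up. "
        "He is no longer a boy now.")

# Counter maps each distinct word to its number of occurrences.
counts = Counter(re.findall(r"\w+", text.lower()))

print(len(counts))     # number of unique words: 14
print(counts["boy"])   # occurrences of "boy": 3
```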

answered Apr 16, 2013 at 23:30

vowelless

Apart from VBA, one can develop such an application using the OpenOffice API to read the contents of the Word document, process it, and export the results as a CSV file that can be opened in a spreadsheet application.

However, it actually takes just a few lines of code if you are familiar with any programming language.
For example, in Python you can easily do it like this:

Here we define a simple function that counts the words in a given list:

def countWords(a_list):
    # Map each distinct word to the number of times it occurs in a_list.
    words = {}
    for item in a_list:
        words[item] = a_list.count(item)
    # Return (word, count) pairs, most frequent first.
    return sorted(words.items(), key=lambda item: item[1], reverse=True)

The rest is a matter of preparing the content of the document. First, paste it in:

content = """This is the content of the word document. Just copy paste it. 
It can be very very very very long and it can contain punctuation 
(they will be ignored) and numbers like 123 and 4567 (they will be counted)."""

Here we remove the punctuation, EOL, parentheses etc. and then generate a word list for our function:

import re

cleanContent = re.sub('[^a-zA-Z0-9]',' ', content)

wordList = cleanContent.lower().split()

Then we run our function and store its result (word-count pairs) in another list and print the results:

result = countWords(wordList)

for words in result:
    print(words)

So the result is (the order of words with the same count may differ depending on your Python version):

('very', 4)
('and', 3)
('it', 3)
('be', 3)
('they', 2)
('will', 2)
('can', 2)
('the', 2)
('ignored', 1)
('just', 1)
('is', 1)
('numbers', 1)
('punctuation', 1)
('long', 1)
('content', 1)
('document', 1)
('123', 1)
('4567', 1)
('copy', 1)
('paste', 1)
('word', 1)
('like', 1)
('this', 1)
('of', 1)
('contain', 1)
('counted', 1)

You can remove the parentheses and commas using search/replace if you want.

All you need to do is download Python 3, install it, open IDLE (which comes with Python), replace the content with the text of your Word document, and run the commands one at a time in the given order.
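Since the answer mentions exporting the results as CSV, here is a hedged sketch of how the counting and the export could be combined using only the standard library. The function names `word_frequencies` and `to_csv` are made up for illustration, and `collections.Counter` is swapped in for `countWords`: calling `a_list.count` inside the loop rescans the whole list for every word, while `Counter` tallies everything in a single pass.

```python
import csv
import io
import re
from collections import Counter

def word_frequencies(content):
    """Return (word, count) pairs, most frequent first."""
    # Same cleaning step as above: anything that is not a letter or digit
    # becomes a space, then the text is lowercased and split.
    words = re.sub(r"[^a-zA-Z0-9]", " ", content).lower().split()
    return Counter(words).most_common()

def to_csv(pairs):
    """Render (word, count) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(pairs)
    return buffer.getvalue()

content = "It can be very very very very long and it can be short."
print(to_csv(word_frequencies(content)))
```

To save the result to a file instead of printing it, the returned string can be written to a path of your choosing and opened in a spreadsheet application.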

October 15th, 2009

Hey, Scripting Guy! Question

Hey, Scripting Guy! I need to obtain a listing of unique words from a Microsoft Word document. I know that there is the Sort-Object cmdlet that can be used to retrieve unique items, and there is the Get-Content cmdlet that can read the text of a text file. However, the Get-Content cmdlet is not able to read a Microsoft Word document, and I do not think I can use the Sort-Object cmdlet to produce a unique listing of words.

— EM

Hey, Scripting Guy! Answer

Hello EM,

Microsoft Scripting Guy Ed Wilson here. I am listening to the Bourbon Street Rag on my Zune, and was day dreaming a bit about my last trip to New Orleans. The really good news is that TechEd 2010 will be held in New Orleans, and (drum roll please) the Microsoft Scripting Guys already have set aside the budget to be there! “Do you know what it means to miss New Orleans?” the song continues to amble. Now the upbeat sound of Van Halen is coming from my Zune. Quite the segue! It’s a shuffle kind of day.

I am having a great day today, and I have responded to several really cool questions sent to scripter@microsoft.com e-mail. EM, your question was really interesting, and I decided to write the GetUniqueWordsFromWord.ps1 script that is shown here.

GetUniqueWordsFromWord.ps1

$document = "C:\fso\WhyUsePs2.docx"
$app = New-Object -ComObject word.application
$app.Visible = $false
$doc = $app.Documents.Open($document)
$words = $doc.words
$outputObject = @()
"There are " + $words.count + " words in the document"
For($i = 1 ; $i -le $words.count ; $i ++)
     {
       $object = New-Object -typeName PSObject
       $object |
       Add-Member -MemberType noteProperty -name word -value $words.item($i).text
       $outputObject += $object
     }
$doc.close()
$app.quit()
$outputObject | sort-object -property word -unique

Before jumping into the GetUniqueWordsFromWord.ps1 script, take a look at the Word document seen here:

Image of Word document with 231 words

As you can see, there are 231 words in the document. Many of these words, such as “after,” appear only once, but others, such as “the,” are repeated. The GetUniqueWordsFromWord.ps1 script will display a list of all the unique words in the Microsoft Word document.

To display the unique words in the Microsoft Word document, the GetUniqueWordsFromWord.ps1 script begins by using the $document variable to hold the path to the Microsoft Word document that is to be analyzed. Next, the word.application COM object is used to create an instance of the application object. The application object is the main object that is used when working with the Microsoft Word automation model. The visible property is set to $false, which means the Microsoft Word document will not be visible while the Windows PowerShell script is running. This section of the script is shown here:

$document = "C:\fso\WhyUsePs2.docx"

$app = New-Object -ComObject word.application

$app.Visible = $false

After the application object has been created, the documents property from the application object is used to obtain an instance of the documents collection object. The open method from the documents collection object is used to open the document that is specified in the $document variable. The open method from the documents collection object returns a document object that is stored in the $doc variable. This line of the script is shown here:

$doc = $app.Documents.Open($document)

The words property of the document object is used to return a words collection object that represents all of the words in the document. The words collection object is stored in the $words variable as seen here:

$words = $doc.words

After the words collection has been created, it is time to create an empty array that will be used to store the custom objects the script will create. It is also time to display a message on the Windows PowerShell console that indicates how many words are in the document. Please note that in most cases, the number reported by the count property of the words collection object will not match the word count shown at the bottom of the Microsoft Word document. This is because the count property treats some characters as words that the count shown in the document does not. This section of the script is seen here:

$outputObject = @()

"There are " + $words.count + " words in the document"

The for statement is used to set up a loop that will be used to walk through the collection of words stored in the words collection object. The loop begins at 1 and continues as long as the value of the variable $i is less than or equal to the count of the number of words in the collection. On each pass through the loop, the value of the $i variable will be incremented by 1. This is seen here:

For($i = 1 ; $i -le $words.count ; $i ++)

     {

Inside each loop, a custom Windows PowerShell PSObject is created by using the New-Object cmdlet and the returned PSObject is stored in the $object variable. This is shown here:

      $object = New-Object -typeName PSObject

The Add-Member cmdlet is used to add a noteProperty to the PSObject stored in the $object variable. The name of the noteProperty is word, and the value is the next word in the collection of words. The item method is used to retrieve the word from the words collection by index number. This is not a direct retrieval, however, because the item method returns a range object and not a word object. The range object does have a text property that is used either to get or to set the value of the text in the selected range. Because this range is a single word, the text property from the range object retrieves the next word from the words collection object. This is shown here:

       $object |

       Add-Member -MemberType noteProperty -name word -value $words.item($i).text

After the word property has been added to the PSObject, the PSObject that is stored in the $object variable is added to the $outputObject array, as shown here:

       $outputObject += $object

     }

The document object is closed by using the close method and the application object is destroyed by calling the quit method. This is shown here:

$doc.close()

$app.quit()

The array of objects stored in the $outputObject variable is piped to the Sort-Object cmdlet, where the object is sorted on the word property and only unique words are displayed on the Windows PowerShell console. This line of code is shown here:

$outputObject | sort-object -property word -unique
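For readers more at home in Python, the effect of that last pipeline can be approximated as follows. This is a rough analogue rather than a port of the script: `unique_sorted` is a hypothetical helper, the sample list merely stands in for the text values returned by the words collection, and stripping surrounding whitespace is an added assumption that the PowerShell script does not make. Like Sort-Object -Unique in its default mode, the comparison ignores case.

```python
def unique_sorted(words):
    """Return words sorted case-insensitively with duplicates dropped."""
    seen = set()
    result = []
    for word in sorted(words, key=lambda w: w.strip().lower()):
        key = word.strip().lower()
        if key and key not in seen:   # skip empty items and repeats
            seen.add(key)
            result.append(word.strip())
    return result

# Stand-in data: items may carry trailing spaces, and punctuation
# can appear as an item of its own.
sample = ["The ", "script ", "reads ", "the ", "document", ".", "The "]
print(unique_sorted(sample))
# ['.', 'document', 'reads', 'script', 'The']
```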

When the script is run, the output shown in the following image is displayed:

Image of output of the script

Well, EM, that is about all there is to retrieving unique words from a Microsoft Word document.

If you want to know exactly what we will be looking at tomorrow, follow us on Twitter or Facebook. If you have any questions, send e-mail to us at scripter@microsoft.com or post your questions on the Official Scripting Guys Forum. See you tomorrow. Until then, peace.

Ed Wilson and Craig Liebendorfer, Scripting Guys
