LIWC: Linguistic Inquiry and Word Count


Linguistic Inquiry and Word Count (LIWC)

Brief description

Linguistic Inquiry and Word Count (LIWC; pronounced "Luke") is a text analysis program that calculates the percentage of words in a given text that fall into one or more of over 80 linguistic, psychological, and topical categories indicating various social, cognitive, and affective processes. You can use LIWC, for example, to determine the degree to which a text uses positive or negative emotions, self-references, or causal words.

The core of the program is a dictionary containing words that belong to these categories. Dictionaries for many languages are available; it is also possible to define your own dictionary, for example to define one or more categories that are not included in the standard dictionary.

Instruction

Operator’s Manual LIWC 2015
Extensive online software manual.

Introduction to Linguistic Inquiry and Word Count (Centre for Human Evolution, Cognition and Culture at the University of British Columbia).
Video clip introducing the use of LIWC.

Availability

LIWC is available on VU PCs for staff and students of the Faculty of Humanities (with limitations on concurrent access).

More information

LIWC website.

How it works. Brief background information about LIWC.

Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24-54. DOI: 10.1177/0261927X09351676
This article reviews several computerized text analysis methods and describes how LIWC was created and validated.


Project description


Linguistic Inquiry and Word Count (LIWC) analyzer.

The LIWC lexicon is proprietary, so it is not included in this repository,
but this Python package requires it.
The lexicon data can be acquired (purchased) from liwc.net.
This package reads from the LIWC2007_English100131.dic (MD5: 2a8c06ee3748218aa89b975574b4e84d) file,
which must be available on any system where this package is used.

The LIWC2007 .dic format looks like this:

%
1   funct
2   pronoun
[...]
%
a   1   10
abdomen*    146 147
about   1   16  17
[...]
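For illustration, here is a minimal sketch of how such a two-part file could be parsed by hand (the category header sits between the two '%' lines, followed by word/stem entries that point to category ids). This is an assumption about a hand-rolled reader, not the liwc package's own implementation:

def read_dic(path):
    """Parse a LIWC2007-style .dic file into (categories, lexicon)."""
    categories = {}  # category id (string) -> category name
    lexicon = {}     # word or stem (may end in '*') -> list of category names
    with open(path, encoding='utf-8') as f:
        lines = [line.strip() for line in f if line.strip()]
    first = lines.index('%')
    second = lines.index('%', first + 1)
    for line in lines[first + 1:second]:
        cat_id, cat_name = line.split(None, 1)
        categories[cat_id] = cat_name
    for line in lines[second + 1:]:
        parts = line.split()
        lexicon[parts[0]] = [categories[c] for c in parts[1:]]
    return categories, lexicon

Note that matching stem entries such as abdomen* against tokens requires prefix matching, which is what the package's load_token_parser takes care of for you.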

Setup

Install from PyPI:

pip install -U liwc

Example

import re
from collections import Counter

def tokenize(text):
    # you may want to use a smarter tokenizer
    for match in re.finditer(r'\w+', text, re.UNICODE):
        yield match.group(0)

import liwc
parse, category_names = liwc.load_token_parser('LIWC2007_English100131.dic')
  • parse is a function from a token of text (a string) to a list of matching LIWC categories (a list of strings)
  • category_names is all LIWC categories in the lexicon (a list of strings)
gettysburg = '''Four score and seven years ago our fathers brought forth on
  this continent a new nation, conceived in liberty, and dedicated to the
  proposition that all men are created equal. Now we are engaged in a great
  civil war, testing whether that nation, or any nation so conceived and so
  dedicated, can long endure. We are met on a great battlefield of that war.
  We have come to dedicate a portion of that field, as a final resting place
  for those who here gave their lives that that nation might live. It is
  altogether fitting and proper that we should do this.'''
gettysburg_tokens = tokenize(gettysburg)
# now flatmap over all the categories in all of the tokens using a generator:
gettysburg_counts = Counter(category for token in gettysburg_tokens for category in parse(token))
# and print the results:
print(gettysburg_counts)
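Since LIWC itself reports each category as a percentage of all words, a natural follow-up is to normalize the raw counts by the total number of tokens. A small sketch reusing the names above (the generator was consumed by Counter, so we tokenize again):

total = sum(1 for _ in tokenize(gettysburg))
for category, count in gettysburg_counts.most_common(5):
    print('%s: %.2f%%' % (category, 100.0 * count / total))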

License

Copyright (c) 2012-2019 Christopher Brown.
MIT Licensed.


LIWC Dictionary (Linguistic Inquiry and Word Count)

The LIWC dictionary used in this demonstration is composed of 5,690 words and word stems. Each word or word stem defines one or more word categories. For example, the word 'cried' is part of four word categories: sadness, negative emotion, overall affect, and past tense verb. Hence, if it is found in the target text, each of these four category scale scores will be incremented. As in this example, many of the LIWC categories are arranged hierarchically. All anger words, by definition, will be categorized as negative emotion and overall emotion words.

Each of the 69 preset LIWC categories used in this demo is composed of a list of dictionary words that define that scale. The table below provides a comprehensive list of these LIWC categories with sample scale words.

LIWC Dimensions and Sample Words

DIMENSION: EXAMPLES

I. STANDARD LINGUISTIC DIMENSIONS

Pronouns: I, them, itself
Articles: a, an, the
Past tense: walked, were, had
Present tense: is, does, hear
Future tense: will, gonna
Prepositions: with, above
Negations: no, never, not
Numbers: one, thirty, million
Swear words: *****

II. PSYCHOLOGICAL PROCESSES

Social Processes: talk, us, friend
Friends: pal, buddy, coworker
Family: mom, brother, cousin
Humans: boy, woman, group
Affective Processes: happy, ugly, bitter
Positive Emotions: happy, pretty, good
Negative Emotions: hate, worthless, enemy
Anxiety: nervous, afraid, tense
Anger: hate, kill, pissed
Sadness: grief, cry, sad
Cognitive Processes: cause, know, ought
Insight: think, know, consider
Causation: because, effect, hence
Discrepancy: should, would, could
Tentative: maybe, perhaps, guess
Certainty: always, never
Inhibition: block, constrain
Inclusive: with, and, include
Exclusive: but, except, without
Perceptual Processes: see, touch, listen
Seeing: view, saw, look
Hearing: heard, listen, sound
Feeling: touch, hold, felt
Biological Processes: eat, blood, pain
Body: ache, heart, cough
Sexuality: horny, love, incest
Relativity: area, bend, exit, stop
Motion: walk, move, go
Space: down, in, thin
Time: hour, day, o'clock

III. PERSONAL CONCERNS

Work: work, class, boss
Achievement: try, goal, win
Leisure: house, TV, music
Home: house, kitchen, lawn
Money: audit, cash, owe
Religion: altar, church, mosque
Death: bury, coffin, kill

IV. SPOKEN CATEGORIES

Assent: agree, OK, yes
Nonfluencies: uh, rr*
Fillers: blah, you know, I mean

Selected References

Pennebaker, J. W. (1997). Writing about emotional experiences as a therapeutic process. Psychological Science, 8, 162-166.

Pennebaker, J. W., & Francis, M. E. (1996). Cognitive, emotional, and language processes in disclosure. Cognition and Emotion, 10, 601-626.

Pennebaker, J. W., & King, L. A. (1999). Linguistic styles: Language use as an individual difference. Journal of Personality and Social Psychology, 77, 1296-1312.

Pennebaker, J. W., Mayne, T., & Francis, M. E. (1997). Linguistic predictors of adaptive bereavement. Journal of Personality and Social Psychology, 72, 863-871.

Pennebaker, J. W. (2002). What our words can say about us: Toward a broader language psychology. Psychological Science Agenda, 15, 8-9.

Newman, M. L., Pennebaker, J. W., Berry, D. S., & Richards, J. M. (2003). Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin, 29(5), 665-675.

Introducing the Linguistic Inquiry and Word Count

by Dr. Ryan Nichols, Philosophy, Cal State Fullerton, Orange County CA

As I write this column there are, remarkably, no YouTube guides to using the Linguistic Inquiry and Word Count. This is a shame, since the Linguistic Inquiry and Word Count, 'LIWC' (pronounced 'luke') for short, is one of the best text analysis software tools out there.

LIWC2007 logo represented with some word categories. Source: Author image

LIWC allows users to look under the hood of works of literature. When uploading a text to LIWC, the user will receive an output containing more than 70 columns of data. For example, if I upload this blog post to LIWC, it might return the result that 17.32% of the text falls under LIWC’s cognition category while only 1.2% falls under the religion category, and so on. This is useful information for several reasons illustrated in this and the following post.

LIWC's design has made it a favorite for psychologists, but it also finds use in marketing, Twitter analysis, mental health diagnostics, and much more. Psychologists across the world have developed LIWC dictionaries in their native languages. As of writing, supported languages include Arabic, Chinese, Dutch, English, French, German, Italian, Portuguese, Russian, Serbian, Spanish, and Turkish. LIWC is an extremely affordable software tool: LIWClite7 is $30 USD, while LIWC2007, the full version, is $90 USD. (Compared to shareware text analysis software, this is not cheap. But proceeds from LIWC funnel to the University of Texas Department of Psychology to support its work.)

Another key reason for praising LIWC is the quality of its dictionary design. The LIWC2007 dictionary contains 4,500 words and word stems. Each is filed into one or more subdictionaries, which represent the 55 word categories through which LIWC compiles a text. For example, the word "cried" is part of "five word categories: sadness, negative emotion, overall affect, verb, and past tense verb. Hence, if it is found in the target text, each of these five subdictionary scale scores will be incremented" (Pennebaker et al., 2007, p. 4). What makes this so special is the effort Professor Jamie Pennebaker and his collaborators put into psychometrically validating the subdictionaries. This means that values across LIWC categories have been shown to correlate with Big Five personality traits (Pennebaker & King, 1999; Mehl, Gosling, & Pennebaker, 2006).

The psychometric validation of LIWC categories is significant because it allows LIWC users to draw justified inferences from word frequencies to the psychological states of authors. For this reason, the potential for LIWC's use in the humanities, and in religion in particular, is largely untapped. CERC is using it for a few projects. In a pilot research project designed to test the application of LIWC to research questions in the humanities, Justin Lynn, Ben Purzycki, and I compiled a large corpus of literary texts from three genres (science fiction, fantasy, and mystery) in order to test the interpretations of humanities scholars about genre. In a research project about contemporary Protestantism, Oliver Gunther, Carson Logan, and I compiled about 400 sermons drawn from 12 denominations in order to test whether the language used across the denominations, in particular their use of supernatural agency terms, correlates strongly with known differences in theological orientation and known categories in the sociology of religion.

In two upcoming posts about LIWC we will describe each of these projects in more detail, to give a sense of the questions a humanist can pursue with the Linguistic Inquiry and Word Count. In the meantime, given the dearth of instructional videos about LIWC, I have recorded a video introduction here.

NLP tutorial

fun thought:
The process of writing a thesis seems to follow a pattern:

  1. I can do this!
  2. Oh wait, why does this data look like this? I need to check.
  3. Oh crap, I need to change the whole thing.
  4. Phew, it is still fixable.
  5. I can do this!
  6. Oh wait, why does this data look like this? ...

Table of contents

  1. May 13th Linguistic Inquiry and Word Count (LIWC)
  2. Google colab Review
  3. SHAP and interpretable machine learning
  4. BERT Embeddings
  5. SVM
  6. Deep learning
  7. Topic modelling
  8. Deep learning (Tensorflow, Pytorch with Google Colab pro GPU)
  9. Twarc and Twitter API
  10. Useful functions to know: Regex, Pandas, etc.

May 13th Linguistic Inquiry and Word Count (LIWC)

Introduction

The main website of LIWC can be found here: https://www.liwc.app/
It says: "LIWC is the gold standard in software for analyzing word use. It can be used to study a single individual, groups of people over time, or all of social media."
You may wonder why we need this tool at all. I think its value lies in the fact that it focuses on psychometric properties. The first line of the LIWC manual says: "The words that people use in everyday life tell us about their psychological states: their beliefs, emotions, thinking habits, lived experiences, social relationships, and personalities." In other words, "people's words have tremendous psychological value." This fits the goal of NLP well.

A 90-day license costs $40, which is not too bad. Thank you, BDS, for the generous scholarship. I want to review whether it is worth it and what kind of insights it provides.

How to install:

The installation process is well explained, and you receive a separate email about it.

Manual:

You can find the manual for the software here: https://www.liwc.app/help/psychometrics-manuals
I want to summarize the important points so that I can remember them in due time.
Firstly, there are two versions, LIWC-22 and LIWC-15. I decided to use the newer version, LIWC-22.
There are multiple modules in LIWC-22, namely: LIWC Analysis, Dictionary Workbench, Word Frequencies and Word Clouds, Topic Modeling with the Meaning Extraction Method, Narrative Arc, Language Style Matching, Contextualizer, Case Studies, and Prepare Transcripts.

The main module is LIWC Analysis. LIWC Analysis uses an internal dictionary composed of over 12,000 words, word stems, phrases, and select emoticons. The manual covers how the LIWC-22 dictionary was developed over time. The LIWC-22 dictionary was tested on corpora that include multiple samples of college applications, blogs, conversations, Enron emails, Facebook posts, movies, novels, the NYT, Reddit, short stories, stream-of-consciousness essays, U.S. Congressional speeches, thematic apperception tests, tweets, and Yelp reviews.

On page 11 of the manual, Table 2 shows the LIWC-22 language dimensions and their reliability. The LIWC-22 language dimensions are the outputs of LIWC Analysis. When you run the software on your data, you receive a CSV file (my input was a CSV file) containing the different language dimensions. The major categories are summary variables, linguistic dimensions, psychological processes, and the expanded dictionary.

Summary variables include: word count, analytical thinking, clout, authentic, emotional tone, words per sentence, big words, and dictionary words.
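To get a quick feel for the output, the CSV can be loaded with pandas. The file name and the summary-variable column names below (WC, Analytic, Clout, Authentic, Tone) are assumptions on my part and may differ between LIWC-22 exports, so adjust them to your own file:

import pandas as pd

# hypothetical file name; use the CSV that LIWC-22 wrote for your data
liwc_out = pd.read_csv('liwc22_results.csv')

# assumed summary-variable headers; check your own export for the exact names
summary_cols = ['WC', 'Analytic', 'Clout', 'Authentic', 'Tone']
print(liwc_out[summary_cols].describe())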

Psychological processes can be interesting to many. These include the subcategories drives, cognition, memory, affect, social processes, and social referents.

The expanded dictionary includes the following: culture, lifestyle, physical, states, motives, perception, time orientation, and conversational.

For internal validity, each subcategory has a Cronbach's alpha. If you do not know what Cronbach's alpha is, please look it up.
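If you ever want to compute Cronbach's alpha yourself, say for a set of item columns, a minimal sketch of the standard formula (alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)) looks like this:

import numpy as np

def cronbach_alpha(items):
    """items: 2-D array-like, rows = observations, columns = items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)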

Questions to be answered.

I sent an email to the LIWC team to ask whether I need to do any text pre-processing before using LIWC on Twitter data. Their reply was as follows:

"""
Thanks so much for the email. You don't need to remove the hashtags or mentions. These words however will not get categorized by liwc. For instance, if a tweet says #excited, this word won't get categorized in the positive emotion dictionary. I think it gets counted in the "netspeak" dictionary; I need to double check this. If you want the hashtagged words to get categorized based on the meanings of those words, you'll need to remove the hashtags. But liwc will not throw an error if you leave the hashtags in. Same for mentions. Let me know if this doesn't make sense or if you have more questions.
"""
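Based on that reply, if you do want hashtagged or mentioned words to be scored on their own meaning, one option is to strip the '#' and '@' markers before running LIWC. A rough sketch of my own preprocessing (not something LIWC requires):

import re

def strip_twitter_markers(tweet):
    # '#excited' -> 'excited', '@someone' -> 'someone'
    return re.sub(r'[#@](\w+)', r'\1', tweet)

print(strip_twitter_markers('so #excited to meet @someone today'))
# so excited to meet someone today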

About preprocessing: emoji, hashtags, etc.
Does BERT contain the information covered in LIWC?

Deep learning

Why do we need to scale the features before running deep learning models? See here.
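Independently of the linked answer, here is a minimal scikit-learn sketch of what scaling looks like in practice; the feature matrix is a random stand-in for LIWC output, and the key point is to fit the scaler on the training split only:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy stand-in for a LIWC feature matrix (rows = documents, columns = categories)
X = np.random.rand(100, 10) * 100
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training split only
X_test_scaled = scaler.transform(X_test)        # apply the same mean/std to the test split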

Topic modelling

Robust topic modelling

Twarc and Twitter API

(May 15)
The Twitter API can be intimidating, so here is an easy guide. There are many things you need to know when you go into the Twitter Developer Platform. If you are a master's student, PhD student, or researcher, then you are eligible for a developer account. A detailed explanation of this process can be found here. Essentially, you need to register as a developer, which you do by filling out a registration form. Good to know: they use a bot to classify whether your application is fake or not. Twitter cancelled my API app because of that little bot. I wrote them an email and the issue got solved really quickly.

Now, there are multiple tools that you can use to scrape Tweets using Twitter API. I recommend Twarc2. If you need a large set of data, do not use Postman.

When you search for tweets, there is a geotag function. Geotagging allows you to find tweets that come from a specific location; the parameter is defined with geographical coordinates. Here is a tutorial on how to use this function.

Useful functions to know.

Regex or Pandas

(May 14th)
I wanted to cover regex because I have used it in many cases, especially when searching for rows that contain specific strings.
For example, I use regex to remove certain words from a string or to join strings together.

But I did not know Pandas has a similar function. Here is a link to the website.
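For reference, a small pandas sketch of both tasks, filtering rows whose text contains a pattern and removing a word, with a made-up column name:

import pandas as pd

df = pd.DataFrame({'text': ['I love #mondays', 'hate mondays', 'a neutral tweet']})

# keep rows whose text matches a regex pattern
mondays = df[df['text'].str.contains('mondays', case=False, regex=True)]

# remove a word from every row
df['clean'] = df['text'].str.replace(r'#?mondays', '', regex=True).str.strip()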

How to remove quotation marks from a string

Let's say you have a string that contains double quotation marks, but you want the same string without them. What can you do? There are multiple options; one is to use regex. But what if you have an entire column? Let's say you have a column A (treated here as a list of strings). Then you can write:

import re
A = [re.sub('"', '', x) for x in A]
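If A is a pandas column rather than a plain list, the vectorized equivalent (my own addition) would be:

import pandas as pd

A = pd.Series(['"hello"', 'she said "hi"'])
A = A.str.replace('"', '', regex=False)  # -> ['hello', 'she said hi']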

how to store a list of strings into a txt file

for i in userid:
    print(i)
    # assumes an authenticated client with a get_follower_ids method (e.g. tweepy)
    l = api.get_follower_ids(user_id=i)
    with open(f"{i}_followers.txt", 'w') as outfile:
        for element in l:
            outfile.write('%s,' % element)

how to store a list of integers into a txt file

for i in userid:
    print(i)
    l = api.get_follower_ids(user_id = i)
    with open(f"{i}_followers.txt", 'w') as outfile:
        for element in l:
            outfile.write('%i,' % element)

Twarc2

How to write a string per line

If you use Twarc2 and want to scrape information for multiple users, or the followers of multiple users, you need a file with one username per line.
How do you create this?

# define list of places
places = ['Berlin', 'Cape Town', 'Sydney', 'Moscow']

with open('listfile.txt', 'w') as filehandle:
    for listitem in places:
        filehandle.write('%s\n' % listitem)

Scraping followers of multiple users

You can consult this.
Using the command line, if you have a list of usernames, one per line, in a file called target_users.txt:

while read line; do twarc2 followers "$line" "followers_of_$line.jsonl" && echo "$line"; done < target_users.txt

GitHub

If you would like to create a tutorial like this, please consult this website

Drop duplicates

I found many duplicate rows in my data, so I wanted to remove these users.
How can I do that?
In this case, you want to use 'dataframe.drop_duplicates()'.
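A short sketch of that, assuming the user id lives in a column called 'user_id' (a hypothetical name, adjust to your own data):

import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 2], 'text': ['a', 'a', 'b']})

# drop rows that are exact duplicates across all columns
df = df.drop_duplicates()

# or keep only the first row per user
df = df.drop_duplicates(subset='user_id', keep='first')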
