Presidential Debates with qdap-beta

qdap brief intro
For the past year I’ve been working on a package (qdap) to assist my field in quantitative discourse analysis; basically looking at patterns in language. It’s still a ways from being finished and lacks documentation (roxygen2 is my friend), but after seeing the presidential debates yesterday I decided to try using some of the package’s functions on a transcript of the dialogue.

Getting qdap to work may take some finagling because the package relies on the openNLP package, so you have to make sure the correct version of Java is installed. I know the package can be installed on all three major operating systems. You’ll also quickly notice that it relies on the tm, ggplot2, and wordcloud packages.

Note: I display the graphics here with .png files but recommend .pdf or .svg as the image is much clearer. For a combined pdf version of the graphics in this post click here.

Getting and cleaning transcripts of the debate

url_dl("pres.deb1.docx")  #downloads a docx file of the debate to wd
# the read.transcript function allows reading in of docx file 
# special thanks to Bryan Goodrich for his work on this
dat <- read.transcript("pres.deb1.docx", col.names=c("person", "dialogue"))
# qprep wrapper for several lower level qdap functions
# removes brackets & dashes; replaces numbers, symbols & abbreviations
dat$dialogue <- qprep(dat$dialogue)  
# sentSplit splits turns of talk into sentences
# special thanks to Dason Kurkiewicz for his work on this
dat2 <- sentSplit(dat, "dialogue", stem.col=FALSE)  
htruncdf(dat2)   #view a truncated version of the data(see also truncdf)
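
For readers who do not have qdap installed yet, the idea behind splitting turns of talk into sentences can be sketched in base R. This is only an illustration, not how sentSplit is implemented: the data frame `dat_demo`, the helper `split_turns`, and the `turn_sentence` column are hypothetical names, and the sketch assumes the text has already been cleaned so that periods only end sentences (which is part of why the abbreviation replacement above matters).

```r
# Hypothetical stand-in for the transcript; not the real debate data
dat_demo <- data.frame(
  person   = c("LEHRER", "ROMNEY"),
  dialogue = c("Good evening. Welcome to the debate.",
               "Thank you, Jim. It is an honor to be here!"),
  stringsAsFactors = FALSE
)

# Split each turn of talk on whitespace that follows ., ? or !
# (assumes abbreviations such as "Mr." were already replaced)
split_turns <- function(df, text_col = "dialogue") {
  pieces <- strsplit(df[[text_col]], "(?<=[.?!])\\s+", perl = TRUE)
  n <- lengths(pieces)  # sentences per turn
  out <- data.frame(
    person = rep(df$person, n),
    # hypothetical index: turn number, then sentence number within the turn
    turn_sentence = paste(rep(seq_along(n), n), sequence(n), sep = "."),
    stringsAsFactors = FALSE
  )
  out[[text_col]] <- unlist(pieces)
  out
}

sent_demo <- split_turns(dat_demo)
sent_demo
```

Each row of `sent_demo` is one sentence that still carries its speaker, which is the shape the later per-person functions expect.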

Wordclouds (relies on Ian Fellows’ wordcloud package)

#first put a unique character between words we want to keep together
dat2$dia2 <- space_fill(dat2$dialogue, c("Governor Romney", "President Obama", 
    "middle class", "The President", "Mister President"))

#Generate target words to color by
tw <- list(
        health=c("health", "insurance", "medic", "obamacare", "hospital"), 
        economic = c("econom", "jobs", "unemploy", "business", "banks", 
            "budget", "market", "paycheck"),
        foreign = c("war ", "terror", "foreign"),
        class = c("middle~~class", "poor", "rich"),
        opponent = c("romney ", "obama", "the~~president", "mister~~president")
)

#create stop word list from qdap data set Top25Words but exclude he and I
sw <- exclude(Top25Words, "he", "I")

#the word cloud by grouping variable function
with(dat2,, person, 
    proportional = TRUE,
    target.words = tw,
    cloud.colors = c("red", "blue", "black", "orange", "purple", "gray45"),
    legend = names(tw),
    max.word.size = 4,
    char2space = "~~"))
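
The two ideas in this step, gluing phrases together with a filler and then coloring words by theme, can be sketched in base R. This is a rough illustration under my own assumptions rather than qdap's internals: `join_phrases` and `theme_of` are hypothetical names, the sentence is made up, and I assume target words are matched as fragments inside a word, as the stems in the list above suggest.

```r
# Hypothetical helper: replace the spaces inside each phrase with a filler
join_phrases <- function(x, phrases, sep = "~~") {
  for (p in phrases) {
    x <- gsub(p, gsub(" ", sep, p, fixed = TRUE), x, fixed = TRUE)
  }
  x
}

txt <- "Governor Romney spoke about the middle class and jobs."
joined <- join_phrases(txt, c("Governor Romney", "middle class"))

# Tokenise on whitespace; joined phrases now survive as single tokens
words <- tolower(strsplit(gsub("[.,?!]", "", joined), "\\s+")[[1]])

# Made-up themes in the same shape as the target word list
targets <- list(economic = c("econom", "jobs"),
                class    = c("middle~~class", "poor"))

# Give each word the first theme whose fragment it contains, else "other"
theme_of <- vapply(words, function(w) {
  hit <- vapply(targets, function(frags) {
    any(vapply(frags, function(f) grepl(f, w, fixed = TRUE), logical(1)))
  }, logical(1))
  if (any(hit)) names(targets)[which(hit)[1]] else "other"
}, character(1))

theme_of
```

A plotting function can then map each theme to a color and turn the filler back into a space for display, which is the role char2space plays in the call above.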

Visuals of the function
wordcloud 1
wordcloud 2
wordcloud 3

Gantt Plot of the dialogue over time
Obviously (when you see the output), this uses Hadley Wickham’s ggplot2.

# special thanks to Andrie de Vries for his work on this function
with(dat2, gantt_plot(dialogue, person,  xlab = "duration(words)", x.tick=TRUE,
    minor.line.freq = NULL, major.line.freq = NULL, rm.horiz.lines = FALSE))
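
To make "duration (words)" concrete, here is a small base R sketch of the bookkeeping a Gantt plot of dialogue needs: each turn gets a start and end position measured in cumulative words. The data frame `turns` is made up for illustration, and this is not claimed to be how gantt_plot computes its values.

```r
# Made-up turns of talk, in the order they were spoken
turns <- data.frame(
  person   = c("LEHRER", "OBAMA", "ROMNEY", "OBAMA"),
  dialogue = c("Good evening and welcome",
               "Thank you very much Jim for this opportunity",
               "Thank you Jim",
               "Well let me respond"),
  stringsAsFactors = FALSE
)

# Words per turn, then running totals to place each turn on a word axis
n_words     <- lengths(strsplit(trimws(turns$dialogue), "\\s+"))
turns$end   <- cumsum(n_words)
turns$start <- turns$end - n_words

turns[, c("person", "start", "end")]
```

Drawing one horizontal segment per row from `start` to `end`, with one row of the plot per person (for example with ggplot2's geom_segment), gives the Gantt-style picture.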

Visualization of the Gantt Plot
Gantt Plot

Formality scores (how formal a person’s language is)
This concept comes from:

Heylighen, F., & Dewaele, J.-M. (2002). Variation in the 
    contextuality of language: An empirical measure. Foundations 
    of Science, 7(3), 293–340. doi:10.1023/A:1019661126744

The code can be run in parallel because this is a slower function. It uses openNLP to first map parts of speech for every word.

#parallel about 1:20 on 8 GB ram 8 core i7 machine
v1 <- with(dat2, formality(dialogue, person, parallel=TRUE))
#about 4 minutes on 8GB ram i7 machine
v2 <- with(dat2, formality(dialogue, person)) 
# note you can resupply the output from formality back
# to formality and change arguments.  This avoids the need for
# openNLP, saving time.
v3 <- with(dat2, formality(v1, person))
plot(v3, bar.colors=c("Dark2"))

Output and plot from the formality function

  person word.count formality
1 ROMNEY       4068     61.82
2 LEHRER        765     61.31
3  OBAMA       3595     58.30
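
For context on what the score means, the measure in Heylighen & Dewaele (2002) is F = (noun + adjective + preposition + article − pronoun − verb − adverb − interjection + 100) / 2, where each term is that part of speech as a percentage of all words. Below is a minimal base R sketch of that arithmetic. The counts are invented, `formality_score` is a hypothetical helper, and qdap's own function may handle tagging and the word total differently.

```r
# Hypothetical helper implementing the published F-measure arithmetic
# counts:  named vector of part-of-speech counts
# n_total: total number of words in the text (including untagged categories)
formality_score <- function(counts, n_total) {
  formal     <- c("noun", "adjective", "preposition", "article")
  contextual <- c("pronoun", "verb", "adverb", "interjection")
  pct <- 100 * counts / n_total           # each category as a percentage
  (sum(pct[formal]) - sum(pct[contextual]) + 100) / 2
}

# Invented counts for a 100-word text (the remaining 10 words are
# conjunctions and the like, which do not enter the formula)
pos_counts <- c(noun = 28, adjective = 9, preposition = 14, article = 9,
                pronoun = 9, verb = 14, adverb = 6, interjection = 1)

formality_score(pos_counts, n_total = 100)
```

The score runs from 0 to 100, and higher values mean more noun-heavy, less context-dependent language.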


Afterthought: I neglected to mention that the word clouds are proportional (argument proportional = TRUE) to all words spoken rather than to each person’s own word frequencies. This enables comparison across clouds.


About tylerrinker

Data Scientist, open-source developer, #rstats enthusiast, #dataviz geek, and #nlp buff
This entry was posted in ggplot2, qdap, word cloud.

33 Responses to Presidential Debates with qdap-beta

  1. xingmowang says:

    Just wondering what the picture would look like if you took out “so”, “going”, “am”, “is”, “area” and so on.

    Quick try on this

  2. tylerrinker says:

    Good question. This post is meant for demo purposes (i.e. it’s very raw), so I did not do what you’ve suggested (with stopwords). Discourse analysis is more complex than arbitrarily removing words because they appear frequently. Removing them may be the right decision, but it must be made by the researcher based on the research question. So I’d say go ahead and play with it, but doing so may change conclusions. In other words, don’t assume a frequent word is unimportant; “so” may indicate justification that needs further exploration, or emphasis on something weak that may also need further analysis. I think you’d want to play with multiple different parameters to get a picture.

    • Manuel Gonzalez Canche says:

      Wonderful post and amazing reply. You are a VERY talented researcher. I think that only bright things are coming to you! Congratulations on trying to improve your field!!! Simply amazing!

  3. tylerrinker says:

    @Manuel Thank you for your kind and encouraging words. 🙂

  4. tylerrinker says:

    If anyone is having difficulty downloading qdap please let us know. Or if you found out something that had to be done in order to install on your operating system please share that as well.

  5. anon says:

    What is the difference between this and text mining?

    • tylerrinker says:

      The intent. Text mining and discourse analysis may use similar approaches and methods but the purpose is very different. That being said there would be considerable overlap.

      • anon says:

        Thanks. Can you tell us more about the difference in intent? Maybe this is another blog post 🙂

  6. tylerrinker says:

    Generally text mining uses large quantities of unstructured text (admittedly I know little about text mining), whereas discourse analysis applied to a transcript uses the structure of the dialogue. There’s a back and forth (turns of talk). As far as I know this back and forth, if it exists at all, isn’t the focus of someone performing text mining. I’ve seen discourse analysis used as a synonym for text mining, but in my field discourse analysis has a specific history and wouldn’t be used in this general sense. This is the short answer; the long answer is worth further exploration if you’re interested. I’d appreciate a text miner describing their purposes, as this is not my expertise.

  7. tylerrinker says:

    I performed some of the same analysis for the vice presidential debates as well in a script found:

  8. tylerrinker says:

    That’s an interesting visualization I may replicate with ggplot2, though I’d probably opt for a different presentation than bubbles. I’d probably use something from qdap’s termco family to get the counts for this. I’m not sure if the visual you showed was individual words or themes; termco can do either. The output from termco functions is usually a list of several data frames. The use of termco in the linked script displays raw uses by each person and their percent use relative to word counts (two-word phrases may throw this off slightly). In any event hope this is useful/interesting to you:

    • Manuel says:

      I really hope you can replicate/improve it using themes as opposed to words! Thank you

      • tylerrinker says:

        I’d also not use bubbles but bars (similar to the formality visualization above) and print the raw and/or prop for each with total theme/word use on each bar and sort descending order for # of uses. This is much easier to visualize the data though it’s not as sleek as bubbles.

  9. tylerrinker says:

    Interestingly, the vice president only mentioned his boss by name one time the entire debate (generally he refers to him as “The President”). Ryan mentioned his 21 times.

  10. tylerrinker says:

    Just wanted to say there was a bug in qdap (you couldn’t supply the same term to two different word lists) that I eliminated in termco.a. It also now retains the column order for the order that the match.list was supplied.

    • Manuel Gonzalez Canche says:

      Hi again @tylerrinker, I have a question about the function: is there a way to get the word frequencies of the inputs as data.frames? You know, as when using:
      ap.tdm <- TermDocumentMatrix(txt)
      ap.m <- as.matrix(ap.tdm)
      ap.v <- sort(rowSums(ap.m),decreasing=TRUE)
      ap.d <- data.frame(word = names(ap.v),freq=ap.v)

      thank you in advance!!!

      • tylerrinker says:

        No. But it operates on another qdap function called word_list that does what I believe you’re after. Unfortunately, I am unfamiliar with how the tm package operates with any specificity (though it certainly inspired qdap) so I’m not certain it is what you’re after. So in the script above you could do: with(dat, word_list(dialogue, person))$fwl . There are many different word lists that this function produces. It prints a truncated word list but the actual function returns a list of different word lists.

        Also note I’ve been improving qdap so please download the latest version.

      • Manuel Gonzalez Canche says:

        You are a genius!!! You gave me the answer already 🙂
        I am looking to get my themes very clean; a very smart person said, and I quote, that these themes are only useful to the extent that the researcher is able to select the words within each of them.

        Given that with(txtEX, word_list(dialogue, Old))$fwl returns a list with two elements, we can retrieve the first and second elements as dataframes as follows:
        phudcfily1<, word_list(dialogue, Old))$fwl[1])
        phudcfily2<, word_list(dialogue, Old))$fwl[2])
        Then we can use the first column to merge the two dataframes so that we can compare them:
        phudcfily<-merge(phudcfily1,phudcfily2, by.x="Speeches_07.10.WORD", by.y="Speeches_84.06.WORD", all=T) #all=T ensures that all words are accounted for even without a match.

        Thank you @tylerrinker
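
        For anyone following along without the same data, the merge idea can be tried in base R alone. This is only a sketch with made-up text; `word_freq` is a hypothetical helper, not a qdap or tm function, and its WORD/FREQ column names are chosen just for this example.

```r
# Hypothetical helper: word frequencies of a character vector as a data frame
word_freq <- function(text) {
  clean <- tolower(gsub("[^[:alpha:][:space:]']", "", text))
  words <- unlist(strsplit(clean, "\\s+"))
  words <- words[nzchar(words)]
  tab   <- sort(table(words), decreasing = TRUE)
  data.frame(WORD = names(tab), FREQ = as.vector(tab),
             stringsAsFactors = FALSE)
}

# Two made-up groups of speeches
a <- word_freq(c("jobs and taxes", "jobs jobs"))
b <- word_freq("taxes and health care")

# all = TRUE keeps words that appear in only one group (NA in the other)
both <- merge(a, b, by = "WORD", all = TRUE, suffixes = c(".a", ".b"))
both
```

        Words used by only one group show NA in the other group's FREQ column, which makes them easy to spot when cleaning themes.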

  11. Pingback: Vice Presidential Debates with qdap-beta | TRinker's R Blog

  12. Manuel Gonzalez Canche says:

    I apologize for my broken English, it is getting late!!! Thanks again!

  13. tylerrinker says:

    @Manuel It seems you’re using qdap in depth, and it makes me excited to see people begin to use it. I apologize for the very underdeveloped documentation. I’m working on it in my spare time. One change to be aware of (from the code above, though it’s now been altered) is this change that occurred in naming conventions:

    • Manuel Gonzalez Canche says:

      Please, do not apologize, you are doing a superb job! Thank you for the heads up. And yes, I am very excited about qdap!

  14. Pingback: Presidential Debates 2012 | TRinker's R Blog

  15. Alex Kumenius says:

    I got captivated by R. And when I read your article at R-bloggers I can just say:
    Good Job!!

  16. Pingback: Gradient Word Clouds | TRinker's R Blog

  17. Thank you for such an insightful post. I stumbled upon this post of yours while researching quantitative discourse analysis. I am in the process of writing my dissertation, which involves the analysis of texts and documents, and this post of yours, to me, shows how quantitative analysis may complement qualitative analysis in a multi-method/mixed research design. I look forward to trying out the package you are developing in R.


    ~ Jal

  18. Pingback: Coding Advocacy: Visualizing Supreme Court Arguments and Formality | Patrick Ellis

  19. Pingback: Presidential Debates with qdap-beta | ProsoDis
