Vice Presidential Debates with qdap-beta

After the presidential debates I used the beta version of qdap to provide some initial surface-level analysis (LINK to Presidential Debates with qdap-beta). In the comments of that post, annon (a commenter) provided a link to an analysis/visualization that uses bubbles to show the proportion of words, and colors and labels to show each candidate’s usage (LINK). I initially liked the graphic, but it was the shape and colors that appealed to me. Closer inspection reveals that it is hard to get information for the smaller words, and the bubbles make comparing across words difficult. I decided to attempt a visualization for the vice presidential debates using qdap and ggplot2.

I decided to use themes rather than individual words, grouping similar words together. This approach uses a qdap function called termco. Here are the function’s arguments:

termco(text.var, grouping.var=NULL, match.list, short.term = FALSE, = TRUE, lazy.term = TRUE, elim.old = TRUE, 
    zero.replace = 0, output = "percent", digits = 2)

Basically, you supply this function with a list of named character vectors (our themes), the dialogue (the debate text) and a grouping variable (person), and it outputs a list with several data frames. You can get raw counts, percents/proportions, or a combination of raw and percent/proportion by grouping variable (person) for each theme.
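
To make the idea concrete, here is a minimal base R sketch of the kind of theme-by-person table described above. It is not qdap's implementation: the toy data, the count_theme helper, and the choice to express percent relative to each person's word count are all assumptions made for illustration.

```r
# Hypothetical toy data; speakers and text are made up for illustration
toy <- data.frame(
  person   = c("A", "B", "A", "B"),
  dialogue = c("jobs and the economy", "health insurance matters",
               "more jobs", "the economy and health"),
  stringsAsFactors = FALSE
)

# Themes as a named list of character vectors (same shape as match.list)
themes <- list(economic = c("jobs", "econom"),
               health   = c("health", "insurance"))

# Hypothetical helper: total fixed-string hits of all terms in one string
count_theme <- function(terms, text) {
  sum(vapply(terms, function(term) {
    m <- gregexpr(term, text, fixed = TRUE)[[1]]
    if (m[1] == -1L) 0L else length(m)
  }, integer(1)))
}

# Collapse each person's dialogue into one string
by_person <- vapply(split(toy$dialogue, toy$person),
                    paste, character(1), collapse = " ")

# Raw counts: one row per person, one column per theme
raw <- sapply(themes, function(terms) {
  vapply(by_person, count_theme, integer(1), terms = terms)
})

# Percent of each person's words (assumed denominator: word count)
words <- lengths(strsplit(by_person, " ", fixed = TRUE))
pct <- round(100 * raw / words, 2)

raw
pct
```

The raw matrix and the pct matrix play the roles of the raw and percent data frames mentioned above; a real analysis should rely on the output of termco itself.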

The important part is the themes we supply to match.list. This function relies on gregexpr, meaning it will do partial matching, so there are some things you’ll want to think about when supplying the themes:

  1. If you want to find “read” but not “bread” or “reading”, use a leading and trailing white space, as in " read "
  2. If you want to find any word that begins with the root “read”, use only a leading white space, as in " read"
  3. This will also find “ready”, so if you want only forms of the word “read” you’ll have to be explicit and put all these forms in the vector for read with leading and trailing white spaces; i.e., " read ", " reads ", " reader" (reader and readers), " reading "
  4. If you use " obama" and " obamacare", termco will count obamacare two times; instead use " obama " and " obamacare ", or just " obama"
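
The effect of leading and trailing white space can be checked directly with gregexpr in base R. This is only a sketch of the matching behavior: the count_hits helper and the sample sentences are made up, fixed = TRUE is my assumption rather than a documented termco setting, and the sample text is padded with a space at each end so that the first and last words can match space-padded terms.

```r
# Hypothetical helper: number of fixed-string matches of pattern in text
count_hits <- function(pattern, text) {
  m <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  if (m[1] == -1L) 0L else length(m)
}

# Made-up sample text, padded with a space at both ends
txt <- " i read the reader a ready made bread recipe while reading "

count_hits("read", txt)    # 5: read, reader, ready, bread, reading
count_hits(" read", txt)   # 4: read, reader, ready, reading (not bread)
count_hits(" read ", txt)  # 1: only the whole word read

# The double counting from item 4
txt2 <- " obamacare is what obama signed "
count_hits(" obama", txt2)       # 2: obamacare and obama
count_hits(" obamacare", txt2)   # 1: so obamacare is counted twice overall
count_hits(" obama ", txt2)      # 1: only the whole word obama
```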

The basic form for the list of vectors supplied to match.list is:

target.words <- list(
    theme_1 = c(),
    theme_2 = c(),
    theme_n = c()
)

Let’s look at the results with some themes I examined for the VP debates:


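library(qdap)  #load qdap (provides the functions used below)

url_dl("vpres.deb1.docx")  #downloads a docx file of the debate to wd

dat <- read.transcript("vpres.deb1.docx", col.names=c("person", "dialogue"))
dat$dialogue <- qprep(dat$dialogue)  #quick cleaning of the dialogue text
dat2 <- sentSplit(dat, "dialogue")  #split the dialogue into sentences
htruncdf(dat2)   #view a truncated version of the data (see also truncdf)
dat2$person <- factor(Trim(dat2$person))  #trim white space from the names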

#the themes we're looking at (termco is only as good as the researcher who supplied these themes)
tw2 <- list(health=c(" health", " insurance", " medic", "obamacare", " hospital", " doctor"), 
        economic = c(" econom", " jobs", " unemploy", " business", " banks", " mortgage",
            " budget", " market", " paycheck", " wall street"),
        foreign = c(" war ", " terror", " foreign", "iran", "iraq", "sanctions", "nuclear", 
            "al qaida", "libya", "netanyahu", "israel", "africa", "afgha", " embassy", "russia"),
        democratic_people = c("the president", " obama ", " obamas", " obama's", "biden", 
            "the vice president", "mister vice president"),
        republican_people = c("my friend", " ryan", "romney"),
        obama_any_name = c("obama ", "obamas", "obama's", "the president"),
        "romney",  #you don't have to name a vector of length 1
        obama_by_name = c("obama ", "obamas", "obama's"))

(a <- with(dat2, termco(dialogue, person, tw2, short.term = TRUE)))

names(a)  #see what else is in the termco object
a$raw  #raw numbers of use
a$prop  #proportions or percentages of use
a$rnp  #default print for termco

For a txt version of the data frame that termco produces click here

Creating the graphic of the themes via ggplot2

library(reshape2)  #for melt
library(ggplot2)

dat3 <- melt(a$raw[-2,], id=qcv(person, word.count)) #drop the moderator
dat3$labs <- melt(a$rnp[-2,], id=qcv(person, word.count))[, 4]  #bar labels
#order the themes by their largest proportion
dat3$variable <- factor(dat3$variable, levels=names(sort(apply(a$prop[-2, -c(1:2)], 2, max))))
#label positions and colors (a few adjusted by hand)
dat3$loc <- dat3$value - 6.5; dat3$loc[15] <- 7; dat3$loc[6] <- 65.75
dat3$cols <- rep("white", 16); dat3$cols[1] <- "black"

ggplot(dat3, aes(x=variable,  y=value, fill=person)) + 
    geom_bar(position="dodge", stat="identity")  +
    coord_flip() + theme_bw() + 
    theme(legend.position=c(.91, 0.07), legend.background = element_rect(color="grey60"),
        panel.grid.major=element_blank(),panel.grid.minor=element_blank()) +
    ylab("Occurrences") +
    xlab("Theme") +
    scale_fill_manual(values=c("#0000FF", "#FF0000"),
        name="Candidate", guide = guide_legend(reverse=TRUE)) +
    geom_text(aes(label = labs,  y = loc, x = variable),
              size = 5, position = position_dodge(width=0.9), color=dat3$cols)  + 
    scale_y_continuous(expand = c(0, 0), breaks=seq(0,80,20))

The graphic
[Figure: vp themes]

For a pdf version of the output click here

Discussion of the results
At first I ran a search to see who used the name Obama the most, and I saw that Vice President Biden used the name only once. I initially concluded (wrongly) that he was focused on himself; after all, the point of the vice presidential debates is to sell your boss as the winner. I did more inspection of the terminology (via word clouds) and found that Biden refers to President Obama as “The President”. This must be an inner-circle sign of respect that’s so ingrained in the Vice President that using “Mr. Obama” or “President Obama” just doesn’t happen for him.

I also noticed Ryan pushed the economic theme hard. Vice President Biden discussed the opposition quite a bit as well.

This was a quick and dirty demo. I didn’t put a tremendous amount of thought into the themes; the aim was to demonstrate how qdap can help the researcher represent themes numerically and visually.


About tylerrinker

Data Scientist, open-source developer, #rstats enthusiast, #dataviz geek, and #nlp buff

5 Responses to Vice Presidential Debates with qdap-beta

  1. Manuel Gonzalez Canche says:

    Wonderful, THANK YOU SO MUCH!
    Tylerrinker, have you ever seen an article published in a peer reviewed journal that used word clouds? I am in the social sciences and although I see great value in their use, I am not sure reviewers will accept them.
    I think that your graph is once again superb, but it may mask some words due to the categorization; that is why some good raw descriptive stats are always useful (i.e., a word cloud). I am saying this because I simply loved the way you said that Vice President Biden did mention President Obama, but not by name, and that you got that by looking at VP Biden’s word cloud. Simply AMAZING!
    And thank you again!

    • tylerrinker says:

      @Manuel Thank you for your feedback and question. I personally have never seen a word cloud used in a peer reviewed article. I think this is for several reasons: (1) they take up space, (2) they’re not comparable across clouds (as most clouds work on frequency), and (3) color has not been used to represent anything meaningful, and even if it had been, journals tend to print only in black and white unless you pay extra. qdap’s function addresses two of these concerns in that it can be set to proportional (allowing for comparison across clouds) and can also represent colors in a meaningful way (i.e., themes etc.). Ian Fellows has a terrific package (wordcloud; qdap’s trans_cloud relies on wordcloud) that really opens some doors for word cloud use. I personally think that word clouds are useful to a reader (beyond researcher use) in that they have the potential to display a lot of information in a small space without the loss of words. The use of colors can highlight what the researcher is discussing in the write-up. If we want to see word clouds in journals I think we can do several things: (1) select an online journal (where color is not a problem), (2) use word clouds to their fullest (colors and proportional comparison across clouds) and continue to develop methods for them, and (3) be bold and use them if they help you (the researcher) tell the story to the reader.

  2. Pingback: Momento R do dia « De Gustibus Non Est Disputandum

  3. tylerrinker says:

    Again, if anyone is having problems getting qdap to install, please open an issue on qdap’s webpage. If you have difficulty but discover how to install, please share that experience here so others may use the information.

  4. Pingback: Presidential Debates 2012 | TRinker's R Blog
