Shape File Selfies in ggplot2

In this post you will learn how to:

  1. Create your own quasi-shape file
  2. Plot your homemade quasi-shape file in ggplot2
  3. Add an external svg/ps graphic to a plot
  4. Change a grid grob's color and alpha

*Note: get the simple .md version here


Background (see just the code if you don't care much about the process)

I started my journey wanting to replicate a graphic called a space manikin by McNeill (2005) and fill areas in that graphic like a choropleth. I won't share the image from McNeill's book as it's his intellectual property, but know that the graphic comes from a book on gesture that divides the body up into zones (p. 275). To get a sense of what the manikin looks like, here is the ggplot2 version of it:

Figure 1: ggplot2 Version of McNeill’s (2005) Space Manikin

While this is a map of areas of a body, you can see how this could be extended to any number of spatial tasks, such as mapping the layout of a room.


1. Creating a Quasi-Shape File

So I figured “zones”: that's about like states on a map. I have toyed with choropleth maps of the US in the past and figured I'd generalize this learning. The difference is I'd have to make the shape file myself, as the maps package doesn't seem to have McNeill’s space manikin.

Let's look at what ggplot2 needs from the maps package:

library(maps); library(ggplot2)
head(map_data("state"))
##     long   lat group order  region subregion
## 1 -87.46 30.39     1     1 alabama      <NA>
## 2 -87.48 30.37     1     2 alabama      <NA>
## 3 -87.53 30.37     1     3 alabama      <NA>
## 4 -87.53 30.33     1     4 alabama      <NA>
## 5 -87.57 30.33     1     5 alabama      <NA>
## 6 -87.59 30.33     1     6 alabama      <NA>

Hmm: coordinates, names of regions, and an order in which to connect the coordinates. I figured I could handle that. I don't 100% know what a shape file is, mostly that it’s a file that makes shapes. What we're making may or may not technically be a shape file, but know that we're going to map shapes in ggplot2 (I use the quasi to avoid the wrath of those who do know precisely what a shape file is).

I needed to make the zones around an image of a person, so I first grabbed a free png silhouette from: http://www.flaticon.com/free-icon/standing-frontal-man-silhouette_10633. I then knew I'd need to add some lines and figure out the coordinates of the outlines of each cell. So I read the raster image into R, plotted it in ggplot2, and added lots of grid lines for good measure. Here's what I wound up with:

library(png); library(grid); library(qdap)

## Download the silhouette and read it in as a raster grob
url_dl(url="http://i.imgur.com/eZ76jcu.png")
file.rename("eZ76jcu.png", "body.png")
img <- rasterGrob(readPNG("body.png"), 0, 0, 1, 1, just=c("left","bottom"))

## Plot the image on the unit square with fine grid lines for reference
ggplot(data.frame(x=c(0, 1), y=c(0, 1)), aes(x=x, y=y)) + 
    geom_point() +
    annotation_custom(img, 0, 1, 0, 1) + 
    scale_x_continuous(breaks=seq(0, 1, by=.05)) + 
    scale_y_continuous(breaks=seq(0, 1, by=.05)) + 
    theme_bw() +
    theme(axis.text.x=element_text(angle = 90, hjust = 0, vjust=0))


Figure 2: Silhouette from ggplot2 With Grid Lines


1b. Dirty Deeds Done Cheap

I needed to get reference lines on the plot so I could begin recording coordinates. There's likely a better process, but this is how I approached it and it worked. I exported the ggplot in Figure 2 into (GASP) Microsoft Word (I may have just lost a few die-hard command line folks). I added lines there and figured out the coordinates of the lines. It looked something like this:

Figure 3: Silhouette from ggplot2 with MS Word Augmented Border Lines

After that I began the tedious task of figuring out the corners of each of the shapes (“zones”) in the space manikin. Using Figure 3 and a list structure in R, I mapped each of the corners, the approximate shape centers, and the order in which to plot the coordinates for each shape. This is the code for the corners:

library(qdap)
dat <- list(
    `01`=data.frame(x=c(.4, .4, .6, .6), y=c(.67, .525, .525, .67)),
    `02`=data.frame(x=c(.35, .4, .6, .65), y=c(.75, .67, .67, .75)),
    `03`=data.frame(x=c(.6, .65, .65, .6), y=c(.525, .475, .75, .67)),
    `04`=data.frame(x=c(.4, .35, .65, .6), y=c(.525, .475, .475, .525)),
    `05`=data.frame(x=c(.35, .35, .4, .4), y=c(.75, .475, .525, .67)),
    `06`=data.frame(x=c(.4, .4, .6, .6), y=c(.87, .75, .75, .87)),
    `07`=data.frame(x=c(.6, .6, .65, .65, .73, .73), y=c(.87, .75, .75, .67, .67, .87)),
    `08`=data.frame(x=c(.65, .65, .73, .73), y=c(.67, .525, .525, .67)),
    `09`=data.frame(x=c(.6, .6, .73, .73, .65, .65), y=c(.475, .28, .28, .525, .525, .475)),
    `10`=data.frame(x=c(.4, .4, .6, .6), y=c(.475, .28, .28, .475)),
    `11`=data.frame(x=c(.27, .27, .4, .4, .35, .35), y=c(.525, .28, .28, .475, .475, .525)),
    `12`=data.frame(x=c(.27, .27, .35, .35), y=c(.67, .525, .525, .67)),
    `13`=data.frame(x=c(.27, .27, .35, .35, .4, .4), y=c(.87, .67, .67, .75, .75, .87)),
    `14`=data.frame(x=c(.35, .35, .65, .65), y=c(1, .87, .87, 1)),
    `15`=data.frame(x=c(.65, .65, .73, .73, 1, 1), y=c(1, .87, .87, .75, .75, 1)),
    `16`=data.frame(x=c(.73, .73, 1, 1), y=c(.75, .475, .475, .75)),
    `17`=data.frame(x=c(.65, .65, 1, 1, .73, .73), y=c(.28, 0, 0, .475, .475, .28)),
    `18`=data.frame(x=c(.35, .35, .65, .65), y=c(.28, 0, 0, .28)),
    `19`=data.frame(x=c(0, 0, .35, .35, .27, .27), y=c(.475, 0, 0, .28, .28, .475)),
    `20`=data.frame(x=c(0, 0, .27, .27), y=c(.75, .475, .475, .75)),
    `21`=data.frame(x=c(0, 0, .27, .27, .35, .35), y=c(1, .75, .75, .87, .87, 1))
)

dat <- lapply(dat, function(x) {
    x$order <- 1:nrow(x)
    x
})

space.manikin.shape <- list_df2df(dat, "id")[, c(2, 3, 1, 4)]

And the code for the centers:

centers <- data.frame(
    id = unique(space.manikin.shape$id),
    center.x=c(.5, .5, .625, .5, .375, .5, .66, .69, .66, .5, .34, .31, 
        .34, .5, .79, .815, .79, .5, .16, .135, .16),
    center.y=c(.597, .71, .5975, .5, .5975, .82, .81, .5975, .39, .3775, .39, 
        .5975, .81, .935, .89, .6025, .19, .14, .19, .6025, .89)
)
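
A quick sanity check (a sketch; output not shown) confirms the quasi-shape file parallels the map_data layout of coordinates, an id, and a plotting order:

## Columns play the same roles as map_data's long/lat/region/order
head(space.manikin.shape)

## One center per zone (21 zones)
nrow(centers)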

There you have it folks: your very own quasi-shape file. Celebrate the fruits of your labor by plotting that bad Oscar.


2. Plot Your Homemade Quasi-Shape File

ggplot(centers) + annotation_custom(img, 0, 1, 0, 1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, label = id), color="grey60") 


Figure 4: Plotting the Quasi-Shape File and a Raster Image

Then I thought I might want to tone down the color of the silhouette a bit so I could plot geoms on top without distraction. Here's that attempt.

img[["raster"]][img[["raster"]] == "#0E0F0FFF"] <- "#E7E7E7"

ggplot(centers) + annotation_custom(img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, label = id), color="grey60") 


Figure 5: Altered Raster Image Color
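
If you're wondering how I knew the silhouette's color was #0E0F0FFF, one quick way is to tabulate the colors in the raster (a rough sketch; the dominant non-transparent color is the silhouette):

## Tally the raster's colors; the silhouette's hex value dominates
## the non-transparent counts
head(sort(table(as.vector(img[["raster"]])), decreasing = TRUE))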


3. Add an External svg/ps

I realized quickly that a raster was messy. I read up a bit on them in the R Journal (click here). In the process of reading and fooling around with Picasa, I turned my original silhouette (body.png) blue and couldn't fix him. I headed back to http://www.flaticon.com/free-icon/standing-frontal-man-silhouette_10633 to download another. In doing so I saw you could download an svg file of the silhouette. I thought maybe this would be less messy and easier to change colors. This led to a Google search and finding the grImport package after seeing this listserve post. And then I saw an article from Paul Murrell (2009) and figured I could turn the svg (I didn't realize what svg was until I opened it in Notepad++) into a ps file, read it into R, and convert it to a flexible grid grob.

There are probably numerous ways to convert an svg to a ps file, but I chose a cloud convert service. After that I read the file in with grImport per the Paul Murrell (2009) article. You're going to have to download the ps file HERE and move it to your working directory.

browseURL("https://github.com/trinker/space_manikin/raw/master/images/being.ps")
## Move that file from your downloads to your working directory.
## Sorry I don't know how to automate this.
library(grImport)

## Convert to xml
PostScriptTrace("being.ps")

## Read back in and convert to a grob
being_img <- pictureGrob(readPicture("being.ps.xml"))

## Plot it
ggplot(centers) + annotation_custom(being_img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, 
        colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="grey60") 


Figure 6: Quasi-Shape File with Grob Image Rather than Raster


4. Change a grid Grob's Color and Alpha

Now that we have a flexible grob, we can mess around with colors and alpha to our heart's content.

str is our friend for figuring out where and how to mess with the grob (str(being_img)). That leads me to the following changes to the image to adjust color and/or alpha (transparency).
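
For instance, a trimmed str call keeps the output navigable while you locate the gpar slots (a sketch; the exact structure depends on your grImport version):

str(being_img, max.level = 4, list.len = 4)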

being_img[["children"]][[1]][[c("gp", "fill")]] <- 
  being_img[["children"]][[2]][[c("gp", "fill")]] <- "black"

being_img[["children"]][[1]][[c("gp", "alpha")]] <- 
  being_img[["children"]][[2]][[c("gp", "alpha")]] <- .2

## Plot it
ggplot(centers) + annotation_custom(being_img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, 
        colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="grey60") 


Figure 7: Quasi-Shape File with Grob Image Alpha = .2


Let's Have Some Fun

Let's make it into a choropleth and a density plot. We'll make some fake fill values to fill with.

set.seed(10)
centers[, "Frequency"] <- rnorm(nrow(centers))

being_img[["children"]][[1]][[c("gp", "alpha")]] <- 
  being_img[["children"]][[2]][[c("gp", "alpha")]] <- .25

ggplot(centers, aes(fill=Frequency)) +
    geom_map(aes(map_id = id), map = space.manikin.shape, 
        colour="black") +
    scale_fill_gradient2(high="red", low="blue") +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="black") + 
    annotation_custom(being_img,0,1,0,1) 


Figure 8: Quasi-Shape File as a Choropleth

set.seed(10)
centers[, "Frequency2"] <- sample(seq(10, 150, by=20, ), nrow(centers), TRUE)

centers2 <- centers[rep(1:nrow(centers), centers[, "Frequency2"]), ]

ggplot(centers2) +
#       geom_map(aes(map_id = id), map = space.manikin.shape, 
#       colour="grey65", fill="white") +
    stat_density2d(data = centers2, 
        aes(x=center.x, y=center.y, alpha=..level.., 
        fill=..level..), size=2, bins=12, geom="polygon") + 
    scale_fill_gradient(low = "yellow", high = "red") +
    scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="black") + 
    annotation_custom(being_img,0,1,0,1) +
    geom_density2d(data = centers2, aes(x=center.x, 
        y=center.y), colour="black", bins=8, show_guide=FALSE) 


Figure 9: Quasi-Shape File as a Density Plot

Good times were had by all.


Created using the reports (Rinker, 2013) package

Get the .Rmd file here


References

McNeill, D. (2005). Gesture and thought. Chicago: University of Chicago Press.

Murrell, P. (2009). Importing vector graphics: The grImport package for R. Journal of Statistical Software, 30(4), 1-37.

Rinker, T. W. (2013). reports: Package to assist in report writing. University at Buffalo. https://github.com/trinker/reports

qdap 1.3.1 Release: Demoing Dispersion Plots, Sentiment Analysis, Easy Hash Lookups, Boolean Searches and More…

We’re very pleased to announce the release of qdap 1.3.1


This is the latest installment of the qdap package available at CRAN. Several important updates have occurred since the 1.1.0 release, most notably the addition of two vignettes and some generic view methods.

The new vignettes include:

  1. An Introduction to qdap
  2. qdap-tm Package Compatibility

The former is a detailed HTML-based guide covering the intended use of qdap functions.  The second vignette explains how to move between qdap and tm package forms, as qdap moves to be more compatible with this seminal R text mining package.

To install use:

install.packages("qdap")

Some of the changes in versions 1.2.0-1.3.1 include:


Generic Methods

  • scores generic method added to view scores from select qdap objects.
  • counts generic method added to view counts from select qdap objects.
  • proportions generic method added to view proportions from select qdap objects.
  • preprocessed generic method added to view preprocessed data from select qdap objects.

These methods allow the user to grab particular parts of qdap objects in a consistent fashion.  The majority of these methods pick up a corresponding plot method as well.  This adds to the qdap philosophy that data results should be easy to grab and easy to visualize. For instance:

(x <- question_type(DATA.SPLIT$state, DATA.SPLIT$person))

## methods
scores(x)
plot(scores(x))
counts(x)
plot(counts(x))
proportions(x)
plot(proportions(x))
truncdf(preprocessed(x), 15)
plot(preprocessed(x))

Demoing Some of the New Features

We’d like to take the time to highlight some of the development that has happened in qdap in the past several months:

Dispersion Plots

 wrds <- freq_terms(pres_debates2012$dialogue, stopwords = Top200Words)

## Add leading/trailing spaces if desired
wrds2 <- spaste(wrds)

## Use `~~` to maintain spaces
wrds2 <- c(" governor~~romney ", wrds2[-c(3, 12)])

## Plot
with(pres_debates2012 , dispersion_plot(dialogue, wrds2, rm.vars = time, 
    color="black", bg.color="white")) 

 with(rajSPLIT, dispersion_plot(dialogue, c("love", "night"),
    bg.color = "black", grouping.var = list(fam.aff, sex),
    color = "yellow", total.color = "white", horiz.color="grey20")) 

Word Correlation

 library(tm)
data("crude")
oil_cor1 <- apply_as_df(crude, word_cor, word = "oil", r=.7)
plot(oil_cor1) 

 oil_cor2 <- apply_as_df(crude, word_cor, word = qcv(texas, oil, money), r=.7)
plot(oil_cor2, ncol=2)
 

Easy Hash Table

A Small Example

 lookup(1:5, data.frame(1:4, 11:14))

## [1] 11 12 13 14 NA

## Leave alone elements w/o a match
lookup(1:5, data.frame(1:4, 11:14), missing = NULL) 

## [1] 11 12 13 14  5

Scaled Up: 3 Million Records

key <- data.frame(x=1:2, y=c("A", "B"))

##   x y
## 1 1 A
## 2 2 B

big.vec <- sample(1:2, 3000000, T)
out <- lookup(big.vec, key)
out[1:20]

## On my system 3 million records in:
## Time difference of 24.5534 secs
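
To reproduce a timing like that, a simple before/after check works (a sketch; results will vary by machine):

tic <- Sys.time()
out <- lookup(big.vec, key)
difftime(Sys.time(), tic)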

Binary Operator Version

 codes <- list(
    A = c(1, 2, 4), 
    B = c(3, 5),
    C = 7,
    D = c(6, 8:10)
)

1:12 %l% codes

##  [1] "A" "A" "B" "A" "B" "D" "C" "D" "D" "D" NA  NA 

1:12 %l+% codes

##  [1] "A"  "A"  "B"  "A"  "B"  "D"  "C"  "D"  "D"  "D"  "11" "12" 

Simple-Quick Boolean Searches

We’ll be demoing this capability on the qdap data set DATA:

 ##        person                                 state
## 1         sam         Computer is fun. Not too fun.
## 2        greg               No it's not, it's dumb.
## 3     teacher                    What should we do?
## 4         sam                  You liar, it stinks!
## 5        greg               I am telling the truth!
## 6       sally                How can we be certain?
## 7        greg                      There is no way.
## 8         sam                       I distrust you.
## 9       sally           What are you talking about?
## 10 researcher         Shall we move on?  Good then.
## 11       greg I'm hungry.  Let's eat.  You already? 

First a brief explanation from the documentation:

terms – A character string(s) to search for. The terms are arranged in a single string with AND (use AND or && to connect terms together) and OR (use OR or || to allow for searches of either set of terms). Spaces may be used to control what is searched for. For example, using ” I ” on c(“I’m”, “I want”, “in”) will result in FALSE TRUE FALSE, whereas “I” will match all three (if case is ignored).

Let’s see how it works. We’ll start with ” I ORliar&&stinks”. This will find sentences that contain ” I “ or that contain “liar” and the word “stinks”.

 boolean_search(DATA$state, " I ORliar&&stinks")

## The following elements meet the criteria:
## [1] 4 5 8

boolean_search(DATA$state, " I &&.", values=TRUE)

## The following elements meet the criteria:
## [1] "I distrust you."

boolean_search(DATA$state, " I OR.", values=TRUE)

## The following elements meet the criteria:
## [1] "Computer is fun. Not too fun."        
## [2] "No it's not, it's dumb."              
## [3] "I am telling the truth!"              
## [4] "There is no way."                     
## [5] "I distrust you."                      
## [6] "Shall we move on?  Good then."        
## [7] "I'm hungry.  Let's eat.  You already?"

boolean_search(DATA$state, " I &&.")

## The following elements meet the criteria:
## [1] 8 

Exclusion as Well

boolean_search(DATA$state, " I ||.", values=TRUE)

## The following elements meet the criteria:
## [1] "Computer is fun. Not too fun."        
## [2] "No it's not, it's dumb."              
## [3] "I am telling the truth!"              
## [4] "There is no way."                     
## [5] "I distrust you."                      
## [6] "Shall we move on?  Good then."        
## [7] "I'm hungry.  Let's eat.  You already?"

boolean_search(DATA$state, " I ||.", exclude = c("way", "truth"), values=TRUE)

## The following elements meet the criteria:
## [1] "Computer is fun. Not too fun."        
## [2] "No it's not, it's dumb."              
## [3] "I distrust you."                      
## [4] "Shall we move on?  Good then."        
## [5] "I'm hungry.  Let's eat.  You already?"  

Binary Operator Version

 dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"), y = 1:4)

##        x y
## 1  Doggy 1
## 2  Hello 2
## 3 Hi Dog 3
## 4  Zebra 4

z <- data.frame(z =c("Hello", "Dog"))

##       z
## 1 Hello
## 2   Dog

dat[dat$x %bs% paste(z$z, collapse = "OR"), ]  

##        x y
## 1  Doggy 1
## 2  Hello 2
## 3 Hi Dog 3

Polarity (Sentiment)

The polarity function is an extension of the work originally done by Jeffrey Breen, with some accompanying plotting methods. For more information see the Introduction to qdap Vignette.

 poldat2 <- with(mraja1spl, polarity(dialogue,
    list(sex, fam.aff, died)))
colsplit2df(scores(poldat2))[, 1:7] 
    sex fam.aff  died total.sentences total.words ave.polarity sd.polarity
1     f     cap FALSE             158        1810  0.076422846   0.2620359
2     f     cap  TRUE              24         221  0.042477906   0.2087159
3     f    mont  TRUE               4          29  0.079056942   0.3979112
4     m     cap FALSE              73         717  0.026496626   0.2558656
5     m     cap  TRUE              17         185 -0.159815603   0.3133931
6     m   escal FALSE               9         195 -0.152764808   0.3131176
7     m   escal  TRUE              27         646 -0.069421082   0.2556493
8     m    mont FALSE              70         952 -0.043809741   0.3837170
9     m    mont  TRUE             114        1273 -0.003653114   0.4090405
10    m    none FALSE               7          78  0.062243180   0.1067989
11 none    none FALSE               5          18 -0.281649658   0.4387579

The Accompanying Plotting Methods

plot(poldat2)

 plot(scores(poldat2))   

Question Type

 dat <- c("Kate's got no appetite doesn't she?",
    "Wanna tell Daddy what you did today?",
    "You helped getting out a book?", "umm hum?",
    "Do you know what it is?", "What do you want?",
    "Who's there?", "Whose?", "Why do you want it?",
    "Want some?", "Where did it go?", "Was it fun?")

left_just(preprocessed(question_type(dat))[, c(2, 6)])  
   raw.text                             q.type             
1  Kate's got no appetite doesn't she?  doesnt             
2  Wanna tell Daddy what you did today? what               
3  You helped getting out a book?       implied_do/does/did
4  Umm hum?                             unknown            
5  Do you know what it is?              do                 
6  What do you want?                    what               
7  Who's there?                         who                
8  Whose?                               whose              
9  Why do you want it?                  why                
10 Want some?                           unknown            
11 Where did it go?                     where              
12 Was it fun?                          was                
 x <- question_type(DATA.SPLIT$state, DATA.SPLIT$person)

scores(x)
      person tot.quest    what    how   shall implied_do/does/did
1       greg         1       0      0       0             1(100%)
2 researcher         1       0      0 1(100%)                   0
3      sally         2  1(50%) 1(50%)       0                   0
4    teacher         1 1(100%)      0       0                   0
5        sam         0       0      0       0                   0
plot(scores(x), high="orange")

 


These are a few of the more recent developments in qdap. We would encourage readers to dig into the new vignettes and start using qdap for various Natural Language Processing tasks. If you have suggestions or find a bug you are welcome to:

  • submit suggestions and bug-reports at: https://github.com/trinker/qdap/issues
  • send a pull request on: https://github.com/trinker/qdap

  • For a complete list of changes see qdap’s NEWS.md

    Development Version
    github


    qdap 1.1.0 Released on CRAN

    We’re very pleased to announce the release of qdap 1.1.0


    This is the fourth installment of the qdap package available at CRAN. Major development has taken place since the last CRAN update.

    The qdap package automates many of the tasks associated with quantitative discourse analysis of transcripts containing discourse, including frequency counts of sentence types, words, sentence, turns of talk, syllable counts and other assorted analysis tasks. The package provides parsing tools for preparing transcript data but may be useful for many other natural language processing tasks. Many functions enable the user to aggregate data by any number of grouping variables providing analysis and seamless integration with other R packages that undertake higher level analysis and visualization of text.

    This version is a major overhaul of the qdap package. The word lists and dictionaries in qdap have been moved to qdapDictionaries. Additionally, many functions have been renamed with underscores instead of the former period separators. These changes break backward compatibility, thus the major release (ver. 1.0.0). It is general practice to deprecate functions within a package before removal; however, the number of necessary changes, in light of qdap being relatively new to CRAN, made them sensible at this point.

    To install use:

    install.packages("qdap")

    Some of the changes in version 1.1.0 include:


    PACKAGE VIGNETTE

    qdap gains an HTML package vignette to better explain the intended workflow and function use for the package. This is not currently a part of the build but can be accessed via:

    http://htmlpreview.github.io/?https://github.com/trinker/qdap/blob/master/vignettes/qdap_vignette.html

    tm PACKAGE COMPATIBILITY

    qdap 1.1.0 attempts to gain compatibility with the tm package. This enables data structures from tm to be utilized with qdap functions and, conversely, qdap data structures to be utilized with functions intended for tm data sets. Some of the following changes have been made to gain tm compatibility:

    • tdm and dtm are now truly compatible with the tm package. tdm and dtm produce outputs of the class "TermDocumentMatrix" and "DocumentTermMatrix" respectively. This change (coupled with the renaming of stopwords to rm_stopwords) should make the two packages logical companions and further extend the qdap package to integrate with the many packages that already handle "TermDocumentMatrix" and "DocumentTermMatrix" (a short sketch follows this list).
    • tm2qdap a function to convert "TermDocumentMatrix" and "DocumentTermMatrix" to a wfm added to allow easier integration with the tm package.
    • apply_as_tm a function to allow functions intended to be used on the tm package’s TermDocumentMatrix to be applied to a wfm object.
    • tm_corpus2df and df2tm_corpus added to convert a tm package corpus to a dataframe for use in qdap or vice versa.
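
    Putting two of these together, a minimal sketch (function names as listed above; the exact call signatures are my assumption, so check the docs):

    library(qdap); library(tm)
    ## qdap text to a true tm TermDocumentMatrix
    x <- tdm(DATA$state, DATA$person)
    class(x)   ## "TermDocumentMatrix"
    ## ... and back toward qdap's word frequency matrix (wfm) form
    tm2qdap(x)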

    NEW FEATURES

    • hash_look (and %ha%) a counterpart to hash added to allow quick access to a hash table. Intended for use within functions or multiple uses of the same hash table, whereas lookup is intended for a single, external (non-function) use, which is more convenient though it could be slower.
    • word_cor added to find words within grouping variables that are associated based on correlation.
    • dispersion_plot added to enable viewing of word dispersion through discourse.
    • word_proximity added to complement the dispersion_plot and word_cor functions. word_proximity gives the average distance between words in the unit of sentences.
    • boolean_search, a Boolean term search function, added to allow for indexed searches of Boolean terms.
    • wfm now uses mtabulate and is ~10x faster.

    PLOTTING

    Several Plotting Functions have been added to qdap. Many functions pick up a corresponding plotting method as well.


    This version of qdap has seen some exciting changes. We look forward to continued development. In the future we plan to:

    • Further develop the new_report function to better incorporate the reports package and smooth workflow.
    • Incorporate the dplyr package to gain speed boosts in some of qdap’s functions.

    For a complete list of changes see qdap’s NEWS.md

    Development Version
    github


    Sochi Olympic Medals

    For those who are addicted to R and haven’t the time to click the mouse on a web browser, you too can still be informed about the results of the 2014 Sochi Winter Olympics. I was inspired by an SO response around the London games a couple years back.

    Packages to Load

    packs <- c("knitr", "ggplot2", "XML", "reshape2", "rCharts")
    lapply(packs, require, character.only = TRUE)
    

    The Script

    olympics <- 
    function(theurl = "http://www.sochi2014.com/en/medal-standings", include.zero = FALSE) {
     
        invisible(lapply(c('ggplot2', 'XML', 'reshape2') , require, character.only=TRUE))
     
        ## Grab Data, Clean and Reshape
        raw <- readHTMLTable(theurl, header=FALSE, 
            colClasses = c(rep("factor", 2), rep("numeric", 4)))
        raw <- as.data.frame(raw)[, -1]
        colnames(raw) <- c("Country", "Gold", "Silver", "Bronze", "Total")
        raw <- with(raw, raw[order(Total, Gold, Silver, Bronze), ])
        if (!include.zero) {
            raw <- raw[raw[, "Total"] != 0, ]
        }
        raw[, "Country"] <- factor(raw[, "Country"], levels = raw[, "Country"])
        rownames(raw) <- NULL
        mdat <- melt(raw, value.name = "Count", variable.name = "Place", id.var = "Country")
        mdat[, "Place"] <- factor(mdat[, "Place"], levels=c("Gold", "Silver", "Bronze", "Total"))  
        ## Plot the Data
        plot1 <- ggplot(mdat, aes(x = Count, y = Country, colour = Place)) +
          geom_point() +
          facet_grid(.~Place) + theme_bw()+
          scale_colour_manual(values=c("#CC6600", "#999999", "#FFCC33", "#000000")) 
        print(plot1)
     
        return(invisible(raw))
    }

    The Visual Results

    x <- olympics()


    As a Data Table

    dTable(x)$show('inline', include_assets = TRUE, cdn = TRUE)
    

    WordPress and Data Table don’t play well together so you’ll need to run the code to see it in action.

    Discussion

    I have chosen a dot plot to display the data because it’s easy to quickly get a sense of the data yet still be able to make comparisons. Dot plots take advantage of the powerful pre-attentive attribute of distance (the most powerful way to visually convey quantitative information). Stephen Few gives his two cents about dot plots here.

    I’m lazy but this would be a fun Shiny app to build. [EDIT @Ty Henkaline answers the call for a Shiny app and provides the ui.R and server.R]

    Thanks to @Ramnath for help implementing the chart as a jQuery DataTable.


    *Created using the reports package



    Sentence Drawing: Part II

    In a recent blog post I introduced Stefanie Posavec‘s Sentence Drawings. We created this ggplot2 rendition:

    We left off weighing the aesthetics of the Sentence Drawing against the qualities of good visualizations. I asked others to think of ways to display the information and also hinted that I’d use Yihui’s animation package to show the fluid nature of the conversation. Jim Vallandingham stepped up with a D3 rendering of the Sentence Drawing (though not implemented in R as we’ve grown accustomed to with rCharts) that gives a mouse-over of the dialogue (GitHub here). I too followed up with the animation version, as seen in the video outcome and accompanying script below.

    Click Here to view the html version.

    Mp4 Video: Musically enhanced.

    If you like what you see then have a lookyloo at the code below.


    Getting Started

    Installing Packages from GitHub and Turn Function

    # install.packages("devtools")
    library(devtools)
    install_github(c('slidify', 'slidifyLibraries'), 'ramnathv', ref = 'dev')
    install_github("knitcitations", "cboettig")
    install_github(c("reports", "qdapDictionaries", "qdap"), "trinker")
    install_github("ggthemes", "jrnold")
    install.packages('scales')
    
    invisible(lapply(c("qdap", "ggplot2", "ggthemes", "scales", "grid"), 
        require, character.only = TRUE))
    
    turn_it <- function(dataframe, len.col, turn = -pi/2) {
    
        dat <- dataframe
        dat[, "turn"] <- rep(turn, nrow(dataframe))
        dat <- within(dat, { 
            facing <- pi/2 + cumsum(turn)
            move <- dat[, len.col] * exp(1i * facing)
            position <- cumsum(move)
            x2 <- Re(position)
            y2 <- Im(position)
            x1 <- c(0, head(x2, -1))
            y1 <- c(0, head(y2, -1))
        })
    
        dat[, c("x1", "y1", "x2", "y2")] <- 
            lapply(dat[, c("x1", "y1", "x2", "y2")], round, digits=0)
        data.frame(dataframe, dat[, c("x1", "y1", "x2", "y2")])
    }
    

    The Animation Code

    library(animation)
    
    ## Prepping the data
    
    dat2b <- rajSPLIT
    dat2b$wc <- wc(rajSPLIT$dialogue)
    dat2b <- dat2b[!is.na(dat2b[, "wc"]), ]
    
    ## Reassign names to family affiliation
    dat2b[, "fam.aff"] <- factor(lookup(as.character(dat2b[, "fam.aff"]), 
        levels(dat2b[, "fam.aff"])[1:3], qcv(Escalus, Capulet, Montague), 
        missing = NULL))
    
    ## Make dataframe with the beginning coordinates of each act
    beg_act <- do.call(rbind, lapply(with(turn_it(dat2b, "wc"), 
        split(turn_it(dat2b, "wc"), act)), function(x) {
            x[1, qcv(act, x1, y1, x2, y2)]
    }))
    
    
    keys <- sapply(split(1:nrow(dat2b), dat2b[, "act"]), head, 1)
    
    
    ani_dat <- turn_it(dat2b, "wc")
    yl <- range(ani_dat[, c("y1", "y2")])
    xl <- range(ani_dat[, c("x1", "x2")])
    
    ## An animation base function
    
    ani_sent <- function(i){
    
        base <- ggplot(ani_dat[1:i, ], aes(x = x1, y = y1, xend = x2, yend = y2)) + 
            geom_segment(aes(color=fam.aff), lineend = "butt", size=1) +
            guides(colour = guide_legend(override.aes = list(alpha = 1))) + 
            theme_few() + 
            scale_colour_few(name="Family\nAffiliation", drop = FALSE) +
            theme(axis.ticks = element_blank(), 
                axis.text = element_blank(),  
                axis.title= element_blank(),
                legend.position = c(.1, .85),
                legend.title.align = .5) +
            ggtitle("Romeo and Juliet Family\nAffiliation: Sentence Drawing") 
    
        addon1 <- geom_text(data=beg_act[i >= keys,], 
            aes(x = x1, y=y1, label = paste("Act", act)), 
            colour = "grey25", hjust = -.1, size=5, fontface = "bold") 
        addon2 <- geom_point(data=beg_act[i >= keys,], 
            aes(x = x1, y=y1), size=2.3, colour = "grey25") 
    
        base2 <- base + addon1 + addon2
        info <- ani_dat[i, c("tot", "act")]
        base3 <- base2 +  geom_rect(aes(xmin = -173, xmax = -79, ymin = -160, ymax = -110), 
            fill="white", colour="grey75") + 
            annotate("text", x = -150, y=-125, label = "ACT", 
                colour="grey75", size=4, fontface = "bold") + 
            annotate("text", x = -105, y=-125, label = "T.O.T.", 
                colour="grey75", size=4, fontface = "bold") +
            annotate("text", x = -150, y=-145, label = as.character(info[2]), 
                colour="grey75", size=4, fontface = "bold") + 
            annotate("text", x = -105, y=-145, label = as.character(info[1]),  
                colour="grey75", size=4, fontface = "bold")  +
            xlim(xl) + ylim(yl)
    
        print(base3)      
    }
    
    pp2 <- function(x=base, alph = .15){
        for(i in 1:nrow(ani_dat)){
            ani_sent(i)
            ani.pause()
        }
    }
    
    ## Plot it
    
    out <- file.path(getwd(), "sent3") ## Change this as needed
    
    saveVideo(pp2(), interval = 0.01, outdir = out, 
        ffmpeg = "C:/Program Files (x86)/ffmpeg-latest-win32-static/ffmpeg-20130306-git-28adecf-win32-static/bin/ffmpeg.exe")
    
    saveHTML(pp2(), autoplay = FALSE, loop = FALSE, verbose = FALSE, outdir = out,
        single.opts = "'controls': ['first', 'previous', 'play', 'next', 'last', 'loop', 'speed'], 'delayMin': 0")
    

    For more of an intro to animations, see this blog post.


    *Blog post created using the reports package


    Sentence Drawing: Function vs. Art

    I recently was reading the book “The Functional Art” and came across the work of Stefanie Posavec. Her Sentence Drawings (click here to see and click here to learn) caught my attention. Here is a ggplot2 rendition:

    From what I understand about this visualization technique it’s meant to show the aesthetic and organic beauty of language (click here for interview with artist). I was captivated and thus I began the journey of using ggplot2 to recreate a Sentence Drawing.


    Getting Started

    I decided to use data sets from the qdap package.

    Installing Packages from GitHub

    # install.packages("devtools")
    library(devtools)
    install_github("ggthemes", "jrnold")
    install.packages("qdap")
    install.packages("scales")
    
    invisible(lapply(c("qdap", "ggplot2", "ggthemes", "scales", "grid"), 
        require, character.only = TRUE))
    

    Right Turn Function

    Stefanie Posavec describes the process for creating the Sentence Drawing as making a right turn at the end of each sentence. I went straight to work creating an inefficient solution for making right-hand turns. Realizing the inefficiency, I asked for help and utilized this response from flodel. Here is the solution as a function that you’ll need to run.

    turn_it <- function(dataframe, len.col, turn = -pi/2) {
    
        dat <- dataframe
        ## turn the same fixed amount (default: a right angle) after every segment
        dat[, "turn"] <- rep(turn, nrow(dataframe))
        dat <- within(dat, { 
            ## cumulative heading; each move is a complex number of length
            ## len.col pointing in that heading
            facing <- pi/2 + cumsum(turn)
            move <- dat[, len.col] * exp(1i * facing)
            ## walking the path = cumulative sum of moves; the real and
            ## imaginary parts become the x/y segment endpoints
            position <- cumsum(move)
            x2 <- Re(position)
            y2 <- Im(position)
            x1 <- c(0, head(x2, -1))
            y1 <- c(0, head(y2, -1))
        })
    
        dat[, c("x1", "y1", "x2", "y2")] <- 
            lapply(dat[, c("x1", "y1", "x2", "y2")], round, digits=0)
        data.frame(dataframe, dat[, c("x1", "y1", "x2", "y2")])
    }
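
    The trick is complex arithmetic: each move is a complex number of the given length pointing in the current heading (exp(1i * facing)), and cumsum walks the path. A toy illustration of my own (four unit steps, turning right each time, tracing a square back to the origin):

    len <- rep(1, 4)
    facing <- pi/2 + cumsum(rep(-pi/2, 4))
    position <- cumsum(len * exp(1i * facing))
    round(data.frame(x = Re(position), y = Im(position)))
    
    ##   x  y
    ## 1 1  0
    ## 2 1 -1
    ## 3 0 -1
    ## 4 0  0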
    

    Plot It

    Here are the turns represented visually.

    n <- 15
    set.seed(11)
    (dat <- data.frame(id = paste("X", 1:n, sep="."), 
        lens=sample(1:25, n, replace=TRUE)))
    
    ##      id lens
    ## 1   X.1    7
    ## 2   X.2    1
    ## 3   X.3   13
    ## 4   X.4    1
    ## 5   X.5    2
    ## 6   X.6   24
    ## 7   X.7    3
    ## 8   X.8    8
    ## 9   X.9   23
    ## 10 X.10    4
    ## 11 X.11    5
    ## 12 X.12   12
    ## 13 X.13   23
    ## 14 X.14   22
    ## 15 X.15   19
    
    ggplot(turn_it(dat, "lens"), aes(x = x1, y = y1, xend = x2, yend = y2)) + 
        geom_segment(aes(color=id), size=3,lineend = "round") + 
        ylim(c(-40, 10)) + xlim(c(-20, 40))
    



    Apply to Romeo and Juliet

    Now that I had this accomplished I set to work with Romeo and Juliet.

    Setting Up a Data Set

    dat2b <- rajSPLIT
    dat2b$wc <- wc(rajSPLIT$dialogue)
    dat2b <- dat2b[!is.na(dat2b[, "wc"]), ]
    
    ## Reassign names to family affiliation
    dat2b[, "fam.aff"] <- factor(lookup(as.character(dat2b[, "fam.aff"]), 
        levels(dat2b[, "fam.aff"])[1:3], qcv(Escalus, Capulet, Montague), 
        missing = NULL))
    
    ## Make dataframe with the beginning coordinates of each act
    beg_act <- do.call(rbind, lapply(with(turn_it(dat2b, "wc"), 
        split(turn_it(dat2b, "wc"), act)), function(x) {
            x[1, qcv(act, x1, y1, x2, y2)]
    }))
    

    Romeo and Juliet Plotted

    ggplot(turn_it(dat2b, "wc"), aes(x = x1, y = y1, xend = x2, yend = y2)) + 
        geom_segment(aes(color=fam.aff), lineend = "butt", size=1) +
        #geom_point(x=0, y=0, size=5, shape="S") +
        #geom_point(data=dat4b, aes(x=-106, y=-273), size=5, shape="E") + 
        geom_point(data=beg_act, aes(x = x1, y=y1), size=2.3,
            colour = "grey25") +
        geom_text(data=beg_act, aes(x = x1, y=y1, label = paste("Act", act)), 
            colour = "grey25", hjust = -.1, size=5, fontface = "bold") +
        guides(colour = guide_legend(override.aes = list(alpha = 1))) + 
        theme_few() + 
        scale_colour_few(name="Family\nAffiliation") +
        theme(axis.ticks = element_blank(), 
            axis.text = element_blank(),  
            axis.title= element_blank(),
            legend.position = c(.1, .85),
            legend.title.align = .5) +
        ggtitle("Romeo and Juliet Family\nAffiliation: Sentence Drawing")
    


    After this I wanted to try to fill by sentence level polarity using a newer polarity (sentiment) algorithm from qdap.

    poldat <- polarity(dat2b[, "dialogue"])
    
    ggplot(turn_it(poldat[["all"]], "wc"), aes(colour=polarity)) + 
        geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2), 
            lineend = "round", size=1) + 
        theme_few() +
        theme(panel.background = element_rect(fill = "grey20"),
            axis.ticks = element_blank(), 
            axis.text = element_blank(),  
            axis.title= element_blank(),
            legend.direction = "horizontal",
            legend.title = element_text(colour="white"),
            legend.background = element_blank(),
            legend.text = element_text(colour="white"),
            legend.position = c(.80, .07))  + 
        scale_colour_gradient2(name="", low = muted("blue"), 
            mid = "white", high = muted("red"))  +
        guides(colour = guide_colorbar(barwidth = 11, barheight = .75)) +
        ggtitle("Sentence Polarity: Sentence Drawing")
    



    Thoughts…

    While I like the aesthetics and organic feel of Stefanie Posavec’s Sentence Drawings, I can’t help but ask what this is showing me; what does such a visual afford the audience? I concluded that it captures that language isn’t linear but recursive and intricately linked. Posavec describes the tight spirals as choppy and the extended ones as flowing and smooth. However, I believe there are better ways to capture this sentiment while still balancing the notion of organic recursivity with identifying structure.

    Visual representations, like this turn of talk plot below, capture meaningful patterns in the data and allow for comparisons but present the data as linear, when it really is not.

    out <- tot_plot(dat2b, "dialogue", grouping.var = "fam.aff",
        facet.vars = "act", tot=FALSE, plot = FALSE)
    
    out + theme(legend.position = "bottom") + 
        labs(fill="Family\nAffiliation")
    


    Again, there must be a balance between capturing the essence of language and understanding the structure. Perhaps using pre-attentive attributes in a meaningful way would be a start to allowing Posavec’s representation to be more useful in finding the narrative in the data. The right-hand turn she uses is arbitrary. I ask, what if the turn were meaningful, turning toward a particular demographic variable? I also could see the benefit of using Yihui’s animation package to show the fluid nature of the conversation. I may return to this blog post, but I invite others to attempt the challenge of showing something meaningful in the data while capturing the controlled chaos of language.

    Click here for a complete script of this blog post


    *Blog post created using the reports package


    Correspondence Analysis in R

    Correspondence analysis (from a layman’s perspective) is like principal components analysis for categorical data. It can be useful for discovering structure in this type of data. My friend Gianmarco Alberti, an archaeologist, has put together an in-depth web site detailing the history, use, and worked R examples of correspondence analysis. It’s like a FREE ebook :)
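
    If you want a quick taste before diving in, the ca package ships with the classic smoke data set (a minimal sketch of my own, not from the site):

    library(ca)
    data(smoke)   ## staff group x smoking level contingency table
    fit <- ca(smoke)
    summary(fit)  ## dimensions, inertia, row/column contributions
    plot(fit)     ## symmetric map: rows and columns in one plane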

    The website: http://cainarchaeology.weebly.com/

    Enjoy!!


    paste, paste0, and sprintf

    I find myself pasting urls and lots of little pieces together lately. Now paste is the standard go-to guy when you wanna glue some stuff together. But often I find myself pasting and getting stuff like this:

    paste(LETTERS)
    
     [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
    [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
    

    Rather than the desired…

    [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    

    When I get into those situations I think, “Oh, better use collapse instead”; but I never really think before using paste (that is, whether I collapse or sep and why). This is inefficient and causes me to lack the time to write quality articles for Fox News (JK for those taking me seriously). This tutorial will give some basic and clear direction about the following functions:

    paste(x)
    paste0(x)
    sprintf(x, y)
    

    paste

    paste has 3 arguments.

    paste(..., sep = " ", collapse = NULL)
    

    The ... is the stuff you want to paste together, and sep and collapse are the guys that get it done. There are three basic things I paste together:

    1. A bunch of individual character strings.
    2. 2 or more vectors pasted element for element.
    3. One vector smushed together.

    Here's an example of each, though not with the correct arguments (I'm building suspense here):

    paste("A", 1, "%")       #A bunch of individual character strings.
    paste(1:4, letters[1:4]) #2 or more vectors pasted element for element.
    paste(1:10)              #One vector smushed together.
    

    Here's the sep/collapse rule for each:

    1. A bunch of individual character strings – You want sep
    2. 2 or more vectors pasted element for element. – You want sep
    3. One vector smushed together. – Smushin requires collapse

    So here they are with the correct arguments:

    paste("A", 1, "%")       #A bunch of individual character strings.
    paste(1:4, letters[1:4]) #2 or more vectors pasted element for element.
    paste(1:10, collapse="") #One vector smushed together.
    

    This yields:

    > paste("A", 1, "%")       #A bunch of individual character strings.
    [1] "A 1 %"
    > paste(1:4, letters[1:4]) #2 or more vectors pasted element for element.
    [1] "1 a" "2 b" "3 c" "4 d"
    > paste(1:10, collapse="") #One vector smushed together.
    [1] "12345678910"
    

    paste0

    paste0 is short for:

    paste(..., sep="")
    

    So it allows us to be lazier and more efficient. I'm lazy so I use paste0 a lot.

    paste0("a", "b") == paste("a", "b", sep="")
    
    ## [1] TRUE
    

    'nuff said.


    sprintf

    I discovered this guy a while back but realized its value in pasting recently. Much of my work on the reports (Rinker, 2013) package requires that I piece together lots of chunks of url and insert user-specific pieces. This can be a nightmare with all the quotation marks. A typical take may look like this:

    person <-"Grover"
    action <-"flying"
    message(paste0("On ", Sys.Date(), " I realized ", person, " was...\n", action, " by the street"))
    
    ## On 2013-09-14 I realized Grover was... flying by the street
    

    No joke it took me 6 tries before I formatted that without an error (missing quotes, spaces, and commas).

    But we can use sprintf to make one string (less commas + less quotations marks = less errors) and feed the elements that may differ from user to user or time to time. Let's look at an example to see what I mean:

    person <-"Grover"
    action <-"flying"
    message(sprintf("On %s I realized %s was...\n%s by the street", Sys.Date(), person, action))
    
    ## On 2013-09-14 I realized Grover was... flying by the street
    

    Boom, first time. It's easy to figure out the spacing and there aren't the commas and quotation marks to deal with. Just use the %s marker to denote that some element goes there and then feed the elements in as additional arguments after the character string. For some applications sprintf is a superior choice over paste/paste0.


    Note that these are not extensive, all-encompassing rules but guides for general use. Also be aware that sprintf is even cooler than I demonstrated here.
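
    For a taste of that coolness, %s isn't the only marker. Here's a quick sketch of a few other format specifiers (see ?sprintf for the full list):

    sprintf("%d wins (%.1f%%) at $%05.2f", 3L, 42.857, 9.5)
    
    ## [1] "3 wins (42.9%) at $09.50"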

    *Created using the reports package


    References

    Rinker, T. W. (2013). reports: Package to assist in report writing. https://github.com/trinker/reports

    GitHub Package Ideas I Stole

    One of my favorite sources of good ideas is looking at the GitHub repositories of others and modeling my repos after the good ideas I see others doing. Here's Steve Jobs on stealing ideas:

    In the past few weeks I've spotted three simple things other maintainers are doing that I liked and which I promptly incorporated into my own repos. I wanted to shine a light on those three ideas.


    Versioning Semantics

    The first idea comes from Carl Boettiger, the maintainer of the knitcitations (Boettiger, 2013) package. In his NEWS file for knitcitations Carl provides the reader with his versioning semantics as seen below.

    Releases will be numbered with the following semantic versioning format:
    
    <major>.<minor>.<patch>
    
    And constructed with the following guidelines:
    
    * Breaking backward compatibility bumps the major (and resets the minor 
      and patch)
    * New additions without breaking backward compatibility bumps the minor 
      (and resets the patch)
    * Bug fixes and misc changes bumps the patch
    

    Until Yihui's blog post, I thought 1.0.0 implied maturity. To some maintainers this may be the case, but I choose to follow Carl's model, and I feel it is important to tell the user of your package what the versioning means. In fact, a major version bump may indicate an increased likelihood of bugs, not maturity. To you Carl I say “Well played sir!”


    Contact

    The second practice I picked up is Tal Galili's use of a Contact section included in the README.md file of the installr (Galili, 2013) package as seen below.

    Contact
    
    You are welcome to:
    
    * submit suggestions and bug-reports at: https://github.com/talgalili/installr/issues
    * send a pull request on: https://github.com/talgalili/installr/
    * compose a friendly e-mail to: tal.galili@gmail.com
    

    Simple but brilliant. Tal is explicit about where to submit both suggestions and bug reports, as a GitHub issues page often implies just bugs, not improvements. Because of the history of R, many users will attempt to contact you via email directly; thus you explain the same problem many times rather than answering it once in a public forum. Tal's format is direct yet diplomatic in that it directs the user to co-action rather than asking for fixes (see Yihui's blog post about this). Notice that email is listed last. This reminds me of my first grade teacher's mantra: “Ask three before me.”


    NEWS.md

    Last, I saw Yihui Xie's use of a NEWS.md file in the knitr (Xie, 2013) package. The typical NEWS file is plain text and boring. The inclusion of a NEWS.md is much prettier to look at and creates a better experience for the package user. Including a NEWS.md merely requires a quick file copy via:

    file.copy("NEWS", "NEWS.md")

    I then included NEWS.md in my .Rbuildignore to avoid clogging up CRAN needlessly. To you Yihui I say very nice, high five!
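
    For reference, .Rbuildignore entries are Perl-style regular expressions (one per line), so the line to add is just:

    ^NEWS\.md$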


    I hope these little stolen bits of goodness are useful to fellow idea thieves. Please be sure to provide feedback in the comments below.

    Blog post created using the reports (Rinker, 2013) package

    Get the .Rmd file here


    References

    Boettiger, C. (2013). knitcitations: Citations for knitr markdown files. https://github.com/cboettig/knitcitations

    Galili, T. (2013). installr: Functions for installing software from within R. https://github.com/talgalili/installr

    Rinker, T. W. (2013). reports: Package to assist in report writing. https://github.com/trinker/reports

    Xie, Y. (2013). knitr: A general-purpose package for dynamic report generation in R. http://yihui.name/knitr/

    How do I re-arrange??: Ordering a plot revisited

    Back in October of last year I wrote a blog post about reordering/rearranging plots. This was, and continues to be, a frequent question on listservs and R help sites. In light of my recent studies/presenting on The Mechanics of Data Visualization, based on the work of Stephen Few (2012, 2009), I realized I was remiss in explaining the ordering of variables from largest to smallest bar (particularly in Cleveland dot plots and bar plots). It is often much more meaningful to arrange (order) factor levels by the size of other numeric variable(s). This allows for easier pattern recognition than the standard alphabetic arrangement of levels.

    This post will take you through a demonstration of sorting bars/points on another variable; however, it assumes you already know that if you want to reorder/rearrange a plot you must reorder the factor levels (if you do not know this, see this blog post). We then explore my GitHub package plotflow to add efficiency to re-leveling in the workflow. After we learn how to sort by bar/point size we will look at an applied use. I will use ggplot2 because this is my go-to plotting system.


    Section 1: Reordering by Bar/Point Size

    Create a data set we can alter

    mtcars3 <- mtcars2 <- data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
    mtcars3$cyl <- mtcars2$cyl <- as.factor(mtcars2$cyl)
    head(mtcars2)
    
    ##                 car  mpg cyl disp  hp drat    wt  qsec vs am gear carb
    ## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
    ## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
    ## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
    ## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
    ## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
    ## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
    

    In this example it's difficult to find trends and patterns in the data.

    An Example of Unordered Bars/Points

    library(ggplot2)
    library(gridExtra)
    x <- ggplot(mtcars2, aes(y=car, x=mpg)) + 
        geom_point(stat="identity")
    
    y <- ggplot(mtcars2, aes(x=car, y=mpg)) + 
        geom_bar(stat="identity") + 
        coord_flip()
    
    grid.arrange(x, y, ncol=2)
    


    Below we use the levels argument of factor in conjunction with order to order the levels of car by miles per gallon (mpg).

    An Example of Ordered Bars/Points

    ## Relevel the cars by mpg
    mtcars3$car <- factor(mtcars2$car, levels=mtcars2[order(mtcars2$mpg), "car"])
    
    x <- ggplot(mtcars3, aes(y=car, x=mpg)) + 
        geom_point(stat="identity")
    
    y <- ggplot(mtcars3, aes(x=car, y=mpg)) + 
        geom_bar(stat="identity") + 
        coord_flip()
    
    grid.arrange(x, y, ncol=2)
    


    This is an example where each of the factor's levels has a unique row. This is not always the case. For instance, if we were to use mtcars2$cyl rather than mtcars2$car as the factor, we'd have multiple observations for each cylinder level. In these instances we'd most likely order by some summarizing variable, as seen in the ordering of mtcars2$carb by average mpg below.

    An Example of Bars/Points Ordered by a Summary Statistic

    ## Relevel the carb by average mpg
    (ag_mtcars <- aggregate(mpg ~ carb, mtcars3, mean))
    
    ##   carb   mpg
    ## 1    1 25.34
    ## 2    2 22.40
    ## 3    3 16.30
    ## 4    4 15.79
    ## 5    6 19.70
    ## 6    8 15.00
    
    mtcars3$carb <- factor(mtcars2$carb, levels=ag_mtcars[order(ag_mtcars$mpg), "carb"])
    
    ggplot(mtcars3, aes(y=carb, x=mpg)) + 
        geom_point(stat="identity", size=2, aes(color=carb))
    


    The last plot in this section adds faceting to further draw distinction and allow for pattern recognition. The ordering of the facets can also be changed by reordering factor levels in a way that is sensible for representing the narrative the data is telling.

    An Example of Ordered and Faceted Bars/Points

    ggplot(mtcars3, aes(y=car, x=mpg)) + 
        geom_point(stat="identity") +
        facet_grid(cyl~., scales = "free", space="free")
    


    Recapping Section 1: Reordering by Bar/Point Size

    In this first section we learned:

    1. Ordering factors by a numeric variable increases the ability to recognize patterns
    2. We can have (a) one row per factor level or (b) multiple rows per factor level.
      • The first scenario requires feeding the dataframe with the levels reordered through order.
      • The second scenario requires some sort of aggregation by a summary statistic before using order and feeding to the levels argument of factor.
    3. Adding faceting can increase the ability to further find patterns among the ordered figure.

    Section 2: Speeding Up the Workflow With the plotflow Package

    Because I frequently need to reorder factors by other numeric variables, and using order and sometimes aggregate is tedious and annoying, I have wrapped this process up as a function called order_by in the plotflow package. I pretty much ripped off the entire function from Thomas Wutzler. This function allows the user to sort a dataframe by 1 or more numeric variables and return the new dataframe with a releveled factor. This is useful in that a new dataframe is created rather than tampering with the original. The function also allows a summary stat to be passed via the FUN argument in a similar fashion to aggregate. This approach saves typing and is more intuitive.

    Getting the plotflow package

    To get plotflow you can install the devtools package and use the install_github function:

    # install.packages("devtools")
    
    library(devtools)
    install_github("plotflow", "trinker")
    

    What Does order_by do?

    library(plotflow)
    dat <- aggregate(cbind(mpg, hp, disp) ~ carb, mtcars, mean)
    dat$carb <- factor(dat$carb)
    
    ## compare levels (data set looks the same though)
    dat$carb
    
    ## [1] 1 2 3 4 6 8
    ## Levels: 1 2 3 4 6 8
    
    order_by(carb, ~-hp + -mpg, data = dat)$carb
    
    ## [1] 1 2 3 4 6 8
    ## Levels: 8 4 3 6 2 1
    

    By default order_by returns a dataframe; however, we can also tell order_by to return a vector by setting df=FALSE.

    ## Return just the vector with new levels
    order_by(carb, ~ -hp + -mpg, dat, df=FALSE)
    
    ## [1] 1 2 3 4 6 8
    ## Levels: 8 4 3 6 2 1
    

    Let's see order_by in action.

    Use order_by to Order Bars

    library(ggplot2)
    
    ## Reset the data from Section 1
    dat2 <- data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
    ggplot(order_by(car, ~ mpg, dat2), aes(x=car, y=mpg)) + 
        geom_bar(stat="identity") + 
        coord_flip() + ggtitle("Order Pretty Easy")
    

    plot of chunk order5

    Aggregated by Summary Stat

    Carb Ordered By Summary (Mean) of mpg

    library(gridExtra)  # needed here for grid.arrange
    
    ## Ordered points with the order_by function
    a <- ggplot(order_by(carb, ~ mpg, dat2, mean), aes(x=carb, y=mpg)) +
        geom_point(stat="identity", aes(colour=carb)) +
        coord_flip() + ggtitle("Ordered Dot Plots Made Easy")
    
    ## Reverse the ordered points
    b <- ggplot(order_by(carb, ~ -mpg, dat2, mean), aes(x=carb, y=mpg)) +
        geom_point(stat="identity", aes(colour=carb)) +
        coord_flip() + ggtitle("Reverse Order Too!")
    
    grid.arrange(a, b, ncol=1)
    

    plot of chunk order6

    Nested Usage (order_by on an order_by dataframe)

    ggplot(order_by(gear, ~mpg, dat2, mean), aes(mpg, carb)) +
        geom_point(aes(color=factor(cyl))) +
        facet_grid(gear~., scales="free") + ggtitle("I'm Nested (Yay for me!)")
    

    plot of chunk order7
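
    Since order_by returns a dataframe by default, one call can also be fed straight into another. A hedged sketch of a fully nested call (releveling carb on the gear-releveled data; this assumes order_by's dataframe output is accepted as the data argument):

    ## Relevel gear by mean mpg, then relevel carb on the result
    nested <- order_by(carb, ~ mpg, order_by(gear, ~ mpg, dat2, mean), mean)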

    The order_by function makes life a little easier.


    Section 3: Using order_by on Real Data

    Now I turn attention to a real-life use of ordering a factor by a numeric variable in order to see patterns. A while back Abraham Mathew presented a blog post utilizing some interesting data on job satisfaction at large technology companies. His demonstrations showed various ways to utilize ggplot2 to visualize the data.

    As I read the post I was also reading a bit of Stephen Few's work, which recommends ordering bars/dotplots to better see patterns. This visualization, which Mathew produced with ggplot2, is captivating:

    However, I believed that ordering the bars as Few (2009, 2012) suggests might enhance our ability to see a pattern: which of the four variables are linked?

    In this next section we'll grab the data, clean it, reshape it, relevel the factors and plot in a more meaningful way to reveal patterns not seen before. Let's begin by loading the following packages:

    library(RCurl)
    library(XML)
    library(rjson)
    library(ggplot2)
    library(qdap)
    library(reshape2)
    library(gridExtra)
    

    Now we can scrape the data and extract the required pieces.

    URL <-"http://www.payscale.com/top-tech-employers-compared-2012/job-satisfaction-survey-data"
    doc   <-htmlTreeParse(URL, useInternalNodes=TRUE)
    nodes <-getNodeSet(doc, "//script[@type='text/javascript']")[[19]][[1]]
    dat <-gsub("];", "]", capture.output(nodes)[5:27])
    ndat <-data.frame(do.call(rbind, fromJSON(paste(dat, collapse = ""))))[, -2]
    ndat[, 1:5] <-lapply(ndat, unlist)
    IBM <-grepl("International Business Machines", ndat[, 1])
    ndat[IBM, 1] <-bracketXtract(ndat[IBM, 1])
    ndat[, 1] <-sapply(strsplit(ndat[, 1], "\\s|,"), "[", 1)
    

    At this point we relevel the factor level Employer.Name by job satisfaction.

    ## Re-level with order_by
    ndat[, "Employer.Name"] <-order_by(Employer.Name, ~Job.Satisfaction, ndat, df=FALSE)
    colnames(ndat)[1] <-"Employer"
    ndat
    
    ##           Employer Job.Satisfaction Work.Stress Job.Meaning Job.Flexibility
    ## 1            Adobe           0.6875      0.7031      0.4532          0.8594
    ## 2       Amazon.com           0.7723      0.7010      0.4901          0.7376
    ## 3              AOL           0.7714      0.6572      0.4118          0.7714
    ## 4            Apple           0.7800      0.6510      0.7114          0.7567
    ## 5             Dell           0.6890      0.6275      0.4983          0.8712
    ## 6             eBay           0.7097      0.6087      0.5824          0.8153
    ## 7         Facebook           0.8750      0.6875      0.8125          0.9375
    ## 8           Google           0.7987      0.5660      0.6387          0.8334
    ## 9  Hewlett-Packard           0.5807      0.6034      0.4335          0.8733
    ## 10           Intel           0.7339      0.6677      0.6892          0.8896
    ## 11             IBM           0.6414      0.6637      0.4631          0.8946
    ## 12        LinkedIn           1.0000      0.6923      0.8462          0.9166
    ## 13       Microsoft           0.6777      0.6181      0.6099          0.9281
    ## 14     Monster.com           0.7273      0.8181      0.5454          0.8181
    ## 15           Nokia           0.7400      0.4800      0.5600          0.8200
    ## 16          Nvidia           0.7692      0.5897      0.5385          0.7692
    ## 17          Oracle           0.6713      0.6406      0.4221          0.9218
    ## 18  Salesforce.com           0.8667      0.7334      0.6667          0.8275
    ## 19         Samsung           0.6596      0.7447      0.6595          0.6170
    ## 20            Sony           0.7500      0.6667      0.5217          0.8750
    ## 21          Yahoo!           0.6762      0.5333      0.5145          0.8750
    

    Now we can reshape the data to the long format that ggplot2 generally prefers.

    ## Melt the data to long format
    mdat <- melt(ndat)
    ## Swap the periods in the variable labels for spaces, preserving column order
    mdat[, 2] <- factor(gsub("\\.", " ", mdat[, 2]), 
        levels = gsub("\\.", " ", colnames(ndat)[-1]))
    
    head(mdat)
    
    ##     Employer         variable  value
    ## 1      Adobe Job Satisfaction 0.6875
    ## 2 Amazon.com Job Satisfaction 0.7723
    ## 3        AOL Job Satisfaction 0.7714
    ## 4      Apple Job Satisfaction 0.7800
    ## 5       Dell Job Satisfaction 0.6890
    ## 6       eBay Job Satisfaction 0.7097
    

    Now our data is cleaned and reshaped with Employer releveled by job satisfaction. I chose job satisfaction as the variable of interest because of literature I've read around job performance, teacher retention, and job satisfaction. Let's see if re-leveling the factor improves the trends and patterns we can see.

    ggplot(data=mdat, aes(x=Employer, y=value, fill=factor(Employer))) + 
      geom_bar(stat="identity") + coord_flip() + ylim(c(0, 1)) + 
      facet_wrap( ~ variable, ncol=2) + theme(legend.position="none") + 
      ggtitle("Plot 3: Employee Job Satisfaction at Top Tech Companies") +
      ylab("Job Satisfaction")
    

    plot of chunk order8

    The first thing I noticed after the reordering is that Job Meaning and Job Satisfaction appear to be related. In general, higher satisfaction corresponds with greater meaning. I also noticed that Flexibility and Stress do not appear to correspond with satisfaction. This made me curious and so I ran a simple regression model with Satisfaction as the outcome and the other three variables as predictors. The story from the regression model is similar to the visualization.

    mod <- lm(Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, data=ndat)
    mod
    
    ## 
    ## Call:
    ## lm(formula = Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, 
    ##     data = ndat)
    ## 
    ## Coefficients:
    ##     (Intercept)      Work.Stress      Job.Meaning  Job.Flexibility  
    ##          0.3101           0.1062           0.5241           0.0733
    
    anova(mod)
    
    ## Analysis of Variance Table
    ## 
    ## Response: Job.Satisfaction
    ##                 Df Sum Sq Mean Sq F value Pr(>F)    
    ## Work.Stress      1 0.0069  0.0069    1.45 0.2452    
    ## Job.Meaning      1 0.0816  0.0816   17.04 0.0007 ***
    ## Job.Flexibility  1 0.0006  0.0006    0.13 0.7260    
    ## Residuals       17 0.0814  0.0048                   
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    summary(mod)
    
    ## 
    ## Call:
    ## lm(formula = Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, 
    ##     data = ndat)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -0.12043 -0.03002 -0.00263  0.03268  0.11915 
    ## 
    ## Coefficients:
    ##                 Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)       0.3101     0.2413    1.29   0.2160    
    ## Work.Stress       0.1062     0.2147    0.49   0.6273    
    ## Job.Meaning       0.5241     0.1288    4.07   0.0008 ***
    ## Job.Flexibility   0.0733     0.2058    0.36   0.7260    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.0692 on 17 degrees of freedom
    ## Multiple R-squared:  0.523,  Adjusted R-squared:  0.438 
    ## F-statistic: 6.21 on 3 and 17 DF,  p-value: 0.00483
    

    The model accounts for ~50% of the variability in Job Satisfaction. While the model is significant, there is clearly more than just Meaning that impacts Satisfaction. I decided to do a bit more plotting and use the preattentive attributes of color and size to represent Flexibility and Stress in the visual model.

    theplot <- ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) + 
        geom_smooth(method="lm", fill = "blue", alpha = .1, size=1) +  
        geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
        xlim(c(.4, .9)) +
        geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
        geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
        scale_colour_gradient(low="gold", high="red") 
    
    theplot
    

    plot of chunk order9

    There is certainly a pull by this group of tech companies (circled below) that may be an unaccounted-for variable in the model.

    theplot + annotation_custom(grob=circleGrob(r = unit(.4,"npc")), xmin=.47, xmax=.57, ymin=.72, ymax=.82)
    

    If we view the data as two separate smoothed regression lines we get a more predictable model. This again indicates a variable that we have not included.

    ndat$outs <- 1
    ndat$outs[ndat$Employer %in% qcv(AOL, Amazon.com, Nvidia, Sony)] <- 0
    
    ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) + 
        geom_smooth(method="lm", fill = "blue", alpha = .1, size=1, aes(group=outs)) +  
        geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
        xlim(c(.4, .9)) +
        geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
        geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
        scale_colour_gradient(low="gold", high="red") 
    

    plot of chunk order10


    We've learned:

    1. Re-leveling/re-ordering a factor by one or more numeric variables can lead to important pattern detection in data.
    2. The levels argument to factor is key to the reordering.
    3. order and sometimes aggregate allow the re-leveling to occur.
    4. The order_by function in the plotflow package can make re-leveling easier.
    5. Faceting can amplify the distinctions made by the re-leveling.

    *Created using the reports (Rinker, 2013) package


    References

    • Few, S. (2009) Now You See It: Simple Visualization Techniques for Quantitative Analysis.
    • Few, S. (2012) Show Me the Numbers: Designing Tables and Graphs to Enlighten.
    • Rinker, T. (2013) reports: Package to assist in report writing. http://github.com/trinker/reports