How do I re-arrange??: Ordering a plot revisited

Back in October of last year I wrote a blog post about reordering/rearranging plots. This was, and continues to be, a frequent question on listservs and R help sites. In light of my recent studies/presenting on The Mechanics of Data Visualization, based on the work of Stephen Few (2009, 2012), I realized I was remiss in not explaining how to order variables from largest to smallest bar (particularly in Cleveland dot plots and bar plots). It is often much more meaningful to arrange (order) factor levels by the size of some other numeric variable(s). This allows for easier pattern recognition than the standard alphabetic arrangement of levels.

The post will take you through a demonstration of sorting bars/points on another variable; however, it assumes you already know that if you want to reorder/rearrange in a plot you must reorder the factor levels (if you do not know this see this blog post). We then explore my GitHub package plotflow to add efficiency to re-leveling in the workflow. After we learn how to sort by bar/point size we will look at an applied use. I will use ggplot2 because this is my go-to plotting system.


Section 1: Reordering by Bar/Point Size

Create a data set we can alter

mtcars3 <-mtcars2 <-data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
mtcars3$cyl  <-mtcars2$cyl <-as.factor(mtcars2$cyl)
head(mtcars2)
##                 car  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In this example it's difficult to find trends and patterns in the data.

An Example of Unordered Bars/Points

library(ggplot2)
library(gridExtra)
x <-ggplot(mtcars2, aes(y=car, x=mpg)) + 
    geom_point(stat="identity")

y <-ggplot(mtcars2, aes(x=car, y=mpg)) + 
    geom_bar(stat="identity") + 
    coord_flip()

grid.arrange(x, y, ncol=2)

plot of chunk order1

Below we use the levels argument to factor in conjunction with order to order the levels of car by miles per gallon (mpg).

An Example of Ordered Bars/Points

## Relevel the cars by mpg
mtcars3$car <-factor(mtcars2$car, levels=mtcars2[order(mtcars2$mpg), "car"])

x <-ggplot(mtcars3, aes(y=car, x=mpg)) + 
    geom_point(stat="identity")

y <-ggplot(mtcars3, aes(x=car, y=mpg)) + 
    geom_bar(stat="identity") + 
    coord_flip()

grid.arrange(x, y, ncol=2)

plot of chunk order2
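
One way to confirm that the releveling did what we wanted is to inspect the levels directly rather than relying on the plot alone. The sketch below rebuilds the car/mpg data under a new name (cars_check and mpg_by_level are names made up for this illustration) and checks that walking through the levels visits mpg in non-decreasing order.

```r
## Rebuild the car/mpg data; cars_check is a made-up name for this sketch
cars_check <- data.frame(car = rownames(mtcars), mpg = mtcars$mpg,
    stringsAsFactors = FALSE)

## Relevel car by mpg, in the same way as the example above
cars_check$car <- factor(cars_check$car,
    levels = cars_check[order(cars_check$mpg), "car"])

## Peek at the lowest and highest mpg cars
head(levels(cars_check$car), 3)
tail(levels(cars_check$car), 3)

## mpg read off in level order should never decrease
mpg_by_level <- cars_check$mpg[match(levels(cars_check$car), cars_check$car)]
stopifnot(!is.unsorted(mpg_by_level))
```

If the final check fails, the levels and the numeric variable have fallen out of step, usually because they came from differently sorted data frames.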

This is an example where each of a factor's levels has a unique row. This is not always the case. For instance, if we were to use mtcars2$cyl rather than mtcars2$car as the factor we'd have multiple observations for each cylinder level. In these instances we'd most likely order by some summarizing variable, as seen in the ordering of mtcars2$carb by average mpg below.

An Example of Points Ordered by a Summary Statistic

## Relevel the carb by average mpg
(ag_mtcars <-aggregate(mpg ~ carb, mtcars3, mean))
##   carb   mpg
## 1    1 25.34
## 2    2 22.40
## 3    3 16.30
## 4    4 15.79
## 5    6 19.70
## 6    8 15.00
mtcars3$carb <-factor(mtcars2$carb, levels=ag_mtcars[order(ag_mtcars$mpg), "carb"])

ggplot(mtcars3, aes(y=carb, x=mpg)) + 
    geom_point(stat="identity", size=2, aes(color=carb))

plot of chunk order3

The last plot in this section adds faceting to further draw distinction and allow for pattern recognition. The ordering of the facets can also be changed by reordering factor levels in a way that is sensible for representing the narrative the data is telling.

An Example of Ordered and Faceted Bars/Points

ggplot(mtcars3, aes(y=car, x=mpg)) + 
    geom_point(stat="identity") +
    facet_grid(cyl~., scales = "free", space="free")

plot of chunk order4
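
The facet order follows the levels of the faceting factor, so the same releveling idea applies to cyl. Here is a minimal sketch, assuming we want the panels to run from the lowest to the highest average mpg. The names cars_facet and cyl_means are introduced only for this illustration, and the plot is drawn only if ggplot2 is installed.

```r
## Fresh copy of the data; cars_facet is a made-up name for this sketch
cars_facet <- data.frame(car = rownames(mtcars), mtcars, row.names = NULL)

## Mean mpg per cylinder group
cyl_means <- aggregate(mpg ~ cyl, cars_facet, mean)

## Levels of cyl sorted by increasing mean mpg
cars_facet$cyl <- factor(cars_facet$cyl,
    levels = cyl_means[order(cyl_means$mpg), "cyl"])
levels(cars_facet$cyl)

## Order the cars within each panel by mpg, as before
cars_facet$car <- factor(cars_facet$car,
    levels = cars_facet[order(cars_facet$mpg), "car"])

## Draw the faceted plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
    library(ggplot2)
    p <- ggplot(cars_facet, aes(y = car, x = mpg)) +
        geom_point() +
        facet_grid(cyl ~ ., scales = "free", space = "free")
    print(p)
}
```

With these levels the 8 cylinder panel comes first and the 4 cylinder panel last, the reverse of the default numeric order.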

Recapping Section 1: Reordering by Bar/Point Size

In this first section we learned:

  1. Ordering factors by a numeric variable increases the ability to recognize patterns
  2. We can have (a) one row per factor level or (b) multiple rows per factor level.
    • The first scenario requires feeding the dataframe with the levels reordered through order.
    • The second scenario requires some sort of aggregation by summary statistic before using order and feeding to the levels argument of factor.
  3. Adding faceting can increase the ability to further find patterns among the ordered figure.
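
As a reader points out in the comments on this post, base R's reorder() function (in the stats package) covers much of this ground. Below is a minimal sketch of both scenarios; the data frame name cars_ro is made up for this illustration.

```r
## Fresh copy of the data; cars_ro is a made-up name for this sketch
cars_ro <- data.frame(car = rownames(mtcars), mtcars, row.names = NULL)

## (a) One row per level: order car by mpg
cars_ro$car <- reorder(cars_ro$car, cars_ro$mpg)

## (b) Several rows per level: order carb by mean mpg
cars_ro$carb <- reorder(factor(cars_ro$carb), cars_ro$mpg, FUN = mean)
levels(cars_ro$carb)
```

To reverse the order, negate the numeric variable, e.g. reorder(cars_ro$car, -cars_ro$mpg).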

Section 2: Speeding Up the Workflow With the plotflow Package

Because I frequently need to reorder factors by other numeric variables, and using order and sometimes aggregate is tedious and annoying, I have wrapped this process up as a function called order_by in the plotflow package. I pretty much ripped off the entire function from Thomas Wutzler. This function allows the user to sort a dataframe by 1 or more numeric variables and return the new dataframe with a releveled factor. This is useful in that a new dataframe is created rather than tampering with the original. The function also allows for a summary stat to be passed via the FUN argument in a similar fashion as aggregate. This approach saves typing and is more intuitive.
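
To make the idea concrete, here is a rough base R sketch of what such a helper might look like. It is not the plotflow implementation: relevel_by, its arguments, and its restriction to a single numeric variable are all assumptions made for illustration.

```r
## Hypothetical helper, NOT part of plotflow: returns a copy of `data`
## with the column `fact` turned into a factor whose levels are sorted
## by a summary (FUN) of the numeric column `by`
relevel_by <- function(data, fact, by, FUN = mean, decreasing = FALSE) {
    ## One summary value per level of the factor
    scores <- tapply(data[[by]], data[[fact]], FUN)
    ## Level names sorted by their summary value
    new_levels <- names(sort(scores, decreasing = decreasing))
    data[[fact]] <- factor(data[[fact]], levels = new_levels)
    data
}

## Relevel carb by mean mpg; dat_demo is a made-up name
dat_demo <- relevel_by(mtcars, "carb", "mpg", mean)
levels(dat_demo$carb)

## The original data frame is left untouched
is.factor(mtcars$carb)
```

Returning a modified copy rather than changing the input mirrors the design choice described above: the releveled data can be passed straight to a plotting call without disturbing the original object.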

Getting the plotflow package

To get plotflow you can install the devtools package and use the install_github function:

# install.packages("devtools")

library(devtools)
install_github("trinker/plotflow")

What Does order_by do?

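library(plotflow)
## Note: later versions of plotflow rename order_by to reorder_by
## (see the author's reply in the comments at the end of this post)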
dat <-aggregate(cbind(mpg, hp, disp)~carb, mtcars, mean)
dat$carb <-factor(dat$carb)

## compare levels (data set looks the same though)
dat$carb
## [1] 1 2 3 4 6 8
## Levels: 1 2 3 4 6 8
order_by(carb, ~-hp + -mpg, data = dat)$carb
## [1] 1 2 3 4 6 8
## Levels: 8 4 3 6 2 1

By default order_by returns a dataframe; however, we can also tell order_by to return a vector by setting df=FALSE.

## Return just the vector with new levels
order_by(carb, ~ -hp + -mpg, dat, df=FALSE)
## [1] 1 2 3 4 6 8
## Levels: 8 4 3 6 2 1

Let's see order_by in action.

Use order_by to Order Bars

library(ggplot2)

## Reset the data from Section 1
dat2 <-data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
ggplot(order_by(car, ~ mpg, dat2), aes(x=car, y=mpg)) + 
    geom_bar(stat="identity") + 
    coord_flip() + ggtitle("Order Pretty Easy")

plot of chunk order5

Aggregated by Summary Stat

Carb Ordered By Summary (Mean) of mpg

## Ordered points with the order_by function
a <-ggplot(order_by(carb, ~ mpg, dat2, mean), aes(x=carb, y=mpg)) +
    geom_point(stat="identity", aes(colour=carb)) +
    coord_flip() + ggtitle("Ordered Dot Plots Made Easy")

## Reverse the ordered points
b <-ggplot(order_by(carb, ~ -mpg, dat2, mean), aes(x=carb, y=mpg)) +
    geom_point(stat="identity", aes(colour=carb)) +
    coord_flip() + ggtitle("Reverse Order Too!")

grid.arrange(a, b, ncol=1)

plot of chunk order6

Nested Usage (order_by on an order_by dataframe)

ggplot(order_by(gear, ~mpg, dat2, mean), aes(mpg, carb)) +
    geom_point(aes(color=factor(cyl))) +
    facet_grid(gear~., scales="free") + ggtitle("I'm Nested (Yay for me!)")

plot of chunk order7

The order_by function makes life a little easier.


Section 3: Using order_by on Real Data

Now I turn to a real-life use of ordering a factor by a numeric variable in order to see patterns. A while back Abraham Mathew presented a blog post utilizing some interesting data on job satisfaction within bigger technology companies. His demonstrations showed various ways to utilize ggplot2 to visualize the data.

As I read the post I was also reading a bit of Stephen Few's work, which recommends ordering bars/dotplots to better see patterns. This visualization, which Mathew produced with ggplot2, is captivating:

However, I believed that ordering the bars as Stephen Few (2009, 2012) suggests may enhance our ability to see a pattern: which of the four variables are linked?

In this next section we'll grab the data, clean it, reshape it, relevel the factors and plot in a more meaningful way to reveal patterns not seen before. Let's begin by loading the following packages:

library(RCurl)
library(XML)
library(rjson)
library(ggplot2)
library(qdap)
library(reshape2)
library(gridExtra)

Now we can scrape the data and extract the required pieces.

URL <-"http://www.payscale.com/top-tech-employers-compared-2012/job-satisfaction-survey-data"
doc   <-htmlTreeParse(URL, useInternalNodes=TRUE)
nodes <-getNodeSet(doc, "//script[@type='text/javascript']")[[19]][[1]]
dat <-gsub("];", "]", capture.output(nodes)[5:27])
ndat <-data.frame(do.call(rbind, fromJSON(paste(dat, collapse = ""))))[, -2]
ndat[, 1:5] <-lapply(ndat, unlist)
IBM <-grepl("International Business Machines", ndat[, 1])
ndat[IBM, 1] <-bracketXtract(ndat[IBM, 1])
ndat[, 1] <-sapply(strsplit(ndat[, 1], "\\s|,"), "[", 1)

At this point we relevel the factor Employer.Name by job satisfaction.

## Re-level with order_by
ndat[, "Employer.Name"] <-order_by(Employer.Name, ~Job.Satisfaction, ndat, df=FALSE)
colnames(ndat)[1] <-"Employer"
ndat
##           Employer Job.Satisfaction Work.Stress Job.Meaning Job.Flexibility
## 1            Adobe           0.6875      0.7031      0.4532          0.8594
## 2       Amazon.com           0.7723      0.7010      0.4901          0.7376
## 3              AOL           0.7714      0.6572      0.4118          0.7714
## 4            Apple           0.7800      0.6510      0.7114          0.7567
## 5             Dell           0.6890      0.6275      0.4983          0.8712
## 6             eBay           0.7097      0.6087      0.5824          0.8153
## 7         Facebook           0.8750      0.6875      0.8125          0.9375
## 8           Google           0.7987      0.5660      0.6387          0.8334
## 9  Hewlett-Packard           0.5807      0.6034      0.4335          0.8733
## 10           Intel           0.7339      0.6677      0.6892          0.8896
## 11             IBM           0.6414      0.6637      0.4631          0.8946
## 12        LinkedIn           1.0000      0.6923      0.8462          0.9166
## 13       Microsoft           0.6777      0.6181      0.6099          0.9281
## 14     Monster.com           0.7273      0.8181      0.5454          0.8181
## 15           Nokia           0.7400      0.4800      0.5600          0.8200
## 16          Nvidia           0.7692      0.5897      0.5385          0.7692
## 17          Oracle           0.6713      0.6406      0.4221          0.9218
## 18  Salesforce.com           0.8667      0.7334      0.6667          0.8275
## 19         Samsung           0.6596      0.7447      0.6595          0.6170
## 20            Sony           0.7500      0.6667      0.5217          0.8750
## 21          Yahoo!           0.6762      0.5333      0.5145          0.8750

Now we can reshape the data to long format, which ggplot2 prefers almost exclusively.

## Melt the data to long format
mdat <-melt(ndat)
mdat[, 2] <-factor(gsub("\\.", " ", mdat[, 2]), 
    levels = gsub("\\.", " ", colnames(ndat)[-1]))

head(mdat)
##     Employer         variable  value
## 1      Adobe Job Satisfaction 0.6875
## 2 Amazon.com Job Satisfaction 0.7723
## 3        AOL Job Satisfaction 0.7714
## 4      Apple Job Satisfaction 0.7800
## 5       Dell Job Satisfaction 0.6890
## 6       eBay Job Satisfaction 0.7097
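
For readers without reshape2, here is a minimal sketch of the same wide-to-long idea in base R. It uses only the first three employers and two of the measures from the table above; wide_demo, stacked and long_demo are names invented for this example.

```r
## A small wide data frame copied from the first rows of the table above
wide_demo <- data.frame(
    Employer = c("Adobe", "Amazon.com", "AOL"),
    Job.Satisfaction = c(0.6875, 0.7723, 0.7714),
    Work.Stress = c(0.7031, 0.7010, 0.6572),
    stringsAsFactors = FALSE
)

## stack() collapses the measure columns into `values` and `ind`
stacked <- stack(wide_demo[, -1])

## Repeat the employer names once per measure column
long_demo <- data.frame(
    Employer = rep(wide_demo$Employer, times = ncol(wide_demo) - 1),
    variable = stacked$ind,
    value = stacked$values,
    stringsAsFactors = FALSE
)
long_demo
```

Each employer now appears once per measure, which is the one-row-per-observation layout that the faceted plot relies on.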

Now our data is cleaned and reshaped with Employer releveled by job satisfaction. I chose this (job satisfaction) as the variable of interest because of literature I've read around job performance, teacher retention and job satisfaction. Let's see if re-leveling the factor improves the trends and patterns we can see.

ggplot(data=mdat, aes(x=Employer, y=value, fill=factor(Employer))) + 
  geom_bar(stat="identity") + coord_flip() + ylim(c(0, 1)) + 
  facet_wrap( ~ variable, ncol=2) + theme(legend.position="none") + 
  ggtitle("Plot 3: Employee Job Satisfaction at Top Tech Companies") +
  ylab(c("Job Satisfaction"))

plot of chunk order8

The first thing I noticed after the reordering is that Job Meaning and Job Satisfaction appear to be related. In general, higher satisfaction corresponds with greater meaning. I also noticed that Flexibility and Stress do not appear to correspond with satisfaction. This made me curious and so I ran a simple regression model with Satisfaction as the outcome and the other three variables as predictors. The story from the regression model is similar to the visualization.

mod <-lm(Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, data=ndat)
mod
## 
## Call:
## lm(formula = Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, 
##     data = ndat)
## 
## Coefficients:
##     (Intercept)      Work.Stress      Job.Meaning  Job.Flexibility  
##          0.3101           0.1062           0.5241           0.0733
anova(mod)
## Analysis of Variance Table
## 
## Response: Job.Satisfaction
##                 Df Sum Sq Mean Sq F value Pr(>F)    
## Work.Stress      1 0.0069  0.0069    1.45 0.2452    
## Job.Meaning      1 0.0816  0.0816   17.04 0.0007 ***
## Job.Flexibility  1 0.0006  0.0006    0.13 0.7260    
## Residuals       17 0.0814  0.0048                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod)
## 
## Call:
## lm(formula = Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, 
##     data = ndat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12043 -0.03002 -0.00263  0.03268  0.11915 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.3101     0.2413    1.29   0.2160    
## Work.Stress       0.1062     0.2147    0.49   0.6273    
## Job.Meaning       0.5241     0.1288    4.07   0.0008 ***
## Job.Flexibility   0.0733     0.2058    0.36   0.7260    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0692 on 17 degrees of freedom
## Multiple R-squared:  0.523,  Adjusted R-squared:  0.438 
## F-statistic: 6.21 on 3 and 17 DF,  p-value: 0.00483

The model accounts for ~50% of the variability in Job Satisfaction. While the model is significant, there clearly is more than just Meaning that impacts Satisfaction. I decided to do a bit more plotting and use the preattentive attributes of color and size to represent Flexibility and Stress in the visual model.

theplot <-ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) + 
    geom_smooth(method="lm", fill = "blue", alpha = .1, size=1) +  
    geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
    xlim(c(.4, .9)) +
    geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
    geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
    scale_colour_gradient(low="gold", high="red") 

theplot

plot of chunk order9

There is certainly a pull by this group of tech companies, which may reflect a variable unaccounted for in the model.

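## circleGrob(), unit() and gpar() come from the grid package
library(grid)
## fill = NA keeps the circle transparent (a reader reported a white fill otherwise)
theplot + annotation_custom(grob=circleGrob(r = unit(.4,"npc"), gp = gpar(fill = NA)), 
    xmin=.47, xmax=.57, ymin=.72, ymax=.82)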

If we view the data as two separate smoothed regression lines we get a more predictable model. This indicates a variable that we have not included.

ndat$outs <-1
ndat$outs[ndat$Employer %in% qcv(AOL, Amazon.com, Nvidia, Sony)] <-0

ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) + 
    geom_smooth(method="lm", fill = "blue", alpha = .1, size=1, aes(group=outs)) +  
    geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
    xlim(c(.4, .9)) +
    geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
    geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
    scale_colour_gradient(low="gold", high="red") 

plot of chunk order10


We've learned:

  1. Re-leveling/re-ordering a factor by a numeric variable(s) can lead to important pattern detection in data.
  2. The levels argument to factor is key to the reordering.
  3. order and sometimes aggregate allow the re-leveling to occur.
  4. The order_by function in the plotflow package can make re-leveling easier.
  5. Faceting can amplify the distinction made by the re-leveling.

*Created using the reports (Rinker, 2013) package


References

  • Stephen Few (2009). Now You See It: Simple Visualization Techniques for Quantitative Analysis.
  • Stephen Few (2012). Show Me the Numbers: Designing Tables and Graphs to Enlighten.
  • Tyler Rinker (2013). reports: Package to assist in report writing. http://github.com/trinker/reports

About tylerrinker

I am a Literacy PhD student with a bent for the quantitative and a passion for R.
This entry was posted in factor, ggplot2, Uncategorized, visualization, work flow.

7 Responses to How do I re-arrange??: Ordering a plot revisited

  1. very nicely done; thanks for sharing. I look forward to applying some of the techniques.

  2. jjap says:

    Very nice post. Perennial question indeed. ++

  3. ucfagls says:

    You could have achieved some of this using `reorder()` in the stats package that ships with R, could you not?

    • tylerrinker says:

      @ucfagls Yes but until you pointed it out I was unaware/or forgot about the function. That definitely makes life easier. Basically (besides returning a dataframe and working with multiple numeric vectors), I have recreated the `reorder()` function. Thanks for sharing.

  4. annoporci says:

    Thanks, that was cool. Two comments: First, I had to do [[20]] instead of [[19]] in getNodeSet. Secondly, the bit with circleGrob comes out with white color instead of transparent. Not to worry, I learned a lot!

  5. Hi, I am using R v3.1.0 with a mac os x 10.9 and I have followed the instructions to install the plotflow package. However, after installing and uploading the library I still cannot use the ‘order_by’ function. I get the following error:
    Error: could not find function “order_by”

    Any comments as to why this is happening?

    Thank you

    • tylerrinker says:

      Blame Hadley Wickham :-) I am a big fan of various packages Wickham has authored including the dplyr package. He has a function in there called `order_by` that’s pretty important. I renamed the plotflow function of the same name to `reorder_by` to avoid conflicts when I have both packages loaded.
