Back in October of last year I wrote a blog post about reordering/rearranging plots. This was, and continues to be, a frequent question on listservs and R help sites. In light of my recent studies/presenting on The Mechanics of Data Visualization, based on the work of Stephen Few (2009, 2012), I realized I was remiss in not explaining how to order variables from largest to smallest bar (particularly in Cleveland dot plots and bar plots). It is often much more meaningful to arrange (order) factor levels by the size of another numeric variable (or variables). This allows for easier pattern recognition than the standard alphabetic arrangement of levels.
This post will take you through a demonstration of sorting bars/points on another variable; however, it assumes you already know that if you want to reorder/rearrange a plot you must reorder the factor levels (if you do not know this, see this blog post). We then explore my GitHub package plotflow to add efficiency to re-leveling in the workflow. After we learn how to sort by bar/point size we will look at an applied use. I will use ggplot2 because it is my go-to plotting system.
Section 1: Reordering by Bar/Point Size
Create a data set we can alter
mtcars3 <- mtcars2 <- data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
mtcars3$cyl <- mtcars2$cyl <- as.factor(mtcars2$cyl)
head(mtcars2)
## car mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
In this example it's difficult to find trends and patterns in the data.
An Example of Unordered Bars/Points
library(ggplot2)
library(gridExtra)

x <- ggplot(mtcars2, aes(y=car, x=mpg)) +
    geom_point(stat="identity")

y <- ggplot(mtcars2, aes(x=car, y=mpg)) +
    geom_bar(stat="identity") +
    coord_flip()

grid.arrange(x, y, ncol=2)
Below we use the levels argument to factor, in conjunction with order, to order the levels of car by miles per gallon (mpg).
An Example of Ordered Bars/Points
## Relevel the cars by mpg
mtcars3$car <- factor(mtcars2$car, levels=mtcars2[order(mtcars2$mpg), "car"])

x <- ggplot(mtcars3, aes(y=car, x=mpg)) +
    geom_point(stat="identity")

y <- ggplot(mtcars3, aes(x=car, y=mpg)) +
    geom_bar(stat="identity") +
    coord_flip()

grid.arrange(x, y, ncol=2)
This is an example where each level of the factor has its own row. This is not always the case. For instance, if we were to use mtcars2$cyl rather than mtcars2$car as the factor, we'd have multiple observations for each cylinder level. In these instances we'd most likely order by some summarizing statistic, as seen below in the ordering of mtcars2$carb by average mpg.
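As a quick check of what the re-leveling above does, the sketch below rebuilds the factor with base R only and confirms that the level order follows mpg. The object names (`cars_df`, `ordered_levels`) are mine, introduced only for illustration.

```r
## Rebuild the data used above (base R only)
cars_df <- data.frame(car = rownames(mtcars), mtcars, row.names = NULL)

## order() returns the row indices that sort mpg from smallest to largest,
## so this is the vector of car names arranged by mpg
ordered_levels <- as.character(cars_df$car)[order(cars_df$mpg)]

## Passing that vector to `levels` fixes the order used on the plot axis
cars_df$car <- factor(cars_df$car, levels = ordered_levels)

## The first levels have the lowest mpg, the last levels the highest
head(levels(cars_df$car), 3)
tail(levels(cars_df$car), 3)
```

Only the level order changes; the rows of the data frame stay where they were, which is why the printed data set looks the same before and after.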
An Example of Bars/Points Ordered by a Summary Statistic
## Relevel the carb by average mpg
(ag_mtcars <- aggregate(mpg ~ carb, mtcars3, mean))
## carb mpg
## 1 1 25.34
## 2 2 22.40
## 3 3 16.30
## 4 4 15.79
## 5 6 19.70
## 6 8 15.00
mtcars3$carb <- factor(mtcars2$carb, levels=ag_mtcars[order(ag_mtcars$mpg), "carb"])

ggplot(mtcars3, aes(y=carb, x=mpg)) +
    geom_point(stat="identity", size=2, aes(color=carb))
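Base R's `reorder()` (in the stats package, and raised by a reader in the comments) can do the aggregate-then-order step in one call. A minimal sketch, with `cars_df` as an illustrative name:

```r
cars_df <- data.frame(car = rownames(mtcars), mtcars, row.names = NULL)

## reorder() applies FUN to mpg within each carb level and sorts the
## levels by that summary, smallest first
cars_df$carb <- reorder(factor(cars_df$carb), cars_df$mpg, FUN = mean)
levels(cars_df$carb)

## Negate the numeric variable to reverse the order
carb_rev <- reorder(cars_df$carb, -cars_df$mpg, FUN = mean)
levels(carb_rev)
```

This handles one numeric variable at a time and returns a factor rather than a data frame, so it covers the common case rather than everything the aggregate/order approach can do.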
The last plot in this section adds faceting to further draw distinction and allow for pattern recognition. The ordering of the facets can also be changed by reordering factor levels in a way that is sensible for representing the narrative the data is telling.
An Example of Ordered and Faceted Bars/Points
ggplot(mtcars3, aes(y=car, x=mpg)) +
    geom_point(stat="identity") +
    facet_grid(cyl~., scales = "free", space="free")
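Facet order follows the level order of the faceting factor, so the same re-leveling idea applies to the panels themselves. The sketch below orders `cyl` by mean mpg; the plotting step is guarded so the block still runs where ggplot2 is not installed. `cars_df` and `cyl_means` are illustrative names.

```r
cars_df <- data.frame(car = rownames(mtcars), mtcars, row.names = NULL)

## Order cars by mpg, as in the plots above
cars_df$car <- factor(cars_df$car,
    levels = as.character(cars_df$car)[order(cars_df$mpg)])

## Order the facets: cyl levels sorted by mean mpg, lowest first
cyl_means <- tapply(cars_df$mpg, cars_df$cyl, mean)
cars_df$cyl <- factor(cars_df$cyl, levels = names(sort(cyl_means)))
levels(cars_df$cyl)

## Draw the faceted plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
    library(ggplot2)
    p <- ggplot(cars_df, aes(y = car, x = mpg)) +
        geom_point() +
        facet_grid(cyl ~ ., scales = "free", space = "free")
    print(p)
}
```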
Recapping Section 1: Reordering by Bar/Point Size
In this first section we learned:
- Ordering factors by a numeric variable increases the ability to recognize patterns
- We can have (a) one row per factor level or (b) multiple rows per factor level.
- Adding faceting can increase the ability to further find patterns among the ordered figure.
Section 2: Speeding Up the Workflow With the plotflow Package
Because I frequently need to reorder factors by other numeric variables, and using order (and sometimes aggregate) is tedious and annoying, I have wrapped this process up as a function called order_by in the plotflow package. I pretty much ripped off the entire function from Thomas Wutzler. This function allows the user to sort a dataframe by one or more numeric variables and returns the new dataframe with a releveled factor. This is useful in that a new dataframe is created rather than tampering with the original. The function also allows a summary stat to be passed via the FUN argument, in a similar fashion to aggregate. This approach saves typing and is more intuitive.
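To make the idea concrete, here is a base R sketch of what such a wrapper might look like. `relevel_by` is a hypothetical helper written for this illustration; it is not plotflow's implementation and it handles a single numeric variable only.

```r
## Hypothetical helper (not plotflow's code): relevel the column `fact`
## by a summary (FUN) of the numeric column `by` and return a new data frame
relevel_by <- function(data, fact, by, FUN = mean, decreasing = FALSE) {
    ## One summary value per level of the factor
    scores <- tapply(data[[by]], data[[fact]], FUN)
    ## Drop levels that have no rows
    scores <- scores[!is.na(scores)]
    new_levels <- names(sort(scores, decreasing = decreasing))
    data[[fact]] <- factor(data[[fact]], levels = new_levels)
    ## R copies on modify, so the caller's data frame is left untouched
    data
}

dat <- data.frame(car = rownames(mtcars), mtcars, row.names = NULL)
levels(relevel_by(dat, "carb", "mpg")$carb)
levels(relevel_by(dat, "carb", "mpg", decreasing = TRUE)$carb)
```

Because the function returns a modified copy, it can be dropped straight into the data argument of a plotting call without changing the original data set.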
Getting the plotflow package
To get plotflow you can install the devtools package and use the install_github function:
# install.packages("devtools")
library(devtools)
install_github("trinker/plotflow")
What Does order_by do?
library(plotflow)

dat <- aggregate(cbind(mpg, hp, disp)~carb, mtcars, mean)
dat$carb <- factor(dat$carb)

## compare levels (data set looks the same though)
dat$carb
## [1] 1 2 3 4 6 8
## Levels: 1 2 3 4 6 8
order_by(carb, ~-hp + -mpg, data = dat)$carb
## [1] 1 2 3 4 6 8
## Levels: 8 4 3 6 2 1
By default order_by returns a dataframe; however, we can also tell order_by to return a vector by setting df=FALSE.
## Return just the vector with new levels
order_by(carb, ~ -hp + -mpg, dat, df=FALSE)
## [1] 1 2 3 4 6 8
## Levels: 8 4 3 6 2 1
Let's see order_by in action.
Use order_by to Order Bars
library(ggplot2)

## Reset the data from Section 1
dat2 <- data.frame(car=rownames(mtcars), mtcars, row.names=NULL)

ggplot(order_by(car, ~ mpg, dat2), aes(x=car, y=mpg)) +
    geom_bar(stat="identity") +
    coord_flip() +
    ggtitle("Order Pretty Easy")
Aggregated by Summary Stat
Carb Ordered By Summary (Mean) of mpg
## Ordered points with the order_by function
a <- ggplot(order_by(carb, ~ mpg, dat2, mean), aes(x=carb, y=mpg)) +
    geom_point(stat="identity", aes(colour=carb)) +
    coord_flip() +
    ggtitle("Ordered Dot Plots Made Easy")

## Reverse the ordered points
b <- ggplot(order_by(carb, ~ -mpg, dat2, mean), aes(x=carb, y=mpg)) +
    geom_point(stat="identity", aes(colour=carb)) +
    coord_flip() +
    ggtitle("Reverse Order Too!")

grid.arrange(a, b, ncol=1)
Nested Usage (order_by on an order_by dataframe)
ggplot(order_by(gear, ~mpg, dat2, mean), aes(mpg, carb)) +
    geom_point(aes(color=factor(cyl))) +
    facet_grid(gear~., scales="free") +
    ggtitle("I'm Nested (Yay for me!)")
The order_by function makes life a little easier.
Section 3: Using order_by on Real Data
Now I turn the attention to a real life usage of ordering a factor by a numeric variable in order to see patterns. A while back Abraham Mathew presented a blog post utilizing some interesting data on job satisfaction within bigger technology companies. His demonstrations showed various ways to utilize ggplot2 to visualize the data.
As I read the post I was also reading a bit of Stephen Few's work, which recommends ordering bars/dotplots to better see patterns. This visualization, which Mathew produced with ggplot2, is captivating:
However, I believed that ordering the bars as Stephen Few (2009, 2012) suggests might enhance our ability to see a pattern: which of the four variables are linked?
In this next section we'll grab the data, clean it, reshape it, relevel the factors and plot in a more meaningful way to reveal patterns not seen before. Let's begin by loading the following packages:
library(RCurl)
library(XML)
library(rjson)
library(ggplot2)
library(qdap)
library(reshape2)
library(gridExtra)
Now we can scrape the data and extract the required pieces.
URL <- "http://www.payscale.com/top-tech-employers-compared-2012/job-satisfaction-survey-data"
doc <- htmlTreeParse(URL, useInternalNodes=TRUE)
nodes <- getNodeSet(doc, "//script[@type='text/javascript']")[[19]][[1]]
dat <- gsub("];", "]", capture.output(nodes)[5:27])
ndat <- data.frame(do.call(rbind, fromJSON(paste(dat, collapse = ""))))[, -2]
ndat[, 1:5] <- lapply(ndat, unlist)
IBM <- grepl("International Business Machines", ndat[, 1])
ndat[IBM, 1] <- bracketXtract(ndat[IBM, 1])
ndat[, 1] <- sapply(strsplit(ndat[, 1], "\\s|,"), "[", 1)
At this point we relevel the factor Employer.Name by job satisfaction.
## Re-level with order_by
ndat[, "Employer.Name"] <- order_by(Employer.Name, ~Job.Satisfaction, ndat, df=FALSE)
colnames(ndat)[1] <- "Employer"
ndat
## Employer Job.Satisfaction Work.Stress Job.Meaning Job.Flexibility
## 1 Adobe 0.6875 0.7031 0.4532 0.8594
## 2 Amazon.com 0.7723 0.7010 0.4901 0.7376
## 3 AOL 0.7714 0.6572 0.4118 0.7714
## 4 Apple 0.7800 0.6510 0.7114 0.7567
## 5 Dell 0.6890 0.6275 0.4983 0.8712
## 6 eBay 0.7097 0.6087 0.5824 0.8153
## 7 Facebook 0.8750 0.6875 0.8125 0.9375
## 8 Google 0.7987 0.5660 0.6387 0.8334
## 9 Hewlett-Packard 0.5807 0.6034 0.4335 0.8733
## 10 Intel 0.7339 0.6677 0.6892 0.8896
## 11 IBM 0.6414 0.6637 0.4631 0.8946
## 12 LinkedIn 1.0000 0.6923 0.8462 0.9166
## 13 Microsoft 0.6777 0.6181 0.6099 0.9281
## 14 Monster.com 0.7273 0.8181 0.5454 0.8181
## 15 Nokia 0.7400 0.4800 0.5600 0.8200
## 16 Nvidia 0.7692 0.5897 0.5385 0.7692
## 17 Oracle 0.6713 0.6406 0.4221 0.9218
## 18 Salesforce.com 0.8667 0.7334 0.6667 0.8275
## 19 Samsung 0.6596 0.7447 0.6595 0.6170
## 20 Sony 0.7500 0.6667 0.5217 0.8750
## 21 Yahoo! 0.6762 0.5333 0.5145 0.8750
Now we can reshape the data to long format which ggplot2 prefers almost exclusively.
## Melt the data to long format
mdat <- melt(ndat)
mdat[, 2] <- factor(gsub("\\.", " ", mdat[, 2]),
    levels = gsub("\\.", " ", colnames(ndat)[-1]))
head(mdat)
## Employer variable value
## 1 Adobe Job Satisfaction 0.6875
## 2 Amazon.com Job Satisfaction 0.7723
## 3 AOL Job Satisfaction 0.7714
## 4 Apple Job Satisfaction 0.7800
## 5 Dell Job Satisfaction 0.6890
## 6 eBay Job Satisfaction 0.7097
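The point of re-leveling before melting is that the level order of the id column survives the reshape. Below is a minimal base R sketch of that, using three rows and two columns copied from the table above. The object names (`wide`, `stacked`, `long`) are mine, and `stack()` stands in for `melt()` here so the example has no package dependencies.

```r
## Three employers and two measures copied from the printed ndat table
wide <- data.frame(
    Employer = c("Adobe", "Amazon.com", "AOL"),
    Job.Satisfaction = c(0.6875, 0.7723, 0.7714),
    Work.Stress = c(0.7031, 0.7010, 0.6572),
    stringsAsFactors = FALSE
)

## Relevel Employer by job satisfaction, lowest first
wide$Employer <- factor(wide$Employer,
    levels = wide$Employer[order(wide$Job.Satisfaction)])

## Wide to long: one row per employer/measure pair
measures <- c("Job.Satisfaction", "Work.Stress")
stacked <- stack(wide[measures])
long <- data.frame(
    Employer = rep(wide$Employer, times = length(measures)),
    variable = factor(gsub("\\.", " ", stacked$ind),
        levels = gsub("\\.", " ", measures)),
    value = stacked$values
)

## The level order set on the wide data carries over to the long data
levels(long$Employer)
long
```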
Now our data is cleaned and reshaped, with Employer releveled by job satisfaction. I chose job satisfaction as the variable of interest because of literature I've read around job performance, teacher retention and job satisfaction. Let's see if re-leveling the factor improves the trends and patterns we can see.
ggplot(data=mdat, aes(x=Employer, y=value, fill=factor(Employer))) +
    geom_bar(stat="identity") +
    coord_flip() +
    ylim(c(0, 1)) +
    facet_wrap( ~ variable, ncol=2) +
    theme(legend.position="none") +
    ggtitle("Plot 3: Employee Job Satisfaction at Top Tech Companies") +
    ylab(c("Job Satisfaction"))
The first thing I noticed after the reordering is that Job Meaning and Job Satisfaction appear to be related. In general, higher satisfaction corresponds with greater meaning. I also noticed that Flexibility and Stress do not appear to correspond with satisfaction. This made me curious and so I ran a simple regression model with Satisfaction as the outcome and the other three variables as predictors. The story from the regression model is similar to the visualization.
mod <- lm(Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, data=ndat)
mod
##
## Call:
## lm(formula = Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility,
##     data = ndat)
##
## Coefficients:
##     (Intercept)      Work.Stress      Job.Meaning  Job.Flexibility
##          0.3101           0.1062           0.5241           0.0733
anova(mod)
## Analysis of Variance Table
##
## Response: Job.Satisfaction
##                 Df Sum Sq Mean Sq F value Pr(>F)
## Work.Stress      1 0.0069  0.0069    1.45 0.2452
## Job.Meaning      1 0.0816  0.0816   17.04 0.0007 ***
## Job.Flexibility  1 0.0006  0.0006    0.13 0.7260
## Residuals       17 0.0814  0.0048
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod)
##
## Call:
## lm(formula = Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility,
##     data = ndat)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.12043 -0.03002 -0.00263  0.03268  0.11915
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)       0.3101     0.2413    1.29   0.2160
## Work.Stress       0.1062     0.2147    0.49   0.6273
## Job.Meaning       0.5241     0.1288    4.07   0.0008 ***
## Job.Flexibility   0.0733     0.2058    0.36   0.7260
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0692 on 17 degrees of freedom
## Multiple R-squared: 0.523, Adjusted R-squared: 0.438
## F-statistic: 6.21 on 3 and 17 DF, p-value: 0.00483
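The multiple R-squared in the summary can be recovered from the ANOVA table as explained sum of squares over total sum of squares. The sketch below does that arithmetic with the printed values; because those are rounded to four decimals, the agreement is approximate. The names `ss_terms` and `ss_resid` are mine.

```r
## Sums of squares copied from the anova(mod) output above
ss_terms <- c(Work.Stress = 0.0069, Job.Meaning = 0.0816, Job.Flexibility = 0.0006)
ss_resid <- 0.0814
ss_total <- sum(ss_terms) + ss_resid

## R-squared = explained sum of squares / total sum of squares
r_squared <- sum(ss_terms) / ss_total
round(r_squared, 3)

## Share of the total attributed to each term. anova() reports sequential
## sums of squares, so these shares depend on the order of the terms
round(ss_terms / ss_total, 3)
```

Nearly all of the explained variability is attributed to Job.Meaning, which matches what the reordered bars suggest.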
The model accounts for roughly 52% of the variability in Job Satisfaction (multiple R-squared = 0.523). While the model is significant, there is clearly more than just Meaning that impacts Satisfaction. I decided to do a bit more plotting and use the preattentive attributes of color and size to represent Flexibility and Stress in the visual model.
theplot <- ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) +
    geom_smooth(method="lm", fill = "blue", alpha = .1, size=1) +
    geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
    xlim(c(.4, .9)) +
    geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
    geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
    scale_colour_gradient(low="gold", high="red")

theplot
There is certainly a pull by this group of tech companies, which may point to a variable that is unaccounted for in the model.
library(grid)  ## provides circleGrob()

theplot +
    annotation_custom(grob=circleGrob(r = unit(.4,"npc")),
        xmin=.47, xmax=.57, ymin=.72, ymax=.82)
If we view the data as two separate regression lines (one for AOL, Amazon.com, Nvidia and Sony, one for the remaining companies) we get a more predictable model. This suggests a variable that we have not included.
ndat$outs <- 1
ndat$outs[ndat$Employer %in% qcv(AOL, Amazon.com, Nvidia, Sony)] <- 0

ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) +
    geom_smooth(method="lm", fill = "blue", alpha = .1, size=1, aes(group=outs)) +
    geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
    xlim(c(.4, .9)) +
    geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
    geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
    scale_colour_gradient(low="gold", high="red")
We've learned:
- Re-leveling/re-ordering a factor by a numeric variable(s) can lead to important pattern detection in data.
- The levels argument to factor is key to the reordering.
- order and sometimes aggregate allow the re-leveling to occur.
- The order_by function in the plotflow package can make re-leveling easier.
- Faceting can amplify the distinction made by the re-leveling.
*Created using the reports (Rinker, 2013) package
References
- Stephen Few, (2009) Now You See It: Simple Visualization Techniques for Quantitative Analysis.
- Stephen Few, (2012) Show me the numbers: Designing tables and graphs to enlighten.
- Tyler Rinker, (2013) reports: Package to assist in report writing. http://github.com/trinker/reports
very nicely done; thanks for sharing. I look forward to applying some of the techniques.
Very nice post. Perennial question indeed. ++
You could have achieved some of this using `reorder()` in the stats package that ships with R, could you not?
@ucfagls Yes but until you pointed it out I was unaware/or forgot about the function. That definitely makes life easier. Basically (besides returning a dataframe and working with multiple numeric vectors), I have recreated the `reorder()` function. Thanks for sharing.
Thanks, that was cool. Two comments: First, I had to do [[20]] instead of [[19]] in getNodeSet. Secondly, the bit with circleGrob comes out with white color instead of transparent. Not to worry, I learned a lot!
Hi, I am using R v3.1.0 with a mac os x 10.9 and I have followed the instructions to install the plotflow package. However, after installing and uploading the library I still cannot use the ‘order_by’ function. I get the following error:
Error: could not find function “order_by”
Any comments as to why this is happening?
Thank you
Blame Hadley Wickham 🙂 I am a big fan of various packages Wickham has authored including the dplyr package. He has a function in there called `order_by` that’s pretty important. I renamed the plotflow function of the same name to `reorder_by` to avoid conflicts when I have both packages loaded.
Hi, thanks for sharing, you did a great job and its easy to follow! I was wondering if anyone can help me. I’m looking at the microbial diversity of patients specific days before and after treatment. I’m trying to make some bar plots, the values on my x-axis are discrete as I converted them to a factor. Any ideas of how I can sort/reorder levels in my data so I can plot it in an ascending manner.
Is there any way of reordering “count” data? i.e., stat=”count” rather than on “identity”? It makes the construction of the initial dataframe a lot easier than having to collate unique terms. However, I’ve yet to find an example or reordering via factored count data.