Ordering Categories within ggplot2 Facets

I saw Simon Jackson’s recent blog post on ordering categories within facets. He proposed a way to order a variable within each facet when its categories are shared across facets. The problem becomes apparent in text analysis, where the same words appear in several facets but differ in frequency or magnitude ordering within each one. Julia Silge and David Robinson flag this as a particularly vexing problem in a TODO comment in their tidy text book:

## TODO: make the ordering vary depending on each facet
## (not easy to fix)

Simon has provided a working approach, but it feels awkward: you convert the factors to numbers for plotting, adjust the spacing, and then put the labels back on. I believe there is a fairly straightforward tidy approach to this problem.

Terminology

Definitions of the terms I use to describe the solution:

  • category variable – the bar categories variable (terms in this case)
  • count variable – the bar heights variable
  • facet variable – the meta grouping used for faceting

Logic

I have dealt with the ordering within facets problem using this logic:

  1. Order the data rows by grouping on the facet variable and the categories variable and then arranging on the count variable in descending* order
  2. Ungroup
  3. Remake the categories variable by appending the facet label(s) to it as a delimited suffix (make sure the result is a factor with the levels reversed) [this maintains the ordering by making shared categories unique]
  4. Plot as usual**
  5. Remove the suffix you added previously using scale_x_discrete

*This allows you to take a slice of the top n terms if desired
**I prefer the ggstance geom_barh to the ggplot2 geom_bar + coord_flip, as the former lets me set y as the terms variable and the latter doesn’t always play nicely with free scales

This approach adds 5 lines of code (in the code below I number them with integer comments) and is, IMO, pretty easy to reason about. Here are the additional lines of code:

group_by(word1, word2) %>%                  
arrange(desc(contribution)) %>%                
ungroup() %>%
mutate(word2 = factor(paste(word2, word1, sep = "__"), levels = rev(paste(word2, word1, sep = "__")))) %>% 
    # --ggplot here--
    scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) 
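
To make steps 3 and 5 concrete, here is a minimal base R sketch of just the suffix mechanics. The terms and facet labels are made up for illustration and are assumed to be sorted by count already (steps 1 and 2); strip_suffix is a hypothetical helper name wrapping the same gsub call used above.

```r
# Hypothetical toy rows, assumed to be sorted by count in descending order already
terms  <- c("good", "like", "bad", "good", "bad")
facets <- c("not",  "not",  "not", "no",   "no")

# Step 3: append the facet label so that shared terms become unique keys, and
# pass the reversed row order as the levels. Supplying `levels` explicitly is
# what keeps factor() from falling back to alphabetical order.
keys    <- paste(terms, facets, sep = "__")
ordered <- factor(keys, levels = rev(keys))
levels(ordered)
# "bad__no" "good__no" "bad__not" "like__not" "good__not"

# Step 5: the labelling function strips the suffix again for display
strip_suffix <- function(x) gsub("__.+$", "", x)
strip_suffix(levels(ordered))
# "bad" "good" "bad" "like" "good"
```

Because "good__not" and "good__no" are distinct levels, each facet can place "good" at a different position while the axis still shows the plain term.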

Example

Let’s go ahead and use Julia and David’s example to demonstrate the technique:

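# Required libraries (p_load() comes from the pacman package)
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, tidytext, janeaustenr)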

# From section 5.1: Tokenizing by n-gram
austen_bigrams <- austen_books() %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2)

# From section 5.1.1: Counting and filtering n-grams
bigrams_separated <- austen_bigrams %>%
    separate(bigram, c("word1", "word2"), sep = " ")

	
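# From section 5.1.3: Using bigrams to provide context in sentiment analysis
# Note: newer tidytext releases name the AFINN column `value` rather than `score`;
# if that applies to your version, replace `score` with `value` below
AFINN <- get_sentiments("afinn")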
negation_words <- c("not", "no", "never", "without")

negated_words <- bigrams_separated %>%
    filter(word1 %in% negation_words) %>%
    inner_join(AFINN, by = c(word2 = "word")) %>%
    count(word1, word2, score, sort = TRUE) %>%
    ungroup()


# Create plot
negated_words %>%
    mutate(contribution = n * score) %>%
    mutate(word2 = reorder(word2, contribution)) %>%
    group_by(word1) %>%
    top_n(10, abs(contribution)) %>%
    group_by(word1, word2) %>%                                    #1
    arrange(desc(contribution)) %>%                               #2
    ungroup() %>%                                                 #3
    mutate(word2 = factor(paste(word2, word1, sep = "__"), levels = rev(paste(word2, word1, sep = "__")))) %>% #4
    ggplot(aes(word2, contribution, fill = n * score > 0)) +
        geom_bar(stat = "identity", show.legend = FALSE) +
        facet_wrap(~ word1, scales = "free") +
        xlab("Words preceded by negation") +
        ylab("Sentiment score * # of occurrences") +
        theme_bw() +
        coord_flip() +
        scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) #5
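
If you want to try the technique without the Austen corpus or the AFINN lexicon, the same five lines can be sketched on a small made-up data frame. Everything in toy (the column names facet, term and count, and all the values) is invented for illustration; only the numbered lines mirror the approach above.

```r
library(dplyr)
library(ggplot2)

# Made-up data for illustration only: two facets that share "good" and "bad"
toy <- data.frame(
    facet = c("not", "not", "not", "no", "no", "no"),
    term  = c("good", "like", "bad", "good", "doubt", "bad"),
    count = c(30, 20, -10, 5, 25, -15),
    stringsAsFactors = FALSE
)

toy_ordered <- toy %>%
    group_by(facet, term) %>%                                         #1
    arrange(desc(count)) %>%                                          #2
    ungroup() %>%                                                     #3
    mutate(term = factor(paste(term, facet, sep = "__"),
        levels = rev(paste(term, facet, sep = "__"))))                #4

p <- ggplot(toy_ordered, aes(term, count, fill = count > 0)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    facet_wrap(~ facet, scales = "free") +
    coord_flip() +
    scale_x_discrete(labels = function(x) gsub("__.+$", "", x))       #5
p
```

With scales = "free" each panel keeps only its own levels, so "good" tops the "not" panel while "doubt" tops the "no" panel.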

About tylerrinker

I am a Literacy PhD student with a bent for the quantitative and a passion for R.

14 Responses to Ordering Categories within ggplot2 Facets

  1. Nice workaround for a common problem without an obvious solution.

    * fill = contribution>0 instead of recalculating it would streamline the code a bit further.
    * group_by(word1, word2) could be removed I think, since grouping is no longer required after the top_n
    * Also the initial `mutate(word2 = reorder(word2, contribution))` could be removed imho.

    Summing up my comments, a simplified solution (leading to the same result) could be:

    negated_words %>%
    mutate(contribution = n * score) %>%
    group_by(word1) %>%
    top_n(10, abs(contribution)) %>%
    arrange(desc(contribution)) %>% # 1
    mutate(word2 = paste(word2, word1, sep = "__") %>% factor(. , levels = rev(.))) %>% #2
    ggplot(aes(word2, contribution, fill = contribution > 0)) +
    … rest of plot

    • tylerrinker says:

      Yes agreed for this specific code. The five lines were the main point… to show a generalizable approach.

    • Daniel Aiken says:

      is the period used in the line of code below denoting the newly mutated “word2”?
      factor(. , levels = rev(.)))

    • Daniel Aiken says:

      Please disregard my reply made January 12, 2017 at 2:09 pm. My confusion stems from reversing the factor levels. Your code works great, but I cannot make sense of the ordering. After working through the code a few times, my intuition still tells me that the mutated word2 variable, when converted to a factor, would have levels arranged in alphabetical order. The “rev(.)” would then put the “word2” levels in reverse alphabetical order. This post has been extremely helpful and I foresee myself using this trick on future problems. However, I would like to fully understand how it works before moving on. I appreciate any help you are willing to provide!

      • tylerrinker says:

        You are correct. We want the levels reversed because of the way ggplot2 currently handles plotting the levels. This code will illuminate:

        library(tidyverse)
        dat <- tibble(  # tibble() replaces the deprecated data_frame()
            a = factor(LETTERS[1:5], levels = LETTERS[1:5]),
            b = factor(a, levels = rev(levels(a))),
            x = 1:5
        )
        
        dat %>%
            ggplot(aes(x=x, y=a)) +
                geom_point()
        dat %>%
            ggplot(aes(x=x, y=b)) +
                geom_point()
  2. Hector Alvaro says:

    Hi tylerrinker:

    What’s up?
    Great publication, as always.
    I am just building my knowledge of text mining and sentiment analysis. I found this publication and I am also following the online book “Tidy Text Mining with R” [ http://tidytextmining.com/ ].
    So far it has all been OK, but I need to get the data files “tweets_julia.csv” [Julia Silge] and “tweets_dave.csv” [David Robinson]. Any chance of getting those files from you?
    Regards,

    HA

    • tylerrinker says:

      I suggest you ask the authors. Both are receptive people who are active on twitter.

      • Max says:

        I agree, this is a common problem. I think they would be very happy to find a solution within ggplot2.

        Thanks for the workaround anyway. I had the same problem a few weeks ago and this solution is very good.

  5. Ram Thapa says:

    I have used a similar workaround in the past for a similar situation. I think you could achieve this without the grouping and ungrouping in comments #1 and #3 by using the facet and grouping variables in the arrange function itself. This way you don’t have to reverse the levels of the factor you create in comment #4.

    negated_words %>%
    mutate(contribution = n * score) %>%
    mutate(word2 = reorder(word2, contribution)) %>%
    group_by(word1) %>%
    top_n(10, abs(contribution)) %>%
    # group_by(word1, word2) %>% #1
    arrange(word1, contribution > 0, contribution) %>% #2
    # ungroup() %>% #3
    mutate(word2 = factor(paste(word2, word1, sep = "__"), levels = paste(word2, word1, sep = "__"))) %>% #4
    ggplot(aes(word2, contribution, fill = n * score > 0)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    facet_wrap(~ word1, scales = "free") +
    xlab("Words preceded by negation") +
    ylab("Sentiment score * # of occurrences") +
    theme_bw() +
    coord_flip() +
    scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) #5

  6. Daniel Aiken says:

    Referring to the “# Create plot” section, could you have changed “group_by(word1, word2)” to “group_by(word1)” or “group_by(word2)”, or left it out completely, and still obtained the correct order? I removed “group_by(word1, word2)” completely to experiment with the grouping function and got the same order. I am not sure if this was by luck or because the “negated_words” data frame was still grouped from the “group_by(word1)” line under mutate.
