pacman Ver 0.4.1 Release

It was just over a year ago that Dason Kurkiewicz and I released pacman to CRAN.  We have been developing the package on GitHub over the past 14 months and are pleased to announce that these changes have made their way to CRAN in version 0.4.1.


What Does pacman Do?

pacman is an R package management tool that combines the functionality of base R's library-related functions into intuitively named functions. It is ideally added to .Rprofile to speed up your workflow: it reduces time spent recalling obscurely named functions, shortens code, and integrates base functions so multiple actions can be performed at once.
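For a flavor of the naming scheme, here is a rough correspondence between a few base idioms and their pacman counterparts (a quick sketch, not an exhaustive list):

library(pacman)

p_load(lattice)       ## install (if missing) then load: install.packages() + library()
p_unload(lattice)     ## unload: detach("package:lattice", unload = TRUE)
p_functions(lattice)  ## list exported functions: ls("package:lattice")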

We really wish people would use pacman when sharing code (blog posts and help lists/boards).  The reason is selfish.  Often we're trying out a poster's code that uses ten packages we don't have installed, which means stopping to install each one before we can run it.  To add insult to injury, multiple library calls make the script less readable.

Sing it with us…

Imagine there’s no install.packages
It’s easy if you try
No multiple library calls
Above us only sky
Imagine all the coders
Using pacman today…

Skip to the bottom where we demo what this coders' utopia looks like.

What’s New in Version 0.4.1?

Here are a few of the most notable highlights.

  • Support for Bioconductor packages added, compliments of Keith Hughitt.
  • p_boot added to generate a string for the standard pacman script header that, when added to scripts, will ensure pacman is installed before attempting to use it. pacman will attempt to copy this string (the standard script header) to the clipboard for easy cutting and pasting.
  • p_install_version_gh and p_load_current_gh added as partners to p_install_version for GitHub packages. Thanks to Steve Simpson for this. A quick sketch of these additions follows.
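Here's a minimal sketch of the new helpers (the GitHub calls are commented and illustrative; "trinker/wakefield" stands in for any user/repo slug, and the p_install_version_gh signature is assumed):

library(pacman)

## Generate the standard script header (and copy it to the clipboard if possible):
p_boot()
## if (!require("pacman")) install.packages("pacman")

## Assumed usage of the new GitHub helpers:
## p_install_version_gh("trinker/wakefield", "0.1.0")
## p_load_current_gh("trinker/wakefield")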

Example Use

We will examine pacman's popularity over the last 14 months.  We implore readers to spread the package further by using it in scripts posted online.

This script uses pacman to check for, install, and load the four required packages, all with two easy lines of code.  The first line (compliments of p_boot) just makes sure pacman is installed.  The latter checks for, installs, and loads the packages.  It's pretty nice to just run a script, isn't it?

if (!require("pacman")) install.packages("pacman")
pacman::p_load(cranlogs, dplyr, ggplot2, scales)

package <- "pacman"
color <- "#26B8A6"
hjust <- -.069
start <- "2015-02-01"

lvls <- format(seq(as.Date(start), to = Sys.Date(), by = "month"), format = "%b %y")


dat <- cran_downloads(packages=package, from=start, to = Sys.Date()) %>%
    tbl_df() %>%
    select(-package) %>%
    mutate(
      date = as.POSIXct(date),
      month = factor(format(date, format = "%b %y"), levels = lvls)
    ) %>%
    na.omit() %>%
    rename(timestamp = date)


aggregated <- dat %>%
  group_by(month) %>%
  summarize(n=sum(count), mean=mean(count), sd=sd(count)) %>%
      filter(n > 0)

aggregated  %>%
      ggplot(aes(month, n, group=1)) +
          geom_path(size=4, color=color) + 
          geom_point(size=8, color=color) +    
          geom_point(size=3, color="white") + 
          theme_bw() +
          #ggplot2::annotate("segment", x=-Inf, xend=-Inf, y=-Inf, yend=Inf, color="grey70") +
          labs(
              x = NULL, #"Year-Month", 
              y = NULL, #"Downloads", 
              title = sprintf("Monthly RStudio CRAN Downloads for %s", package)
          ) +
          theme(
              text=element_text(color="grey50"),
              panel.grid.major.x = element_blank(),
              panel.border = element_blank(), 
              axis.line = element_line(),
              axis.ticks.x = element_line(color='grey70'),
              axis.ticks.y = element_blank(),
              plot.title = element_text(hjust = hjust, face="bold", color='grey50')
          ) + 
          scale_y_continuous(
              expand = c(0, 0), 
              limits = c(0, max(aggregated$n)*1.15), 
              labels = comma 
          )

 

[Figure: monthly RStudio CRAN downloads for pacman]

The script is customizable for any package.  Here we view a few more packages' usage (some of the ones I've been enjoying as of late). Oh, and you can install and load all of these packages via:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(googleformr, googlesheets, dplyr, text2vec, waffle, colourlovers, curl, plotly)
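If you'd rather not re-run the script once per package, note that cranlogs can pull several packages in one call. Here's a hedged sketch (a plain faceted plot standing in for the styled monthly plot above):

pacman::p_load(cranlogs, ggplot2)

multi <- cran_downloads(
    packages = c("googleformr", "googlesheets", "waffle", "plotly"),
    from = "2015-02-01", to = Sys.Date()
)

ggplot(multi, aes(date, count)) +
    geom_line(color = "#26B8A6") +
    facet_wrap(~package, scales = "free_y")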

[Figures: monthly RStudio CRAN download plots for each of the packages above]


How do I re-arrange??: Ordering a plot re-revisited

Several years back I wrote a two-part blog series in response to seeing questions about plotting and reordering on listservs, talkstats.com, and stackoverflow.  Part I discussed the basics of reordering plots by reordering factor levels.  The essential gist was:

So if you catch yourself using “re-arrange”/”re-order” and “plot” in a question think…factor & levels

Part II undertook re-ordering as a means of more easily seeing patterns in layouts such as bar plots & dot plots.

Well, there is at least one case in which reordering factor levels doesn't help to reorder a plot.  This post describes this ggplot2-based problem and outlines the way to overcome it.  You can get just the code here.

The Problem

In a stacked ggplot2 plot the fill ordering is not controlled by factor levels.  Such plots include stacked bar and area plots.  Here is a demonstration of the problem.

Load Packages

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2)

Generate Data

Here I generate a data set containing a time series element (Month), counts (Count), and a leveling variable (Level).  The counts are transformed to proportions and the Level variable is converted to a leveled factor with the order  “High”,  “Medium”, “Low”.  This leveling is key to the problem as it will be used as the fill variable.  It is here that reordering the factor levels will not work to reorder the plot.

dat <- data_frame( 
    Month = rep(sort(month.abb), each = 3), 
    Count  = sample(10000:60000, 36), 
    Level = rep(c("High", "Low", "Medium"), 12) 
) %>%
    mutate(
        Level = factor(Level, levels = c("High", "Medium", "Low")),
        Month = factor(Month, levels = month.abb)
    ) %>%
    group_by(Month) %>%
    mutate(Prop = Count/sum(Count))

Plot a Stacked Area Plot

Next we generate the area plot.  The accompanying plot demonstrates the problem.  Notice that the legend is ordered according to the factor levels in the Level variable (“High”,  “Medium”, “Low”) yet the plot fill ordering is not in the expected order (it is “Medium”, “Low”, “High”).  I arranged the factor levels correctly but the plot fill ordering is not correct.  How then can I correctly order a stacked ggplot2 plot?

dat %>%
    ggplot(aes(x=as.numeric(Month), y=Prop)) +
        geom_area(aes(fill= Level), position = 'stack') +
        scale_x_continuous(breaks = 1:12, labels = month.abb) +
        scale_fill_brewer(palette = "YlOrBr")

[Figure: stacked area plot whose fill order does not match the legend]

The Solution

Reorder the Stacked Area Plot

It seems ggplot2 orders the plot itself by the order in which the levels are consumed.  That means we need to reorder the data itself (the rows), not the factor levels, in order to reorder the plot.  I use the arrange function from the dplyr package to reorder the data so that ggplot2 encounters the data levels in the correct order and thus plots as expected.  Note that base R's order can be used to reorder the data rows as well (a sketch follows the corrected plot).

In the plot we can see that the plot fill ordering now matches the legend and factor level ordering as expected.

dat %>%
    arrange(desc(Level)) %>%
    ggplot(aes(x=as.numeric(Month), y=Prop)) +
        geom_area(aes(fill= Level), position = 'stack') +
        scale_x_continuous(breaks = 1:12, labels = month.abb) +
        scale_fill_brewer(palette = "YlOrBr")

[Figure: stacked area plot with fill order matching the legend]
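For completeness, here is the same fix using base R's order in place of dplyr::arrange (a minimal sketch; it yields an identical plot):

dat[order(dat$Level, decreasing = TRUE), ] %>%
    ggplot(aes(x=as.numeric(Month), y=Prop)) +
        geom_area(aes(fill= Level), position = 'stack') +
        scale_x_continuous(breaks = 1:12, labels = month.abb) +
        scale_fill_brewer(palette = "YlOrBr")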

This blog post has outlined a case where reordering the factor levels does not reorder the plot and how to address the issue.


Cracking Safe Cracker with R

My wife got me a Safe Cracker 40 puzzle a while back, and I misplaced the solution some time ago. The company, Creative Crafthouse, stands behind their products. They had amazing customer service and promptly supplied me with a solution. I'd supply the actual wheels as a cut-out paper version, but they are the company's property, so this blog will be more enjoyable if you buy yourself a Safe Cracker 40 as well (I have no affiliation with the company; I just enjoy their products and they have great customer service). Here's what the puzzle looks like:

There are 16 columns of 4 rows. The goal is to line up the dials so that all columns sum to 40. It is somewhat difficult to explain how the puzzle moves, but each dial controls two rows. The outer row of a dial is notched and only covers every other cell of the row below. The outermost row does not have a notched row covering it. I believe there are 16^3 = 4,096 possible combinations (relative to the fixed outer row). I think it's best to understand the logic by watching the video:

I enjoy puzzles, but after a year I still hadn't solved this one. It begged for a computer solution, so I decided to use R to brute force it a bit. To me the computer challenge was pretty fun in itself.

Here are the dials. The NAs represent the notches in the notched dials. I used a list structure because it helped me sort things out. Anything in the same list moves together, though the elements are not the same row. Row a is the outermost wheel. Both b and b_i make up the next row, and so on.

L1 <- list(#outer
    a = c(2, 15, 23, 19, 3, 2, 3, 27, 20, 11, 27, 10, 19, 10, 13, 10),
    b = c(22, 9, 5, 10, 5, 1, 24, 2, 10, 9, 7, 3, 12, 24, 10, 9)
)
L2 <- list(
    b_i = c(16, NA, 17, NA, 2, NA, 2, NA, 10, NA, 15, NA, 6, NA, 9, NA),
    c = c(11, 27, 14, 5, 5, 7, 8, 24, 8, 3, 6, 15, 22, 6, 1, 1)
)
L3 <- list(
    c_j = c(10, NA, 2,  NA, 22, NA, 2,  NA, 17, NA, 15, NA, 14, NA, 5, NA),
    d = c( 1,  6,  10, 6,  10, 2,  6,  10, 4,  1,  5,  5,  4,  8,  6,  3) #inner wheel
)
L4 <- list(#inner wheel
    d_k = c(6, NA, 13, NA, 3, NA, 3, NA, 6, NA, 10, NA, 10, NA, 10, NA)
)

This is a brute force method but it is still pretty quick. I made a shift function to treat vectors like circles, or in this case, dials. shift(x, n) rotates a vector n positions to the left, so shifting this 10-element vector by 9 positions amounts to one rotation to the right:

"A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

results in:

"J" "A" "B" "C" "D" "E" "F" "G" "H" "I" 

I use some indexing of the NAs to overwrite the notched dials onto each of the top three rows.

shift <- function(x, n){
    if (n == 0) return(x)
    c(x[(n+1):length(x)], x[1:n])
}
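
## Quick sanity check of shift -- one rotation to the right, as shown above:
shift(LETTERS[1:10], 9)
##  [1] "J" "A" "B" "C" "D" "E" "F" "G" "H" "I"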

dat <- NULL
m <- FALSE

for (i in 0:15){ 
    for (j in 0:15){
        for (k in 0:15){

            # Column 1
            c1 <- L1[[1]]  

            # Column 2
            c2 <- L1[[2]]  
            c2b <- shift(L2[[1]], i)
            c2[!is.na(c2b)]<- na.omit(c2b)

            # Column 3
            c3 <- shift(L2[[2]], i)
            c3b <- shift(L3[[1]], j)
            c3[!is.na(c3b)]<- na.omit(c3b)

            # Column 4
            c4 <- shift(L3[[2]], j)
            c4b <- shift(L4[[1]], k)
            c4[!is.na(c4b)]<- na.omit(c4b)

            ## Check and see if all rows add up to 40
            m <- all(rowSums(data.frame(c1, c2, c3, c4)) %in% 40)

            ## If all rows are 40 print the solution and assign to dat
            if (m){
                assign("dat", data.frame(c1, c2, c3, c4), envir=.GlobalEnv)
                print(data.frame(c1, c2, c3, c4))
                break
            }
            if (m) break
        }    
        if (m) break
    }
    if (m) break
}

Here’s the solution:

   c1 c2 c3 c4
1   2  6 22 10
2  15  9  6 10
3  23  9  2  6
4  19 10  1 10
5   3 16 17  4
6   2  1 27 10
7   3 17 15  5
8  27  2  5  6
9  20  2 14  4
10 11  9  7 13
11 27  2  5  6
12 10  3 24  3
13 19 10 10  1
14 10 24  3  3
15 13 15  2 10
16 10  9 15  6

We can check dat (I wrote the solution to the global environment) with rowSums:

 rowSums(dat)
 [1] 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40

A fun exercise for me. If anyone has a more efficient and/or less code-intensive solution I'd love to hear about it.
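In that spirit, here is one hedged sketch of my own (reusing the shift function and dial lists above): enumerate all 16^3 shift combinations with expand.grid and scan them without the nested breaks.

combos <- expand.grid(i = 0:15, j = 0:15, k = 0:15)

check_combo <- function(i, j, k) {
    c1 <- L1[[1]]
    c2 <- L1[[2]]
    c2b <- shift(L2[[1]], i); c2[!is.na(c2b)] <- na.omit(c2b)
    c3 <- shift(L2[[2]], i)
    c3b <- shift(L3[[1]], j); c3[!is.na(c3b)] <- na.omit(c3b)
    c4 <- shift(L3[[2]], j)
    c4b <- shift(L4[[1]], k); c4[!is.na(c4b)] <- na.omit(c4b)
    all(c1 + c2 + c3 + c4 == 40)
}

hits <- mapply(check_combo, combos$i, combos$j, combos$k)
combos[hits, ]  ## the winning shift combination(s)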


Wakefield: Random Data Set (Part II)

This post is part II of a series detailing the GitHub package, wakefield, for generating random data sets. The First Post (part I) was a test run to gauge user interest. I received positive feedback and some ideas for improvements, which I’ll share below.

The post is broken into the following sections:

  1. Brief Package Description
  2. Improvements
  3. Table of Variable Functions
  4. Possible Uses
  5. Getting Involved

You can view just the R code HERE or a PDF version HERE

1 Brief Package Description

First we’ll use the pacman package to grab the wakefield package from GitHub and then load it as well as the handy dplyr package.

if (!require("pacman")) install.packages("pacman"); library(pacman)
p_install_gh("trinker/wakefield")
p_load(dplyr, wakefield)

The main function in wakefield is r_data_frame. It takes n (the number of rows) and any number of variable functions that generate random columns. The result is a data frame with named, randomly generated columns. Below is an example; for details see Part I or the README.

set.seed(10)

r_data_frame(n = 30,
    id,
    race,
    age(x = 8:14),
    Gender = sex,
    Time = hour,
    iq,
    grade, 
    height(mean=50, sd = 10),
    died,
    Scoring = rnorm,
    Smoker = valid
)
## Source: local data frame [30 x 11]
## 
##    ID     Race Age Gender     Time  IQ Grade Height  Died    Scoring
## 1  01    White  11   Male 01:00:00 110  90.7     52 FALSE -1.8227126
## 2  02    White   8   Male 01:00:00 111  91.8     36  TRUE  0.3525440
## 3  03    White   9   Male 01:30:00  87  81.3     39 FALSE -1.3484514
## 4  04 Hispanic  14   Male 01:30:00 111  83.2     46  TRUE  0.7076883
## 5  05    White  10 Female 03:30:00  95  80.1     51  TRUE -0.4108909
## 6  06    White  13 Female 04:00:00  97  93.9     61  TRUE -0.4460452
## 7  07    White  13 Female 05:00:00 109  89.5     44  TRUE -1.0411563
## 8  08    White  14   Male 06:00:00 101  92.3     63  TRUE -0.3292247
## 9  09    White  12   Male 06:30:00 110  90.1     52  TRUE -0.2828216
## 10 10    White  11   Male 09:30:00 107  88.4     47 FALSE  0.4324291
## .. ..      ... ...    ...      ... ...   ...    ...   ...        ...
## Variables not shown: Smoker (lgl)

2 Improvements

2.1 Repeated Measures Series

Big thanks to Ananda Mahto for suggesting better handling of repeated measures series and providing concise code to extend this capability.

The user may now specify the same variable function multiple times and it is named appropriately:

set.seed(10)

r_data_frame(
    n = 500,
    id,
    age, age, age,
    grade, grade, grade
)
## Source: local data frame [500 x 7]
## 
##     ID Age_1 Age_2 Age_3 Grade_1 Grade_2 Grade_3
## 1  001    28    33    32    80.2    87.2    85.6
## 2  002    24    35    31    89.7    91.7    86.8
## 3  003    26    33    23    92.7    85.7    88.7
## 4  004    31    24    28    82.2    90.0    86.0
## 5  005    21    21    29    86.5    87.0    88.4
## 6  006    23    28    25    85.6    93.5    86.7
## 7  007    24    22    26    89.3    90.3    87.6
## 8  008    24    21    23    92.4    88.3    89.3
## 9  009    29    23    32    86.4    84.4    88.2
## 10 010    26    34    32    97.6    84.2    90.6
## .. ...   ...   ...   ...     ...     ...     ...

But he went further, recommending a shorthand for variable, variable, variable. The r_series function takes a variable function and j, the number of columns to generate. The series can also be renamed with the name argument:

set.seed(10)

r_data_frame(n=100,
    id,
    age,
    sex,
    r_series(gpa, 2),
    r_series(likert, 3, name = "Question")
)
## Source: local data frame [100 x 8]
## 
##     ID Age    Sex GPA_1 GPA_2        Question_1        Question_2
## 1  001  28   Male  3.00  4.00 Strongly Disagree   Strongly Agree 
## 2  002  24   Male  3.67  3.67          Disagree           Neutral
## 3  003  26   Male  3.00  4.00          Disagree Strongly Disagree
## 4  004  31   Male  3.67  3.67           Neutral   Strongly Agree 
## 5  005  21 Female  3.00  3.00             Agree   Strongly Agree 
## 6  006  23 Female  3.67  3.67             Agree             Agree
## 7  007  24 Female  3.67  4.00          Disagree Strongly Disagree
## 8  008  24   Male  2.67  3.00   Strongly Agree            Neutral
## 9  009  29 Female  4.00  3.33           Neutral Strongly Disagree
## 10 010  26   Male  4.00  3.00          Disagree Strongly Disagree
## .. ... ...    ...   ...   ...               ...               ...
## Variables not shown: Question_3 (fctr)

2.2 Dummy Coding Expansion of Factors

It is sometimes nice to expand a factor into j (number of groups) dummy coded columns. Here we see a factor version and then a dummy coded version of the same data frame:

set.seed(10)

r_data_frame(n=100,
    id,
    age,
    sex,
    political
)
## Source: local data frame [100 x 4]
## 
##     ID Age    Sex    Political
## 1  001  28   Male Constitution
## 2  002  24   Male Constitution
## 3  003  26   Male     Democrat
## 4  004  31   Male     Democrat
## 5  005  21 Female Constitution
## 6  006  23 Female     Democrat
## 7  007  24 Female     Democrat
## 8  008  24   Male   Republican
## 9  009  29 Female Constitution
## 10 010  26   Male     Democrat
## .. ... ...    ...          ...

The dummy coded version…

set.seed(10)

r_data_frame(n=100,
    id,
    age,
    r_dummy(sex, prefix = TRUE),
    r_dummy(political)
)
## Source: local data frame [100 x 9]
## 
##     ID Age Sex_Male Sex_Female Constitution Democrat Green Libertarian
## 1  001  28        1          0            1        0     0           0
## 2  002  24        1          0            1        0     0           0
## 3  003  26        1          0            0        1     0           0
## 4  004  31        1          0            0        1     0           0
## 5  005  21        0          1            1        0     0           0
## 6  006  23        0          1            0        1     0           0
## 7  007  24        0          1            0        1     0           0
## 8  008  24        1          0            0        0     0           0
## 9  009  29        0          1            1        0     0           0
## 10 010  26        1          0            0        1     0           0
## .. ... ...      ...        ...          ...      ...   ...         ...
## Variables not shown: Republican (int)

2.3 Factor to Numeric Conversion

There are times when you feel like a factor and times when you feel like an integer version. This is particularly useful with Likert-type data and other ordered factors. The as_integer function takes a data.frame and allows the user to specify the column indices (j) to convert from factor to numeric. Here I show a factor data.frame and then the integer conversion:

set.seed(10)

r_data_frame(5,
    id, 
    r_series(likert, j = 4, name = "Item")
)
## Source: local data frame [5 x 5]
## 
##   ID          Item_1   Item_2          Item_3            Item_4
## 1  1         Neutral    Agree        Disagree           Neutral
## 2  2           Agree    Agree         Neutral   Strongly Agree 
## 3  3         Neutral    Agree Strongly Agree              Agree
## 4  4        Disagree Disagree         Neutral             Agree
## 5  5 Strongly Agree   Neutral           Agree Strongly Disagree

As integers…

set.seed(10)

r_data_frame(5,
    id, 
    r_series(likert, j = 4, name = "Item")
) %>% 
    as_integer(-1)
## Source: local data frame [5 x 5]
## 
##   ID Item_1 Item_2 Item_3 Item_4
## 1  1      3      4      2      3
## 2  2      4      4      3      5
## 3  3      3      4      5      4
## 4  4      2      2      3      4
## 5  5      5      3      4      1

2.4 Viewing Whole Data Set

dplyr has a nice print method that hides excessive rows and columns. Typically this is great behavior. Sometimes, however, you want to quickly see the whole width of the data set. We can use View, but this is a bit too wide and shows all rows. The peek function shows minimal rows, truncated columns, and prints the full width for quick inspection. This is particularly nice for text strings as data. dplyr prints wide data sets like this:

r_data_frame(100,
    id, 
    name,
    sex,
    sentence    
)
## Source: local data frame [100 x 4]
## 
##     ID     Name    Sex
## 1  001   Gerald   Male
## 2  002    Jason   Male
## 3  003 Mitchell   Male
## 4  004      Joe Female
## 5  005   Mickey   Male
## 6  006   Michal   Male
## 7  007   Dannie Female
## 8  008   Jordan   Male
## 9  009     Rudy Female
## 10 010   Sammie Female
## .. ...      ...    ...
## Variables not shown: Sentence (chr)

Now use peek:

r_data_frame(100,
    id, 
    name,
    sex,
    sentence    
) %>% peek
## Source: local data frame [100 x 4]
## 
##     ID    Name    Sex   Sentence
## 1  001     Jae Female Excuse me.
## 2  002 Darnell Female Over the l
## 3  003  Elisha Female First of a
## 4  004  Vernon Female Gentlemen,
## 5  005   Scott   Male That's wha
## 6  006   Kasey Female We don't h
## 7  007 Michael   Male You don't 
## 8  008   Cecil Female I'll get o
## 9  009    Cruz Female They must 
## 10 010  Travis Female Good night
## .. ...     ...    ...        ...

2.5 Visualizing Column Types and NAs

When we build a large random data set it is nice to get a sense of the column types and the missing values. The table_heat function (also the plot method for the tbl_df class) does this. Here I'll generate a data set, add missing values (r_na), and then plot:

set.seed(10)

r_data_frame(n=100,
    id,
    dob,
    animal,
    grade, grade,
    death,
    dummy,
    grade_letter,
    gender,
    paragraph,
    sentence
) %>%
   r_na() %>%
   plot(palette = "Set1")

3 Table of Variable Functions

There are currently 66 wakefield-based variable functions to choose from when building columns. Use variables() to see them or variables(TRUE) to see them broken into variable types. Here's an HTML table version:


age dob height_in month speed
animal dummy income name speed_kph
answer education internet_browser normal speed_mph
area employment iq normal_round state
birth eye language paragraph string
car gender level pet upper
children gpa likert political upper_factor
coin grade likert_5 primary valid
color grade_letter likert_7 race year
date_stamp grade_level lorem_ipsum religion zip_code
death group lower sat
dice hair lower_factor sentence
died height marital sex
dna height_cm military smokes

4 Possible Uses

4.1 Testing Methods

I personally will use this most frequently when I'm testing out a model. For example, say you wanted to test psychometric functions, including the cor function, on a randomly generated assessment:

dat <- r_data_frame(120,
    id, 
    sex,
    age,
    r_series(likert, 15, name = "Item")
) %>% 
    as_integer(-c(1:3))

dat %>%
    select(contains("Item")) %>%
    cor %>%
    heatmap

4.2 Unique Student Data for Course Assignments

Sometimes it's nice if students each have their own data set to work with, but one in which you control the parameters. Simply supply the students with a unique integer id and they can use it inside of set.seed with a wakefield r_data_frame you've constructed for them in advance. Voilà: 25 instant data sets that are structurally the same but randomly different.

student_id <- ## INSERT YOUR ID HERE
    
set.seed(student_id)

dat <- r_data_frame(1000,
    id, 
    gender,
    religion,
    internet_browser,
    language,
    iq,
    sat,
    smokes
)    

4.3 Blogging and Online Help Communities

wakefield can make data sharing on blog posts and online help communities (e.g., TalkStats, StackOverflow) fast and accessible, with little space or cognitive effort. Use variables(TRUE) to see variable functions by class and select the ones you want:

variables(TRUE)
## $character
## [1] "lorem_ipsum" "lower"       "name"        "paragraph"   "sentence"   
## [6] "string"      "upper"       "zip_code"   
## 
## $date
## [1] "birth"      "date_stamp" "dob"       
## 
## $factor
##  [1] "animal"           "answer"           "area"            
##  [4] "car"              "coin"             "color"           
##  [7] "dna"              "education"        "employment"      
## [10] "eye"              "gender"           "grade_level"     
## [13] "group"            "hair"             "internet_browser"
## [16] "language"         "lower_factor"     "marital"         
## [19] "military"         "month"            "pet"             
## [22] "political"        "primary"          "race"            
## [25] "religion"         "sex"              "state"           
## [28] "upper_factor"    
## 
## $integer
## [1] "age"      "children" "dice"     "level"    "year"    
## 
## $logical
## [1] "death"  "died"   "smokes" "valid" 
## 
## $numeric
##  [1] "dummy"        "gpa"          "grade"        "height"      
##  [5] "height_cm"    "height_in"    "income"       "iq"          
##  [9] "normal"       "normal_round" "sat"          "speed"       
## [13] "speed_kph"    "speed_mph"   
## 
## $`ordered factor`
## [1] "grade_letter" "likert"       "likert_5"     "likert_7"

Then throw them inside of r_data_frame to make a quick data set to share.

r_data_frame(8,
    name,
    sex,
    r_series(iq, 3)
) %>%
    peek %>%
    dput

5 Getting Involved

If you're interested in getting involved, whether as a user or a contributor, you can:

  1. Install and use wakefield
  2. Provide feedback via comments below
  3. Provide feedback (bugs, improvements, and feature requests) via wakefield’s Issues Page
  4. Fork from GitHub and give a Pull Request

Thanks for reading; your feedback is welcome.


*Get the R code for this post HERE
*Get a PDF version of this post HERE


Random Data Sets Quickly

This post will discuss a recent GitHub package I'm working on, wakefield, for generating random data sets.

The post is broken into the following sections:

  1. Demo
    1.1 Random Variable Functions
    1.2 Random Data Frames
    1.3 Missing Values
    1.4 Default Data Set
  2. Future Direction
  3. Getting Involved

You can view just the R code HERE or a PDF version HERE


One of my more popular blog posts, Function To Generate A Random Data Set, was an early post about generating random data sets. Basically I had created a function to generate a random data set of various types of continuous and categorical columns. Optionally, the user could assign a certain percentage of cells in each column to missing values (NA). Often I find myself generating random data sets to test code/functions/models on, but rarely do I use that original random data generator. Why?

  1. For one, it's not in a package, so it's not handy
  2. It generates too many unrelated columns

Recently I had an idea inspired by Richie Cotton's rebus and Kevin Ushey & Jim Hester's rex regex-based packages. Basically, these packages allow the user to combine many little human-readable regular expression chunks to build a larger desired regular expression. I thought, why not apply this concept to building a random data set? I'd make mini, modular random-variable-generating functions that the user passes to a data.frame-like function, and the result is a quick data set just as desired. I also like the way dplyr makes a tbl_df that prints only a few rows and limits the number of columns, so I made the output a tbl_df object that prints accordingly.

1 Demo

1.1 Random Variable Functions

First we’ll use the pacman package to grab and load the wakefield package from GitHub.

if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load_gh("trinker/wakefield")

Then we’ll look at a random variable generating function.

race(n=10)
##  [1] White    White    White    Black    White    Hispanic Black   
##  [8] Asian    Hispanic White   
## Levels: White Hispanic Black Asian Bi-Racial Native Other Hawaiian
attributes(race(n=10))
## $levels
## [1] "White"     "Hispanic"  "Black"     "Asian"     "Bi-Racial" "Native"   
## [7] "Other"     "Hawaiian" 
## 
## $class
## [1] "variable" "factor"  
## 
## $varname
## [1] "Race"

A few more…

sex(10)
##  [1] Male   Female Male   Male   Male   Female Male   Male   Male   Male  
## Levels: Male Female
likert_7(10)
##  [1] Strongly Agree    Strongly Agree    Neutral          
##  [4] Somewhat Agree    Disagree          Disagree         
##  [7] Somewhat Disagree Neutral           Strongly Agree   
## [10] Agree            
## 7 Levels: Strongly Disagree < Disagree < ... < Strongly Agree
gpa(10)
##  [1] 3.00 3.67 2.67 3.33 3.00 4.00 3.00 3.00 3.67 3.00
dna(10)
##  [1] "Adenine"  "Thymine"  "Thymine"  "Thymine"  "Adenine"  "Cytosine"
##  [7] "Guanine"  "Thymine"  "Thymine"  "Guanine"
string(10, length = 5)
##  [1] "L3MPu" "tyTgQ" "mqBWh" "uGnch" "6KKZC" "DdLrw" "t2lEJ" "Hir6Y"
##  [9] "eE4v9" "oPb4u"

1.2 Random Data Frames

OK, so modular chunks are great… but they get more powerful inside of the r_data_frame function. The user only needs to supply n once, and the column names are auto-generated by the function (or set explicitly in the usual Name = function form, e.g., Start = hour below). The call parentheses are not even needed if no other arguments are passed.

set.seed(10)

r_data_frame(
    n = 500,
    id,
    race,
    age,
    smokes,
    marital,
    Start = hour,
    End = hour,
    iq,
    height,
    died
)
## Source: local data frame [500 x 10]
## 
##     ID     Race Age Smokes       Marital    Start      End  IQ Height
## 1  001    White  33  FALSE       Married 00:00:00 00:00:00  95     62
## 2  002    White  35  FALSE Never Married 00:00:00 00:00:00  94     69
## 3  003    White  33  FALSE     Separated 00:00:00 00:00:00 112     71
## 4  004 Hispanic  24  FALSE       Married 00:00:00 00:00:00  97     65
## 5  005    White  21  FALSE Never Married 00:00:00 00:00:00  89     74
## 6  006    White  28  FALSE       Married 00:00:00 00:00:00  93     67
## 7  007    White  22  FALSE       Married 00:00:00 00:00:00 113     66
## 8  008    White  21  FALSE Never Married 00:00:00 00:00:00 115     69
## 9  009    White  23  FALSE      Divorced 00:00:00 00:00:00  85     74
## 10 010    White  34  FALSE      Divorced 00:00:00 00:00:00 110     71
## .. ...      ... ...    ...           ...      ...      ... ...    ...
## Variables not shown: Died (lgl)

This r_data_frame is pretty awesome and not my own. Josh O'Brien wrote the function as seen HERE. Pretty nifty trick. Josh, thank you for your help with bringing the concept to fruition.

1.3 Missing Values

The original blog post provided a means for adding missing values. wakefield keeps this alive and adds more flexibility. It is no longer a part of the data generation process but a separate function, r_na, called after the data set has been generated. The user can specify which columns to add NAs to; by default column 1 is excluded. This works nicely within a dplyr/magrittr pipeline. Note: dplyr has an id function as well, so the prefix wakefield:: must be used for id.

p_load(dplyr)
set.seed(10)

r_data_frame(
    n = 30,
    id,
    state,
    month,
    sat,
    minute,
    iq,
    zip_code,
    year,
    Scoring = rnorm,
    Smoker = valid,
    sentence
) %>%
    r_na(prob=.25)
## Source: local data frame [30 x 11]
## 
##    ID      State     Month  SAT   Minute  IQ   Zip Year     Scoring Smoker
## 1  01    Georgia      July 1315 00:03:00 106    NA   NA          NA     NA
## 2  02    Florida  February 1492 00:04:00 107 87108 2007  0.64350004   TRUE
## 3  03       Ohio     March 1597       NA  83 58653 2012 -1.36030614   TRUE
## 4  04         NA  November 1518 00:07:00  NA 50381 1999 -0.19850611   TRUE
## 5  05 California      June 1362 00:08:00 111 58123 1996          NA  FALSE
## 6  06   New York September 1356       NA  87 18479 2010  2.06820961   TRUE
## 7  07         NA        NA   NA       NA 111 97135 2007 -0.30528475  FALSE
## 8  08    Florida  December 1324 00:15:00  NA 99438 2010  0.28124561   TRUE
## 9  09 Washington        NA 1468 00:16:00  97 97135 1996          NA   TRUE
## 10 10       Ohio      July   NA 00:20:00  NA 58123 2014  0.04636144     NA
## .. ..        ...       ...  ...      ... ...   ...  ...         ...    ...
## Variables not shown: Sentence (chr)

1.4 Default Data Set

There’s still a default data set function, r_data, in case the functionality of the original random data generation function is missed or if you’re in a hurry and aren’t too picky about the data being generated.

set.seed(10)

r_data(1000)
## Source: local data frame [1,000 x 8]
## 
##      ID     Race Age    Sex     Hour  IQ Height  Died
## 1  0001    White  32 Female 00:00:00  91     63 FALSE
## 2  0002    White  31   Male 00:00:00  92     69  TRUE
## 3  0003    White  23 Female 00:00:00  94     67 FALSE
## 4  0004 Hispanic  28 Female 00:00:00 102     63 FALSE
## 5  0005    White  29 Female 00:00:00 103     74  TRUE
## 6  0006    White  25   Male 00:00:00  96     68  TRUE
## 7  0007    White  26 Female 00:00:00 115     70 FALSE
## 8  0008    White  23   Male 00:00:00 119     66  TRUE
## 9  0009    White  32   Male 00:00:00 107     74  TRUE
## 10 0010    White  32   Male 00:00:00 104     71  TRUE
## ..  ...      ... ...    ...      ... ...    ...   ...

2 Future Direction

Where will the wakefield package go from here? Well, this blog post is a measure of public interest. I use it, and at this point it lives on GitHub. I'd like interest in two ways: (a) users and (b) contributors. Users make the effort worthwhile and provide feedback and suggested improvements. Contributors make maintenance easier.

There is one area of improvement I'd like to see in the r_data_frame (r_list) functions. I like that I don't have to specify an n for each variable/column. I also like that column names are auto-generated. And I like that dplyr's data_frame function allows me to create a variable y based on column x, so I can make columns that are correlated or any function of another column.

p_load(dplyr)
set.seed(10)

dplyr::data_frame(
    x = 1:10,
    y = x + rnorm(10)
)
## Source: local data frame [10 x 2]
## 
##     x        y
## 1   1 1.018746
## 2   2 1.815747
## 3   3 1.628669
## 4   4 3.400832
## 5   5 5.294545
## 6   6 6.389794
## 7   7 5.791924
## 8   8 7.636324
## 9   9 7.373327
## 10 10 9.743522

The user can use the modular variable functions inside of dplyr::data_frame and have this functionality, but the column name and n must be explicitly passed to each variable.

set.seed(10)

dplyr::data_frame(
    ID = wakefield::id(n=10),
    Smokes = smokes(n=10),
    Sick = ifelse(Smokes, sample(5:10, 10, TRUE), sample(0:4, 10, TRUE)),
    Death = ifelse(Smokes, sample(0:1, 10, TRUE, prob = c(.2, .8)), sample(0:1, 10, TRUE, prob = c(.7, .3)))
)
## Source: local data frame [10 x 4]
## 
##    ID Smokes Sick Death
## 1  01  FALSE    3     1
## 2  02  FALSE    2     0
## 3  03  FALSE    0     1
## 4  04  FALSE    2     0
## 5  05  FALSE    1     0
## 6  06  FALSE    2     1
## 7  07  FALSE    0     1
## 8  08  FALSE    1     0
## 9  09  FALSE    1     1
## 10 10  FALSE    4     0

I'd like to modify r_data_frame to continue to pass n and extract column names while also gaining the ability to make columns a function of other columns. Currently this is controlled by the r_list function that r_data_frame wraps.

3 Getting Involved

If you're interested in getting involved, whether as a user or a contributor, you can:

  1. Install and use wakefield
  2. Provide feedback via comments below
  3. Provide feedback (bugs, improvements, and feature requests) via wakefield’s Issues Page
  4. Fork from GitHub and give a Pull Request

Thanks for reading; your feedback is welcome.


*Get the R code for this post HERE
*Get a PDF version of this post HERE


pacman 0.2.0: Initial CRAN Release

We're pleased to announce the first CRAN release of pacman v. 0.2.0. pacman is the combined work of Dason Kurkiewicz & Tyler Rinker.

pacman is an R package management tool that combines the functionality of base R's library-related functions into intuitively named functions. It is ideally added to .Rprofile to speed up your workflow: it reduces time spent recalling obscurely named functions, shortens code, and integrates base functions so multiple actions can be performed at once.

Installing pacman

install.packages("pacman")

## May need the following if binaries haven't been built yet:
install.packages("pacman", type="source")

## Or install from GitHub via devtools:
devtools::install_github("trinker/pacman")

As this is the first release, we expect there are kinks that need to be worked out. We appreciate pull requests and issue reports.


Examples

Here are some of the functionalities the pacman authors tend to use most often:

Installing and Loading

p_load is a general-use tool that can install, load, and update packages. For example, many blog posts begin coding with this sort of package call:

packs <- c("XML", "devtools", "RCurl", "fakePackage", "SPSSemulate")
success <- suppressWarnings(sapply(packs, require, character.only = TRUE))
install.packages(names(success)[!success])
sapply(names(success)[!success], require, character.only = TRUE)

With pacman this call can be reduced to:

pacman::p_load(XML, devtools, RCurl, fakePackage, SPSSemulate)

Installing Temporarily

p_temp enables the user to temporarily install a package. This allows a session-only install for testing out a single package without muddying the user’s library.

p_temp(aprof)

Package Functions & Data

p_functions (aka p_funs) and p_data enable the user to see the functions or data sets available in an add-on package.

p_functions(pacman)
p_funs(pacman, all=TRUE)
p_data(lattice)

Vignettes

Check out pacman's vignettes.


Scheduling R Tasks via Windows Task Scheduler

This post will allow you to impress your boss with your strong work ethic by enabling Windows R users to schedule late-night tasks.  Picture it: your boss gets an email at 1:30 in the morning with the latest company data as a beautiful report.  I'm quite sure Linux and Mac users are able to do this rather easily via cron.  Windows users can do this via the Task Scheduler, and they can also interface with it from the command line.

As this is more process oriented, I have created a minimal example on GitHub and the following video rather than providing scripts in-text.  All the scripts can be accessed via https://github.com/trinker/Make_Task.  Users will need to fill in relevant information (e.g., paths, usernames, etc.) and download the necessary libraries to run the scripts.  The main point of this demonstration is to provide the reader (who is a Windows user) with a procedure for automating R tasks.
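For the command-line route mentioned above, here is a minimal hedged sketch (the task name, script path, and time are placeholders) of registering a nightly task from within R via Windows' schtasks utility:

## Build and submit a schtasks command; if Rscript.exe is not on the system
## PATH, use its full path instead.
cmd <- paste(
    'schtasks /create /tn "nightly_report"',
    '/tr "Rscript.exe C:/scripts/report.R"',
    '/sc DAILY /st 01:30'
)
system(cmd)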


Visualizing APA 6 Citations: qdapRegex 0.2.0 & qdapTools 1.1.0

qdapRegex 0.2.0 & qdapTools 1.1.0 have been released to CRAN.  This post will outline some of the packages' updates/features and provide an integrated demonstration of extracting and viewing in-text APA 6 style citations from an MS Word (.docx) document.

qdapRegex 0.2.0

The qdapRegex package is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R's own regular expression functions, or with add-on string manipulation packages such as stringr and stringi.  The qdapRegex package serves a dual purpose of being both functional and educational.

New Features/Changes

Here are a select few new features.  For a complete list of changes CLICK HERE:

  • is.regex added as a logical check of a regular expression's validity (whether it conforms to R's regular expression rules).
  • Case wrapper functions, TC (title case), U (upper case), and L (lower case) added for convenient case manipulation.
  • rm_citation_tex added to remove/extract/replace bibkey citations from a .tex (LaTeX) file.
  • regex_cheat data set and cheat function added to act as a quick reference for common regex operations such as lookaheads.
  • explain added to view a visual representation of a regular expression using http://www.regexper.com and http://rick.measham.id.au/paste/explain. Also takes named regular expressions from the regex_usa or other supplied dictionary.

The last two functions, cheat & explain, provide educational regex tools. regex_cheat provides a cheat sheet of common regex elements. explain interfaces with http://www.regexper.com & http://rick.measham.id.au/paste/explain. A minimal sketch of a few of the other additions follows.
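Here is a minimal sketch of the validity check and case wrappers described above (assuming the signatures match their descriptions):

library(qdapRegex)

is.regex("[a-z]+")        ## TRUE: a valid regular expression
is.regex("[a-")           ## FALSE: unclosed character class

TC("the quick brown fox") ## title case
U("make this loud")       ## upper case
L("MAKE THIS QUIET")      ## lower case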

qdapTools 1.1.0

qdapTools is an R package that contains tools associated with the qdap package that may be useful outside the context of text analysis.

New Features/Changes

  • loc_split added to split data forms (list, vector, data.frame, matrix) on a vector of integer locations (see the sketch after this list).
  • matrix2long makes a long format data.frame. It takes a matrix object, stacks all columns and adds identifying columns by repeating row and column names accordingly.
  • read_docx added to read in .docx documents.
  • split_vector picks up a regex argument to allow regular expression searches for break locations.
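Here is a minimal hedged sketch of loc_split and matrix2long (argument names assumed from the descriptions above):

library(qdapTools)

## Split a vector at integer locations:
loc_split(LETTERS[1:10], c(4, 8))

## Stack a matrix into long form with row/column id columns:
m <- matrix(1:6, nrow = 2, dimnames = list(c("r1", "r2"), c("a", "b", "c")))
matrix2long(m)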

Integrated Demonstration

In this demonstration we will use url_dl to grab a .docx file from the Internet. We'll then read this document in with read_docx. We'll use split_vector to split the text from the .docx into main body and references sections. rm_citation will be utilized to extract in-text APA 6 style citations. Last, we will view frequencies and a visualization of the distribution of the citations using ggplot2. For a complete script of the R code used in this blog post CLICK HERE.

First we’ll make sure we have the correct versions of the packages, install them if necessary, and load the required packages for the demonstration:

Map(function(x, y) {
    if (!x %in% list.files(.libPaths())){
        install.packages(x)   
    } else {
        if (packageVersion(x) < y) {
            install.packages(x)   
        } else {
            message(sprintf("Version of %s is suitable for demonstration", x))
        }
    }
}, c("qdapRegex", "qdapTools"), c("0.2.0", "1.1.0"))

lapply(c("qdapRegex", "qdapTools", "ggplot2", "qdap"), require, character.only=TRUE)

Now let’s grab the .docx document, read it in, and split into body/references sections:

## Download .docx
url_dl("http://umlreading.weebly.com/uploads/2/5/2/5/25253346/whole_language_timeline-updated.docx")

## Read in .docx
txt <- read_docx("whole_language_timeline-updated.docx")

## Remove non ascii characters
txt <- rm_non_ascii(txt) 

## Split into body/references sections
parts <- split_vector(txt, split = "References", include = TRUE, regex=TRUE)

## View body
parts[[1]]

## View references
parts[[2]]

Now we can extract the in-text APA 6 citations and view frequencies:

## Extract citations in order of appearance
rm_citation(unbag(parts[[1]]), extract=TRUE)[[1]]

## Extract citations by section 
rm_citation(parts[[1]], extract=TRUE)

## Frequency
left_just(cites <- list2df(sort(table(rm_citation(unbag(parts[[1]]),
    extract=TRUE)), TRUE), "freq", "citation")[2:1])

##    citation                                                   freq
## 1  Walker, 2008                                                 14
## 2  Flesch (1955)                                                 2
## 3  Adams (1990)                                                  1
## 4  Anderson, Hiebert, Scott, and Wilkinson (1985)                1
## 5  Baumann & Hoffman, 1998                                       1
## 6  Baumann, 1998                                                 1
## 7  Bond and Dykstra (1967)                                       1
## 8  Chall (1967)                                                  1
## 9  Clay (1979)                                                   1
## 10 Goodman and Goodman (1979)                                    1
## 11 McCormick & Braithwaite, 2008                                 1
## 12 Read Adams (1990)                                             1
## 13 Stahl and Miller (1989)                                       1
## 14 Stahl and Millers (1989)                                      1
## 15 Word Perception Intrinsic Phonics Instruction Gates (1951)    1

Now we can find the locations of the citations in the text and plot their distribution throughout the document:

## Distribution of citations (find locations)
cite_locs <- do.call(rbind, lapply(cites[[1]], function(x){
    m <- gregexpr(x, unbag(parts[[1]]), fixed=TRUE)
    data.frame(
        citation=x,
        start = m[[1]] -5,
        end =  m[[1]] + 5 + attributes(m[[1]])[["match.length"]]
    )
}))

## Plot the distribution
ggplot(cite_locs) +
    geom_segment(aes(x=start, xend=end, y=citation, yend=citation), size=3,
        color="yellow") +
    xlab("Duration") +
    scale_x_continuous(expand = c(0,0),
        limits = c(0, nchar(unbag(parts[[1]])) + 25)) +
    theme_grey() +
    theme(
        panel.grid.major=element_line(color="grey20"),
        panel.grid.minor=element_line(color="grey20"),
        plot.background = element_rect(fill="black"),
        panel.background = element_rect(fill="black"),
        panel.border = element_rect(colour = "grey50", fill=NA, size=1),
        axis.text=element_text(color="grey50"),    
        axis.title=element_text(color="grey50")  
    )

[Figure: distribution of in-text citations across the document]


LRA 2014- Communication Nomads: Blogging to Reclaim Our Academic Birthright


I have been asked to speak at the 2014 LRA Conference on the topic of Academic Blogging.

Time: 1:15-2:15
Location: Islands Ballroom Salon B – Lobby Level
My Slides: http://clari.buffalo.edu/blog
My Précis: http://clari.buffalo.edu/blog/materials/precis.pdf

The talk is part of a larger alternative session: Professors, We Need You!!! – Public Intellectuals, Advocacy, and Activism. This session “will engage participants in dialogue about how to transform the Literacy Research Association’s (LRA’s) role in advocacy for literacy learning and instruction among children, families, and educators through social media, open access spaces, and other channels.” Please join us if you’re at #LRA14

Session Organizer: Carla K. Meyer, Appalachian State University
Chair: William Ian O’Byrne, University of New Haven
Discussant: Norman A. Stahl, Northern Illinois University


GTrendsR package to Explore Google trending for Field Dependent Terms

My friend, Steve Simpson, introduced me to Philippe Massicotte and Dirk Eddelbuettel’s GTrendsR GitHub package this week. It’s a pretty nifty wrapper to the Google Trends API that enables one to search phrase trends over time. The trend indices that are given are explained in more detail here: https://support.google.com/trends/answer/4355164?hl=en

Ever have a toy you know is super cool but don't know what to use it for yet? That's GTrendsR for me. So I made up an activity to use it for that's related to my own interests (click HERE to download just the R code for this post). I decided to choose the first 10 phrases I could think of related to my field, literacy. I then used GTrendsR to view how Google search trending has changed for these terms. Here are the 10 biased terms I chose:

  1. reading assessment
  2. common core
  3. reading standards
  4. phonics
  5. whole language
  6. lexile score
  7. balanced approach
  8. literacy research association
  9. international reading association
  10. multimodal

The last term did not receive enough hits to trend, which is telling, since the field is talking about multimodality, but search trends don’t seem to be affected to the point of registering with Google Trends.


Getting Started

The GTrendsR package provides great tools for grabbing the information from Google; however, for my own task I wanted simpler tools to grab certain chunks of information easily and format them in a tidy way. So I built a small wrapper package, mostly for myself, that will likely remain a GitHub-only package: https://github.com/trinker/gtrend

You can install it for yourself (we'll use it in this post) and load all necessary packages via:

devtools::install_github("dvanclev/GTrendsR")
devtools::install_github("trinker/gtrend")
library(gtrend); library(dplyr); library(ggplot2); library(scales)

The Initial Search

When you perform the search with gtrend_scraper, you will need to enter your Google user name and password.

I did an initial search and plotted the trends for the 9 terms. It was a big, colorful, clustery mess.

terms <- c("reading assessment", "common core", "reading standards",
    "phonics", "whole language", "lexile score", "balanced approach",
    "literacy research association", "international reading association"
)

out <- gtrend_scraper("your@gmail.com", "password", terms)

out %>%
    trend2long() %>%
    plot() 

[Figure: trend lines for all nine terms overplotted]

So I faceted each of the terms out to look at the trends.

out %>%
    trend2long() %>%
    ggplot(aes(x=start, y=trend, color=term)) +
        geom_line() +
        facet_wrap(~term) +
        guides(color=FALSE)

[Figure: faceted trend lines, one panel per term]

Some interesting patterns began to emerge. I noticed a repeated pattern in almost all of the educational terms, so we'll explore that first. The basic shape wasn't yet discernible, so I took a small subset of one term, reading+assessment, to explore the trend line by year:

names(out)[1]
## [1] "reading+assessment"
dat <- out[[1]][["trend"]]
colnames(dat)[3] <- "trend"

dat2 <- dat[dat[["start"]] > as.Date("2011-01-01"), ]

rects <- dat2  %>%
    mutate(year=format(as.Date(start), "%y")) %>%
    group_by(year) %>%
    summarize(xstart = as.Date(min(start)), xend = as.Date(max(end)))

ggplot() +
    geom_rect(data = rects, aes(xmin = xstart, xmax = xend, ymin = -Inf, 
        ymax = Inf, fill = factor(year)), alpha = 0.4) +
    geom_line(data=dat2, aes(x=start, y=trend), size=.9) + 
    scale_x_date(labels = date_format("%m/%y"), 
        breaks = date_breaks("month"),
        expand = c(0,0), 
        limits = c(as.Date("2011-01-02"), as.Date("2014-12-31"))) +
    theme(axis.text.x = element_text(angle = -45, hjust = 0)) 

[Figure: reading+assessment trend by year, with alternating years shaded]

What I noticed was that for each year there was a general double-hump pattern that looked something like this:

[Figure: idealized double-hump yearly search pattern]

This pattern holds consistently across educational terms. I added some context to a smaller subset to help with the narrative:

dat3 <- dat[dat[["start"]] > as.Date("2010-12-21") & 
		dat[["start"]] < as.Date("2012-01-01"), ]

ggplot() +
    geom_line(data=dat3, aes(x=start, y=trend), size=1.2) + 
    scale_x_date(labels = date_format("%b %y"), 
        breaks = date_breaks("month"),
        expand = c(0,0)) +
    theme(axis.text.x = element_text(angle = -45, hjust = 0)) +
    theme_bw() + theme(panel.grid.major.y=element_blank(),
        panel.grid.minor.y=element_blank()) + 
    ggplot2::annotate("text", x = as.Date("2011-01-15"), y = 50, 
        label = "Winter\nBreak Ends") +
    ggplot2::annotate("text", x = as.Date("2011-05-08"), y = 70, 
        label = "Summer\nBreak\nAcademia") +
    ggplot2::annotate("text", x = as.Date("2011-06-15"), y = 76, 
        label = "Summer\nBreak\nTeachers") +
    ggplot2::annotate("text", x = as.Date("2011-08-18"), y = 63, 
        label = "Academia\nReturns") +
    ggplot2::annotate("text", x = as.Date("2011-08-17"), y = 78, 
        label = "Teachers\nReturn")+
    ggplot2::annotate("text", x = as.Date("2011-11-17"), y = 61, 
        label = "Thanksgiving")

[Figure: 2011 reading+assessment trend annotated with school-calendar events]

Of course this is all me trying to line up dates with educational search terms in a logical sense; it is a hypothesis rather than a firm conclusion. But if this visual model is correct, that these events impact Google searches around educational terms, and if a Google search is an indication of work to advance understanding of a concept, it's clear that folks aren't too interested in doing much advancing of educational knowledge at Thanksgiving and Christmas time. These are of course big assumptions. But if true, the implications extend further. Perhaps the most fertile time to engage educators, education students, and educational researchers is the first month after summer break.


Second Noticing

I also noticed that the two major literacy organizations are in a downward trend.

out %>%
    trend2long() %>%
    filter(term %in% c("literacy+research+association", 
        "international+reading+association")) %>%
    as.trend2long() %>%
    plot() + 
    guides(color=FALSE) +
    ggplot2::annotate("text", x = as.Date("2011-08-17"), y = 60, 
        label = "International\nReading\nAsociation", color="#F8766D")+
    ggplot2::annotate("text", x = as.Date("2006-01-17"), y = 38, 
        label = "Literacy\nResearch\nAssociation", color="#00BFC4") +
    theme_bw() +
    stat_smooth()

[Figure: downward search trends for the two literacy associations, with smoothers]

I wonder what might be causing the downward trend? Also, I notice the trends are growing apart for the two associations, with the International Reading Association being affected less. Can this downward trend be reversed?


Associated Terms

Lastly, I want to look at some term uses across time and see if they correspond with what I know to be historical events around literacy in education.

out %>%
    trend2long() %>%
    filter(term %in% names(out)[1:7]) %>%
    as.trend2long() %>%
    plot() + scale_colour_brewer(palette="Set1") +
    facet_wrap(~term, ncol=2) +
        guides(color=FALSE)

[Figure: faceted trends for the first seven terms]

This made me want to group the following 4 terms together, as there's near-perfect overlap in the trends. I don't have a plausible historical explanation for this; hopefully a more knowledgeable other can fill in the blanks.

out %>%
    trend2long() %>%
    filter(term %in% names(out)[c(1, 3, 5, 7)]) %>%
    as.trend2long() %>%
    plot() 

[Figure: overlapping trends for the four grouped terms]

I explored the three remaining terms in the graph below. As expected, 'common core' and 'lexile' (scores associated with quantitative measures of text complexity) are on an upward trend. Phonics, on the other hand, is on a downward trend.

out %>%
    trend2long() %>%
    filter(term %in% names(out)[c(2, 4, 6)]) %>%
    as.trend2long() %>%
    plot() 

[Figure: trends for common core, phonics, and lexile score]

This was a fun exploratory use of the GTrendsR package. Thanks to Steve Simpson for the introduction to GTrendsR and to Philippe Massicotte and Dirk Eddelbuettel for sharing their work.


*Created using the reports package
