My friend, Steve Simpson, introduced me to Philippe Massicotte and Dirk Eddelbuettel's GTrendsR GitHub package this week. It's a pretty nifty wrapper to the Google Trends API that lets you track how search interest in a phrase changes over time. The trend indices it returns are explained in more detail here: https://support.google.com/trends/answer/4355164?hl=en
Ever have a toy you know is super cool but don't know what to use it for yet? That's GTrendsR for me. So I made up an activity related to my own interests (click HERE to download just the R code for this post). I decided to choose the first 10 phrases I could think of related to my field, literacy, and then used GTrendsR to view how Google search trending has changed for these terms. Here are the 10 biased terms I chose:
- reading assessment
- common core
- reading standards
- phonics
- whole language
- lexile score
- balanced approach
- literacy research association
- international reading association
- multimodal
The last term did not receive enough hits to trend, which is telling: the field is talking about multimodality, but search volume doesn't seem to be affected to the point of registering with Google Trends.
Getting Started
The GTrendsR package provides great tools for grabbing the information from Google; however, for my own task I wanted simpler tools to grab certain chunks of information easily and format them in a tidy way. So I built a small wrapper package, mostly for myself, that will likely remain a GitHub-only package: https://github.com/trinker/gtrend
You can install it yourself (we'll use it in this post) and load all the necessary packages via:
devtools::install_github("dvanclev/GTrendsR")
devtools::install_github("trinker/gtrend")
library(gtrend); library(dplyr); library(ggplot2); library(scales)
The Initial Search
When you perform the search with gtrend_scraper, you will need to enter your Google user name and password.
I did an initial search and plotted the trends for the 9 terms. It was a big, colorful, clustery mess.
terms <- c("reading assessment", "common core", "reading standards",
    "phonics", "whole language", "lexile score", "balanced approach",
    "literacy research association", "international reading association")

out <- gtrend_scraper("your@gmail.com", "password", terms)

out %>%
    trend2long() %>%
    plot()
So I faceted each of the terms out to look at the trends.
out %>%
    trend2long() %>%
    ggplot(aes(x = start, y = trend, color = term)) +
        geom_line() +
        facet_wrap(~term) +
        guides(color = FALSE)
Some interesting patterns began to emerge. I noticed a repeated pattern in almost all of the educational terms, which I thought interesting; we'll explore that first. The basic shape wasn't yet discernible, so I took a small subset of one term, reading+assessment, to explore the trend line by year:
names(out)[1]
## [1] "reading+assessment"
dat <- out[[1]][["trend"]]
colnames(dat)[3] <- "trend"

dat2 <- dat[dat[["start"]] > as.Date("2011-01-01"), ]

rects <- dat2 %>%
    mutate(year = format(as.Date(start), "%y")) %>%
    group_by(year) %>%
    summarize(xstart = as.Date(min(start)), xend = as.Date(max(end)))

ggplot() +
    geom_rect(data = rects, aes(xmin = xstart, xmax = xend,
        ymin = -Inf, ymax = Inf, fill = factor(year)), alpha = 0.4) +
    geom_line(data = dat2, aes(x = start, y = trend), size = .9) +
    scale_x_date(labels = date_format("%m/%y"),
        breaks = date_breaks("month"),
        expand = c(0, 0),
        limits = c(as.Date("2011-01-02"), as.Date("2014-12-31"))) +
    theme(axis.text.x = element_text(angle = -45, hjust = 0))
What I noticed was that for each year there was a general double hump pattern that looked something like this:
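The original figure for this shape is not included here, but the general form can be sketched with made-up values: two humps within a school year, with dips around winter break and summer break. The numbers below are purely illustrative, not real trend data.

```r
library(ggplot2)
library(scales)

# Hypothetical values sketching the "double hump" shape (not real data):
# a dip over summer, a peak when school resumes, and a dip at winter break
months <- seq(as.Date("2011-01-01"), as.Date("2011-12-01"), by = "month")
trend  <- c(55, 60, 62, 58, 45, 35, 40, 65, 70, 68, 50, 30)

ggplot(data.frame(months, trend), aes(x = months, y = trend)) +
    geom_line(size = 1) +
    scale_x_date(labels = date_format("%b")) +
    labs(x = NULL, y = "trend")
```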
This pattern holds consistently across educational terms. I added some context to a smaller subset to help with the narrative:
dat3 <- dat[dat[["start"]] > as.Date("2010-12-21") &
    dat[["start"]] < as.Date("2012-01-01"), ]

ggplot() +
    geom_line(data = dat3, aes(x = start, y = trend), size = 1.2) +
    scale_x_date(labels = date_format("%b %y"),
        breaks = date_breaks("month"),
        expand = c(0, 0)) +
    theme(axis.text.x = element_text(angle = -45, hjust = 0)) +
    theme_bw() +
    theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank()) +
    ggplot2::annotate("text", x = as.Date("2011-01-15"), y = 50,
        label = "Winter\nBreak Ends") +
    ggplot2::annotate("text", x = as.Date("2011-05-08"), y = 70,
        label = "Summer\nBreak\nAcademia") +
    ggplot2::annotate("text", x = as.Date("2011-06-15"), y = 76,
        label = "Summer\nBreak\nTeachers") +
    ggplot2::annotate("text", x = as.Date("2011-08-18"), y = 63,
        label = "Academia\nReturns") +
    ggplot2::annotate("text", x = as.Date("2011-08-17"), y = 78,
        label = "Teachers\nReturn") +
    ggplot2::annotate("text", x = as.Date("2011-11-17"), y = 61,
        label = "Thanksgiving")
Of course, this is all me trying to line up dates with educational search terms in a way that makes sense; it's a hypothesis rather than a firm conclusion. If this visual model is correct, though, that these events affect Google searches around educational terms, and if a Google search is an indication of work to advance understanding of a concept, it's clear that folks aren't too interested in doing much advancing of educational knowledge at Thanksgiving and Christmas time. These are, of course, big assumptions. But if they're true, the implications extend further: perhaps the most fertile time to engage educators, education students, and educational researchers is the first month after summer break.
Second Noticing
I also noticed that the two major literacy organizations are both in a downward trend.
out %>%
    trend2long() %>%
    filter(term %in% c("literacy+research+association",
        "international+reading+association")) %>%
    as.trend2long() %>%
    plot() +
        guides(color = FALSE) +
        ggplot2::annotate("text", x = as.Date("2011-08-17"), y = 60,
            label = "International\nReading\nAssociation", color = "#F8766D") +
        ggplot2::annotate("text", x = as.Date("2006-01-17"), y = 38,
            label = "Literacy\nResearch\nAssociation", color = "#00BFC4") +
        theme_bw() +
        stat_smooth()
I wonder what might be causing the downward trend. I also notice the trends for the two associations are growing apart, with the International Reading Association being affected less. Can this downward trend be reversed?
Associated Terms
Lastly, I want to look at some term uses across time and see if they correspond with what I know to be historical events around literacy in education.
out %>%
    trend2long() %>%
    filter(term %in% names(out)[1:7]) %>%
    as.trend2long() %>%
    plot() +
        scale_colour_brewer(palette = "Set1") +
        facet_wrap(~term, ncol = 2) +
        guides(color = FALSE)
This made me want to group the following 4 terms together, as there's near-perfect overlap in their trends. I don't have a plausible historical explanation for this. Hopefully, a more knowledgeable other can fill in the blanks.
out %>%
    trend2long() %>%
    filter(term %in% names(out)[c(1, 3, 5, 7)]) %>%
    as.trend2long() %>%
    plot()
I explored the three remaining terms in the graph below. As expected, ‘common core’ and ‘lexile’ (scores associated with quantitative measures of text complexity) are on an upward trend. Phonics on the other hand is on a downward trend.
out %>%
    trend2long() %>%
    filter(term %in% names(out)[c(2, 4, 6)]) %>%
    as.trend2long() %>%
    plot()
This was a fun exploratory use of the GTrendsR package. Thanks to Steve Simpson for the introduction to GTrendsR, and to Philippe Massicotte and Dirk Eddelbuettel for sharing their work.
*Created using the reports package
Is it extremely easy to get an account rate-limited doing this?
I tried a search for three terms and it worked like a charm; I tried another for six terms and only got [1] "No Trends data for data+mining - substituting NA series..." returns from there on.
@Matthew, there definitely is a daily limit, but it's a bit larger than what you describe. As far as I have experienced, no account rate limit was imposed for this activity. Is it possible the terms you searched for didn't reach the view threshold required for the API to return results? You could check this by rerunning the code included above, which we know works.
I think they actually banned my IP for “suspicious activity” because I used a new google account to test it, I got a warning to change my passwords later on…. The ban does not seem to be lifted yet after ~16 hours, so beware.
Have you read about any methods for comparing these trend lines, other than visually? I would be very interested in hearing!
No but this is far outside of my realm of expertise. I believe the field of economics has ways to compare trends. But I too would be interested in knowing more about this. If other folks know of anything please share.
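As a non-rigorous starting point (outside the rigor an economist would bring), one could at least quantify co-movement and direction with base R: correlate the two series and compare fitted linear slopes. The series below are synthetic, purely for illustration.

```r
# Synthetic example: two declining trend series (not real Google Trends data)
set.seed(42)
t <- 1:100
series_a <- 50 - 0.2 * t + rnorm(100, sd = 3)  # steeper decline
series_b <- 45 - 0.1 * t + rnorm(100, sd = 3)  # shallower decline

# How strongly the two series move together
cor(series_a, series_b)

# Estimated slope (trend direction/steepness) of each series
coef(lm(series_a ~ t))["t"]
coef(lm(series_b ~ t))["t"]
```

More formal approaches (e.g., time-series decomposition or tests for structural breaks) exist, but this gives a quick numeric summary beyond eyeballing.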
Great article! I keep getting an error saying that %>% is not a valid function. Is there any remedy for this?
If you load the dplyr package, this is available. It's a chaining (pipe) function imported from the magrittr package.
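To illustrate: the pipe passes the left-hand side as the first argument of the function on the right, so these two calls are equivalent.

```r
library(dplyr)  # makes %>% available (re-exported from magrittr)

# Traditional nested call
head(mtcars, 2)

# Equivalent piped call: mtcars is passed as head()'s first argument
mtcars %>% head(2)
```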
Excellent! Thank you!
Very nice. Is it possible to change the geography here? For example, to focus only on the US.
Actually let me rephrase that question. Is it possible to use geography other than country codes? Thanks!
Great article! I keep getting this error though when using gtrend_scraper, any ideas?
Error in as.character(x) :
  cannot coerce type 'closure' to vector of type 'character'
See: https://github.com/trinker/gtrend/issues/3
Very nice article. I have a slightly different task: comparing trends for different terms, like here: http://www.google.com/trends/explore#q=roma%2C%20milano&cmpt=q&tz=
Do you know how I can do that?
What if your gmail account requires a verification code in addition to a password. How would you enter that? Or should I instead create a new gmail account that doesn’t require a verification code?
Hi Tyler,
In this post, you pulled each of these terms separately from Google Trends. This is different from querying them all together, where you're able to see the relative popularity of the terms compared to each other, which is what I think you wanted to do. Have you been able to find a package that does this? In other words, I'd type in the same keywords you did and get back one CSV file with the relative popularity of each term.
Here is an example of what graph I’d like to see when running multiple queries: https://www.google.com/trends/explore#q=Windows%20XP%2C%20Windows%20Vista%2C%20Windows%207%2C%20Windows%208%2C%20Windows%2010&cmpt=q&tz=Etc%2FGMT%2B7
Also, if you haven't found a reason for the negative trend in the "IRA" and "LRA" searches, let me offer a suggestion. In 2004, computer users had a much more academic bent (due to access to computers in universities, socioeconomic status, "techy-er" users, etc.), so with more of the populace getting access to the internet and computers, people in 2010 were searching for more (and different) things than in 2004.
Thanks for the post. Seems really helpful. One thing I wasn't able to find is how to get 3 months of daily data, or data from different time slots.
Does anyone know how to have the gtrends query be specific for the last 7 days, which will give you results by day?
Seems like a stackoverflow.com question. Try asking there.
Hi,
I executed the command:

out <- gtrend_scraper("myaddress@gmail.com", "mypassword", "winter")

I got the trend data, but where is the regions data? I could not get it with out$regions. Can anyone tell me how to do this?
Thank you.
There have been major developments in the gtrendsR package that really make my own wrapper obsolete: http://dirk.eddelbuettel.com/blog/2015/11/29/ I'd suggest you invest your time in gtrendsR: https://github.com/PMassicotte/gtrendsR