pacman 0.2.0: Initial CRAN Release

We’re pleased to announce the first CRAN release of pacman v. 0.2.0. pacman is the combined work of Dason Kurkiewicz & Tyler Rinker.

pacman is an R package management tool that combines the functionality of base library-related functions into intuitively named functions. The package is ideally added to .Rprofile to speed up workflow: it reduces time spent recalling obscurely named functions, cuts down on code, and integrates the base functions so that multiple actions can be performed at once.
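
For example, a minimal sketch of what that .Rprofile addition might look like (adjust the guard to your own taste):

## In ~/.Rprofile: load pacman at the start of every session so p_load()
## and friends are always on hand
if (requireNamespace("pacman", quietly = TRUE)) {
    library(pacman)
}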

Installing pacman

install.packages("pacman")

## May need the following if binaries haven't been built yet:
install.packages("pacman", type="source")

## Or install from GitHub via devtools:
devtools::install_github("trinker/pacman")

As this is the first release, we expect that there are kinks that need to be worked out. We appreciate pull requests and issue reports.


Examples

Here are some of the functionalities the pacman authors tend to use most often:

Installing and Loading

p_load is a general use tool that can install, load, and update packages. For example, many blog posts begin coding with this sort of package call:

packs <- c("XML", "devtools", "RCurl", "fakePackage", "SPSSemulate")
success <- suppressWarnings(sapply(packs, require, character.only = TRUE))
install.packages(names(success)[!success])
sapply(names(success)[!success], require, character.only = TRUE)

With pacman this call can be reduced to:

pacman::p_load(XML, devtools, RCurl, fakePackage, SPSSemulate)

Installing Temporarily

p_temp enables the user to temporarily install a package. This allows a session-only install for testing out a single package without muddying the user’s library.

p_temp(aprof)

Package Functions & Data

p_functions (aka p_funs) and p_data enable the user to see the functions or data sets available in an add-on package.

p_functions(pacman)
p_funs(pacman, all=TRUE)
p_data(lattice)

Vignettes

Check out pacman’s vignettes for more on what the package can do.


Scheduling R Tasks via Windows Task Scheduler

This post shows Windows R users how to schedule late-night tasks, letting you impress your boss with your seemingly tireless work ethic.  Picture it: your boss gets an email at 1:30 in the morning with the latest company data as a beautiful report.  Linux and Mac users can do this rather easily via cron; Windows users can do it via the Task Scheduler, which can also be driven from the command line.

As this is more process oriented, I have created a minimal example on GitHub and the following video rather than providing scripts in-text.  All the scripts can be accessed via https://github.com/trinker/Make_Task.  Users will need to fill in relevant information (e.g., paths, usernames) and install the necessary packages to run the scripts.  The main point of this demonstration is to provide the Windows-using reader with a procedure for automating R tasks.
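
If you would rather skip the GUI, here is a rough sketch of registering such a task from within R by shelling out to the schtasks command line tool (the task name, time, and script path are placeholders; quote paths that contain spaces):

## Sketch: create a daily 1:30 AM task that runs an R script with Rscript
## (hypothetical task name and paths; Windows only)
rscript <- file.path(R.home("bin"), "Rscript.exe")
cmd <- paste0(
    'schtasks /create /tn "nightly_report" /sc DAILY /st 01:30 ',
    '/tr "', rscript, ' C:/path/to/report.R"'
)
system(cmd)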


Visualizing APA 6 Citations: qdapRegex 0.2.0 & qdapTools 1.1.0

qdapRegex 0.2.0 & qdapTools 1.1.0 have been released to CRAN.  This post covers some of the packages’ updates/features and provides an integrated demonstration of extracting and viewing in-text APA 6 style citations from an MS Word (.docx) document.

qdapRegex 0.2.0

The qdapRegex package is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R‘s own regular expression functions, or with add-on string manipulation packages such as stringr and stringi.  The qdapRegex package serves a dual purpose of being both functional and educational.
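
As a small illustration of that interplay (a sketch; pastex, covered in a later post, simply returns the named pattern as a plain string):

library(qdapRegex)

## Grab a canned pattern as a plain string and hand it to base R's own
## regex machinery (the pattern name comes from the regex_usa dictionary)
x <- "Learn more at www.talkstats.com or http://stackoverflow.com"
pat <- pastex("@rm_url")
regmatches(x, gregexpr(pat, x, perl = TRUE))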

New Features/Changes

Here are a select few new features.  For a complete list of changes CLICK HERE:

  • is.regex added as a logical check of a regular expression’s validity (i.e., whether it conforms to R’s regular expression rules).
  • Case wrapper functions, TC (title case), U (upper case), and L (lower case) added for convenient case manipulation.
  • rm_citation_tex added to remove/extract/replace bibkey citations from a .tex (LaTeX) file.
  • regex_cheat data set and cheat function added to act as a quick reference for common regex operations such as lookaheads.
  • explain added to view a visual representation of a regular expression using http://www.regexper.com and http://rick.measham.id.au/paste/explain. Also takes named regular expressions from regex_usa or another supplied dictionary.

The last two functions, cheat & explain, provide educational regex tools. regex_cheat provides a cheat sheet of common regex elements, while explain interfaces with http://www.regexper.com & http://rick.measham.id.au/paste/explain.
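
A quick sketch of a few of these in use (exact output may vary slightly by version):

library(qdapRegex)

## Is a string a valid regular expression?
is.regex("[a-z]+")    ## valid pattern
is.regex("[a-z")      ## unbalanced bracket, not valid

## Convenient case wrappers
TC("the quick brown fox")    ## title case
U("quiet")                   ## upper case

## Quick regex reference and a visual explanation of a pattern
cheat()
explain("\\d{3}-\\d{4}")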

qdapTools 1.1.0

qdapTools is an R package that contains tools associated with the qdap package that may be useful outside of the context of text analysis.

New Features/Changes

  • loc_split added to split data forms (list, vector, data.frame, matrix) on a vector of integer locations.
  • matrix2long makes a long format data.frame. It takes a matrix object, stacks all columns and adds identifying columns by repeating row and column names accordingly.
  • read_docx added to read in .docx documents.
  • split_vector picks up a regex argument to allow regular expression searches for break locations.
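
A quick sketch of the first two of these (argument names from memory and may differ slightly):

library(qdapTools)

## Split a vector at integer locations
loc_split(LETTERS[1:10], c(4, 8))

## Stack a matrix into a long data.frame with row/column identifiers
m <- matrix(1:6, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
matrix2long(m)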

Integrated Demonstration

In this demonstration we will use url_dl to grab a .docx file from the Internet. We’ll then read this document in with read_docx and use split_vector to split the text from the .docx into a main body and a references section. rm_citation will be used to extract in-text APA 6 style citations. Last, we will view frequencies and a visualization of the distribution of the citations using ggplot2. For a complete script of the R code used in this blog post CLICK HERE.

First we’ll make sure we have the correct versions of the packages, install them if necessary, and load the required packages for the demonstration:

Map(function(x, y) {
    if (!x %in% list.files(.libPaths())){
        install.packages(x)   
    } else {
        if (packageVersion(x) < y) {
            install.packages(x)   
        } else {
            message(sprintf("Version of %s is suitable for demonstration", x))
        }
    }
}, c("qdapRegex", "qdapTools"), c("0.2.0", "1.1.0"))

lapply(c("qdapRegex", "qdapTools", "ggplot2", "qdap"), require, character.only=TRUE)

Now let’s grab the .docx document, read it in, and split into body/references sections:

## Download .docx
url_dl("http://umlreading.weebly.com/uploads/2/5/2/5/25253346/whole_language_timeline-updated.docx")

## Read in .docx
txt <- read_docx("whole_language_timeline-updated.docx")

## Remove non ascii characters
txt <- rm_non_ascii(txt) 

## Split into body/references sections
parts <- split_vector(txt, split = "References", include = TRUE, regex=TRUE)

## View body
parts[[1]]

## View references
parts[[2]]

Now we can extract the in-text APA 6 citations and view frequencies:

## Extract citations in order of appearance
rm_citation(unbag(parts[[1]]), extract=TRUE)[[1]]

## Extract citations by section 
rm_citation(parts[[1]], extract=TRUE)

## Frequency
left_just(cites <- list2df(sort(table(rm_citation(unbag(parts[[1]]),
    extract=TRUE)), TRUE), "freq", "citation")[2:1])

##    citation                                                   freq
## 1  Walker, 2008                                                 14
## 2  Flesch (1955)                                                 2
## 3  Adams (1990)                                                  1
## 4  Anderson, Hiebert, Scott, and Wilkinson (1985)                1
## 5  Baumann & Hoffman, 1998                                       1
## 6  Baumann, 1998                                                 1
## 7  Bond and Dykstra (1967)                                       1
## 8  Chall (1967)                                                  1
## 9  Clay (1979)                                                   1
## 10 Goodman and Goodman (1979)                                    1
## 11 McCormick & Braithwaite, 2008                                 1
## 12 Read Adams (1990)                                             1
## 13 Stahl and Miller (1989)                                       1
## 14 Stahl and Millers (1989)                                      1
## 15 Word Perception Intrinsic Phonics Instruction Gates (1951)    1

Now we can find the locations of the citations in the text and plot a distribution of the in-text citations throughout the text:

## Distribution of citations (find locations)
cite_locs <- do.call(rbind, lapply(cites[[1]], function(x){
    m <- gregexpr(x, unbag(parts[[1]]), fixed=TRUE)
    data.frame(
        citation=x,
        start = m[[1]] -5,
        end =  m[[1]] + 5 + attributes(m[[1]])[["match.length"]]
    )
}))

## Plot the distribution
ggplot(cite_locs) +
    geom_segment(aes(x=start, xend=end, y=citation, yend=citation), size=3,
        color="yellow") +
    xlab("Duration") +
    scale_x_continuous(expand = c(0,0),
        limits = c(0, nchar(unbag(parts[[1]])) + 25)) +
    theme_grey() +
    theme(
        panel.grid.major=element_line(color="grey20"),
        panel.grid.minor=element_line(color="grey20"),
        plot.background = element_rect(fill="black"),
        panel.background = element_rect(fill="black"),
        panel.border = element_rect(colour = "grey50", fill=NA, size=1),
        axis.text=element_text(color="grey50"),    
        axis.title=element_text(color="grey50")  
    )

[Figure: distribution of in-text citations across the document]


LRA 2014- Communication Nomads: Blogging to Reclaim Our Academic Birthright


I have been asked to speak at the 2014 LRA Conference on the topic of Academic Blogging.

Time: 1:15-2:15
Location: Islands Ballroom Salon B – Lobby Level
My Slides: http://clari.buffalo.edu/blog
My Précis: http://clari.buffalo.edu/blog/materials/precis.pdf

The talk is part of a larger alternative session: Professors, We Need You!!! – Public Intellectuals, Advocacy, and Activism. This session “will engage participants in dialogue about how to transform the Literacy Research Association’s (LRA’s) role in advocacy for literacy learning and instruction among children, families, and educators through social media, open access spaces, and other channels.” Please join us if you’re at #LRA14.

Session Organizer: Carla K. Meyer, Appalachian State University
Chair: William Ian O’Byrne, University of New Haven
Discussant: Norman A. Stahl, Northern Illinois University


GTrendsR package to Explore Google trending for Field Dependent Terms

My friend, Steve Simpson, introduced me to Philippe Massicotte and Dirk Eddelbuettel’s GTrendsR GitHub package this week. It’s a pretty nifty wrapper to the Google Trends API that enables one to search phrase trends over time. The trend indices that are given are explained in more detail here: https://support.google.com/trends/answer/4355164?hl=en

Ever have a toy you know is super cool but don’t know what to use it for yet? That’s GTrendsR for me. So I made up an activity to use it for that’s related to my own interests (click HERE to download just the R code for this post). I decided to choose the first 10 phrases I could think of related to my field, literacy, and then used GTrendsR to view how Google search trending has changed for these terms. Here are the 10 admittedly biased terms I chose:

  1. reading assessment
  2. common core
  3. reading standards
  4. phonics
  5. whole language
  6. lexile score
  7. balanced approach
  8. literacy research association
  9. international reading association
  10. multimodal

The last term did not receive enough hits to trend, which is telling, since the field is talking about multimodality, but search trends don’t seem to be affected to the point of registering with Google Trends.


Getting Started

The GTrendsR package provides great tools for grabbing the information from Google; however, for my own task I wanted simpler tools to grab certain chunks of information easily and format them in a tidy way. So I built a small wrapper package, mostly for myself, that will likely remain a GitHub-only package: https://github.com/trinker/gtrend

You can install it for yourself (we’ll use it in this post) and load all necessary packages via:

devtools::install_github("dvanclev/GTrendsR")
devtools::install_github("trinker/gtrend")
library(gtrend); library(dplyr); library(ggplot2); library(scales)

The Initial Search

When you perform the search with gtrend_scraper, you will need to enter your Google user name and password.

I did an initial search and plotted the trends for the 9 terms. It was a big, colorful, clustery mess.

terms <- c("reading assessment", "common core", "reading standards",
    "phonics", "whole language", "lexile score", "balanced approach",
    "literacy research association", "international reading association"
)

out <- gtrend_scraper("your@gmail.com", "password", terms)

out %>%
    trend2long() %>%
    plot() 

[Figure: trend lines for all nine terms plotted together]

So I faceted each of the terms out to look at the trends.

out %>%
    trend2long() %>%
    ggplot(aes(x=start, y=trend, color=term)) +
        geom_line() +
        facet_wrap(~term) +
        guides(color=FALSE)

[Figure: trends faceted by term]

Some interesting patterns began to emerge: a repeated pattern showed up in almost all of the educational terms, which I thought interesting. We’ll explore that first. The basic shape wasn’t yet discernible, so I took a small subset of one term, reading+assessment, to explore the trend line by year:

names(out)[1]
## [1] "reading+assessment"
dat <- out[[1]][["trend"]]
colnames(dat)[3] <- "trend"

dat2 <- dat[dat[["start"]] > as.Date("2011-01-01"), ]

rects <- dat2  %>%
    mutate(year=format(as.Date(start), "%y")) %>%
    group_by(year) %>%
    summarize(xstart = as.Date(min(start)), xend = as.Date(max(end)))

ggplot() +
    geom_rect(data = rects, aes(xmin = xstart, xmax = xend, ymin = -Inf, 
        ymax = Inf, fill = factor(year)), alpha = 0.4) +
    geom_line(data=dat2, aes(x=start, y=trend), size=.9) + 
    scale_x_date(labels = date_format("%m/%y"), 
        breaks = date_breaks("month"),
        expand = c(0,0), 
        limits = c(as.Date("2011-01-02"), as.Date("2014-12-31"))) +
    theme(axis.text.x = element_text(angle = -45, hjust = 0)) 

[Figure: reading+assessment trend by year, 2011-2014, with years shaded]

What I noticed was that for each year there was a general double hump pattern that looked something like this:

This pattern holds consistent across educational terms. I added some context to a smaller subset to help with the narrative:

dat3 <- dat[dat[["start"]] > as.Date("2010-12-21") & 
		dat[["start"]] < as.Date("2012-01-01"), ]

ggplot() +
    geom_line(data=dat3, aes(x=start, y=trend), size=1.2) + 
    scale_x_date(labels = date_format("%b %y"), 
        breaks = date_breaks("month"),
        expand = c(0,0)) +
    theme(axis.text.x = element_text(angle = -45, hjust = 0)) +
    theme_bw() + theme(panel.grid.major.y=element_blank(),
        panel.grid.minor.y=element_blank()) + 
    ggplot2::annotate("text", x = as.Date("2011-01-15"), y = 50, 
        label = "Winter\nBreak Ends") +
    ggplot2::annotate("text", x = as.Date("2011-05-08"), y = 70, 
        label = "Summer\nBreak\nAcademia") +
    ggplot2::annotate("text", x = as.Date("2011-06-15"), y = 76, 
        label = "Summer\nBreak\nTeachers") +
    ggplot2::annotate("text", x = as.Date("2011-08-18"), y = 63, 
        label = "Academia\nReturns") +
    ggplot2::annotate("text", x = as.Date("2011-08-17"), y = 78, 
        label = "Teachers\nReturn")+
    ggplot2::annotate("text", x = as.Date("2011-11-17"), y = 61, 
        label = "Thanksgiving")

[Figure: 2011 reading+assessment trend annotated with school-calendar events]

Of course this is all me trying to line up dates with educational search terms in a logical sense; it is a hypothesis rather than a firm conclusion. If this visual model is correct, though (that these events impact Google searches around educational terms), and if a Google search is an indication of work to advance understanding of a concept, it’s clear that folks aren’t too interested in advancing educational knowledge at Thanksgiving and Christmas time. These are, of course, big assumptions. But if true, the implications extend further: perhaps the most fertile time to engage educators, education students, and educational researchers is the first month after summer break.


Second Noticing

I also noticed that the two major literacy organizations are in a downward trend.

out %>%
    trend2long() %>%
    filter(term %in% c("literacy+research+association", 
        "international+reading+association")) %>%
    as.trend2long() %>%
    plot() + 
    guides(color=FALSE) +
    ggplot2::annotate("text", x = as.Date("2011-08-17"), y = 60, 
        label = "International\nReading\nAsociation", color="#F8766D")+
    ggplot2::annotate("text", x = as.Date("2006-01-17"), y = 38, 
        label = "Literacy\nResearch\nAssociation", color="#00BFC4") +
    theme_bw() +
    stat_smooth()

[Figure: downward trends for the two literacy associations with smoothed fits]

I wonder what might be causing the downward trend? Also, I notice the trends for the two associations are growing apart, with the International Reading Association being affected less. Can this downward trend be reversed?


Associated Terms

Lastly, I want to look at some term uses across time and see if they correspond with what I know to be historical events around literacy in education.

out %>%
    trend2long() %>%
    filter(term %in% names(out)[1:7]) %>%
    as.trend2long() %>%
    plot() + scale_colour_brewer(palette="Set1") +
    facet_wrap(~term, ncol=2) +
        guides(color=FALSE)

[Figure: trends for the first seven terms, faceted]

This made me want to group the following 4 terms together as there’s near perfect overlap in the trends. I don’t have a plausible historical explanation for this. Hopefully, a more knowledgeable other can fill in the blanks.

out %>%
    trend2long() %>%
    filter(term %in% names(out)[c(1, 3, 5, 7)]) %>%
    as.trend2long() %>%
    plot() 

[Figure: four terms with near-perfect overlap in their trends]

I explored the three remaining terms in the graph below. As expected, ‘common core’ and ‘lexile’ (scores associated with quantitative measures of text complexity) are on an upward trend. Phonics, on the other hand, is on a downward trend.

out %>%
    trend2long() %>%
    filter(term %in% names(out)[c(2, 4, 6)]) %>%
    as.trend2long() %>%
    plot() 

[Figure: trends for common core, phonics, and lexile score]

This was a fun exploratory use of the GTrendsR package. Thanks to Steve Simpson for the introduction to GTrendsR and to Philippe Massicotte and Dirk Eddelbuettel for sharing their work.


*Created using the reports package


rmarkdown: Alter Action Depending on Document

Can I see a show of hands for those who love rmarkdown? Yeah, me too. One nifty feature is the ability to specify various document prettifications in the YAML of a .Rmd document and then use:

rmarkdown::render("foo.Rmd", "all")



The Problem

Have you ever said, “I wish I could do X for document type A and Y for document type B”? I have, as seen in this SO question from late August. But my plea went unanswered until today…


The Solution

Baptiste Auguie answered a similar question on SO. The key to Baptiste’s answer is this:

```{r, echo=FALSE}
out_type <- knitr::opts_knit$get("rmarkdown.pandoc.to")
```

This basically says, "Document, figure out what type you are." You can then feed this information to if () {} else {}, switch(), etc., and act differently depending on the type of document being rendered. If Baptiste is correct, the options and flexibility are endless.

I decided to put Baptiste’s answer to the test on more complex scenarios. Here it is as a GitHub repo that you can fork and/or download and try at home.



Simple Example

To get a sense of how this is working let’s start with a simple example. I will assume some familiarity with rmarkdown and the YAML system. Here we will grab the info from knitr::opts_knit$get("rmarkdown.pandoc.to") and feed it to a switch() statement and act differently for a latex, docx, and html document.

---
title: "For Fun"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: true
    theme: journal
    number_sections: true
  pdf_document:
    toc: true
    number_sections: true
  word_document:
    fig_width: 5
    fig_height: 5
    fig_caption: true
---

```{r, echo=FALSE}
out_type <- knitr::opts_knit$get("rmarkdown.pandoc.to")
```

## Out Type

```{r, echo=FALSE}
print(out_type)
```

## Good times

```{r, results='asis', echo=FALSE}
switch(out_type,
    html = "I'm HTML",
    docx = "I'm MS Word",
    latex = "I'm LaTeX"
)
```

The result for each document type, using rmarkdown::render("simple.Rmd", "all"), is:

 

[Screenshot: the rendered HTML document]

[Screenshot: the rendered LaTeX (PDF) document]

[Screenshot: the rendered docx document]

 


Extended Example

That’s great, but my boss ain’t gonna be impressed with printing different statements. Let’s put this to the test: I want to embed a video into the HTML and PDF (LaTeX) documents and just put a URL in the MS Word (docx) document. By the way, if someone has a way to programmatically embed the video in the docx file, please share.

For this setup we can use a standard iframe for HTML and the media9 package for the LaTeX version to add a YouTube video. Note that not all PDF viewers can render the video (Adobe worked for me; PDF XChange Viewer did not). We also have to add a small tweak to include the media9 package in a .sty file (a quasi preamble) using these lines in the YAML:

    includes:
            in_header: preambleish.sty

And then create a separate .sty file that includes LaTeX package calls and other typical actions done in a preamble.
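
A minimal preambleish.sty for this example might contain little more than the media9 load (plus whatever other preamble commands you normally use):

\usepackage{media9}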

---
title: "For Fun"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: true
    theme: journal
    number_sections: true
  pdf_document:
    toc: true
    number_sections: true
    includes:
            in_header: preambleish.sty
  word_document:
    fig_width: 5
    fig_height: 5
    fig_caption: true
---

```{r, echo=FALSE}
out_type <- knitr::opts_knit$get("rmarkdown.pandoc.to")
```

## Out Type

```{r, echo=FALSE}
print(out_type)
```

## Good times

```{r, results='asis', echo=FALSE}
switch(out_type,
    html = {cat('<a href="https://www.youtube.com/embed/FnblmZdTbYs?feature=player_detailpage">https://www.youtube.com/embed/FnblmZdTbYs?feature=player_detailpage</a>')},
    docx = cat("https://www.youtube.com/watch?v=ekBJgsfKnlw"),
	latex = cat("\\begin{figure}[!ht]
  \\centering
\\includemedia[
  width=0.6\\linewidth,height=0.45\\linewidth,
  activate=pageopen,
  flashvars={
    modestbranding=1 % no YT logo in control bar
   &autohide=1       % controlbar autohide
   &showinfo=0       % no title and other info before start
  }
]{}{http://www.youtube.com/v/ekBJgsfKnlw?rel=0}   % Flash file
  \\caption{Important Video.}
\\end{figure}" )
)
```

The result for each document type, using rmarkdown::render("extended.Rmd", "all"), is:

[Screenshot: the rendered HTML document with embedded video]

[Screenshot: the rendered LaTeX (PDF) document with embedded video]

[Screenshot: the rendered docx document with the video URL]

 

I hope this post extends flexibility and speeds up workflow for folks. Thanks to @Baptiste for a terrific solution.


Exploration of Letter Make Up of English Words

This blog post will do a quick exploration of the grapheme (letter) makeup of words in English. Specifically, we will use R and the qdap package to answer 3 questions:

  1. What is the distribution of word lengths (number of graphemes)?
  2. What is the frequency of letter (grapheme) use in English words?
  3. What is the distribution of letters positioned within words?

Click HERE for a script with all of the code for this post.


We will begin by loading the necessary packages and data (note you will need qdap 2.2.0 or higher):

if (!packageVersion("qdap") >= "2.2.0") {
    install.packages("qdap")	
}
library(qdap); library(qdapDictionaries); library(ggplot2); library(dplyr)
data(GradyAugmented)

The Dictionary: Augmented Grady

We will be using qdapDictionaries::GradyAugmented to conduct the mini-analysis. The GradyAugmented list is an augmented version of Grady Ward’s English words with additions from various other sources, including Mark Kantrowitz’s names list. The result is a character vector of 122,806 English words and proper nouns.

GradyAugmented
?GradyAugmented

Question 1

What is the distribution of word lengths (number of graphemes)?

To answer this we will use base R’s summary, qdap‘s dist_tab function, and a ggplot2 histogram.

summary(nchar(GradyAugmented))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    6.00    8.00    7.87    9.00   21.00 
dist_tab(nchar(GradyAugmented))
   interval  freq cum.freq percent cum.percent
1         1    26       26    0.02        0.02
2         2   116      142    0.09        0.12
3         3  1085     1227    0.88        1.00
4         4  4371     5598    3.56        4.56
5         5  9830    15428    8.00       12.56
6         6 16246    31674   13.23       25.79
7         7 23198    54872   18.89       44.68
8         8 27328    82200   22.25       66.93
9         9 17662    99862   14.38       81.32
10       10  9777   109639    7.96       89.28
11       11  5640   115279    4.59       93.87
12       12  3348   118627    2.73       96.60
13       13  2052   120679    1.67       98.27
14       14  1066   121745    0.87       99.14
15       15   582   122327    0.47       99.61
16       16   268   122595    0.22       99.83
17       17   136   122731    0.11       99.94
18       18    50   122781    0.04       99.98
19       19    17   122798    0.01       99.99
20       20     5   122803    0.00      100.00
21       21     3   122806    0.00      100.00
ggplot(data.frame(nletters = nchar(GradyAugmented)), aes(x=nletters)) + 
    geom_histogram(binwidth=1, colour="grey70", fill="grey60") +
    theme_minimal() + 
    geom_vline(xintercept = mean(nchar(GradyAugmented)), size=1, 
        colour="blue", alpha=.7) + 
    xlab("Number of Letters")

[Figure: histogram of word lengths with a vertical line at the mean]

Here we can see that the average word length is 7.87 letters, with a minimum of 1 (expected) and a maximum of 21 letters. The histogram indicates the distribution is skewed slightly right.

Question 2

What is the frequency of letter (grapheme) use in English words?

Now we will view overall letter use in the augmented Grady word list. Wheel of Fortune lovers…how will r, s, t, l, n, e fare? Here we will double loop through each word and each letter of the alphabet and grab the positions of the letters in the words using gregexpr, a nifty function that gives the starting locations of regular expression matches. At this point the positioning isn’t necessary for answering the 2nd question, but we’re setting ourselves up to answer the 3rd question. We’ll then use a frequency table and an ordered bar chart to see the frequency of letters in the word list.

Be patient with the double loop (lapply/sapply); it works through 122,806 words and takes ~1 minute to run.

position <- lapply(GradyAugmented, function(x){

    z <- unlist(sapply(letters, function(y){
        gregexpr(y, x, fixed = TRUE)
    }))
    z <- z[z != -1] 
    setNames(z, gsub("\\d", "", names(z)))
})


position2 <- unlist(position)

freqdat <- dist_tab(names(position2))
freqdat[["Letter"]] <- factor(toupper(freqdat[["interval"]]), 
    levels=toupper((freqdat %>% arrange(freq))[[1]] %>% as.character))

ggplot(freqdat, aes(Letter, weight=percent)) + 
  geom_bar() + coord_flip() +
  scale_y_continuous(breaks=seq(0, 12, 2), label=function(x) paste0(x, "%"), 
      expand = c(0,0), limits = c(0,12)) +
  theme_minimal()

[Figure: ordered bar chart of letter frequency (percent)]

The output is given as the percentage of letter uses. Let’s see if that jibes with the points one gets in a Scrabble game for various tiles:

Overall, yeah I suppose the Scrabble point system makes sense. However, it makes me question why the “K” is worth 5 and why “Y” is only worth 3. I’m sure more thought went into the creation of Scrabble than this simple analysis**.

**EDIT: I came across THIS BLOG POST indicating that perhaps the point values of Scrabble tiles are antiquated.  

Question 3

What is the distribution of letters positioned within words?

Now we will use a heat map to tackle the question of which letters are found in which positions. I like the blue-high/yellow-low configuration of heat maps; for me it is a good contrast, but you may not agree. Please switch the high/low colors if they don’t suit.

dat <- data.frame(letter=toupper(names(position2)), position=unname(position2))

dat2 <- table(dat)
dat3 <- t(round(apply(dat2, 1, function(x) x/sum(x)), digits=3) * 100)
qheat(apply(dat2, 1, function(x) x/length(position2)), high="blue", 
    low="yellow", by.column=NULL, values=TRUE, digits=3, plot=FALSE) +
    ylab("Letter") + xlab("Position") + 
    guides(fill=guide_legend(title="Proportion"))

[Figure: heat map of letter proportions by position within words]

The letters “S” and “C” dominate the first position. Interestingly, vowels and the consonants “R” and “N” lead the second spot. I’m guessing the latter is due to consonant blends. The letter “S” likes most spots except the second spot. This appears to be similar, though less pronounced, for other popular consonants. The letter “R”, if this were a baseball team, would be the utility player, able to do well in multiple positions. One last noticing…don’t put “H” in the third position.



*Created using the reports package


Canned Regular Expressions: qdapRegex 0.1.2 on CRAN

We’re pleased to announce the first CRAN release of qdapRegex! You can read about qdapRegex below or skip right to the examples.

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis. The package uses a dictionary system to uniformly perform extraction, removal, and replacement.  Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, person tags, phone numbers, times, and zip codes.

The qdapRegex package does not aim to compete with string manipulation packages such as stringr or stringi but is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R‘s own regular expression functions, or with add-on string manipulation packages such as stringr and stringi.

You can download it from CRAN or from GitHub.

 

 

Examples

Let’s see qdapRegex in action. Functions starting with rm_ generally remove the canned regular expression pattern that they name; with extract = TRUE the matches are extracted instead. A replacement argument also allows for optional replacements.

URLs

library(qdapRegex)
x <- "I like www.talkstats.com and http://stackoverflow.com"

## Removal
rm_url(x)

## Extraction
rm_url(x, extract=TRUE)

## Replacement
rm_url(x, replacement = '<a href="\\1" target="_blank">\\1</a>')
## Removal
## [1] "I like and"
## > 
## Extraction
## [[1]]
## [1] "www.talkstats.com"        "http://stackoverflow.com"
## 
## Replacement
## [1] "I like <a href=\"\" target=\"_blank\"></a> and <a href=\"http://stackoverflow.com\" target=\"_blank\">http://stackoverflow.com</a>"

Twitter Hash Tags

x <- c("@hadley I like #rstats for #ggplot2 work.",
    "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats:
        http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
    "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization
        presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1"
)

rm_hash(x)
rm_hash(x, extract=TRUE)
## > rm_hash(x)
## [1] "@hadley I like for work."                                                                                                                                  
## [2] "Difference between and , both implement pipeline operators for : http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio"
## [3] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation . http://ramnathv.github.io/user2014-rcharts/#1" 

  
## > rm_hash(x, extract=TRUE)
## [[1]]
## [1] "#rstats"  "#ggplot2"
## 
## [[2]]
## [1] "#magrittr" "#pipeR"    "#rstats"  
## 
## [[3]]
## [1] "#user2014"

Emoticons

x <- c("are :-)) it 8-D he XD on =-D they :D of :-) is :> for :o) that :-/",
  "as :-D I xD with :^) a =D to =) the 8D and :3 in =3 you 8) his B^D was")
rm_emoticon(x)
rm_emoticon(x, extract=TRUE)
## > rm_emoticon(x)
## [1] "are it he on they of is for that"     
## [2] "as I with a to the and in you his was"


## > rm_emoticon(x, extract=TRUE)
## [[1]]
## [1] ":-))" "8-D"  "XD"   "=-D"  ":D"   ":-)"  ":>"   ":o)"  ":-/" 
## 
## [[2]]
##  [1] ":-D" "xD"  ":^)" "=D"  "=)"  "8D"  ":3"  "=3"  "8)"  "B^D"

Academic, APA 6 Style, Citations

x <- c("Hello World (V. Raptor, 1986) bye",
    "Narcissism is not dead (Rinker, 2014)",
    "The R Core Team (2014) has many members.",
    paste("Bunn (2005) said, \"As for elegance, R is refined, tasteful, and",
        "beautiful. When I grow up, I want to marry R.\""),
    "It is wrong to blame ANY tool for our own shortcomings (Baer, 2005).",
    "Wickham's (in press) Tidy Data should be out soon.",
    "Rinker's (n.d.) dissertation not so much.",
    "I always consult xkcd comics for guidance (Foo, 2012; Bar, 2014).",
    "Uwe Ligges (2007) says, \"RAM is cheap and thinking hurts\""
)

rm_citation(x)
rm_citation(x, extract=TRUE)
## > rm_citation(x)
## [1] "Hello World () bye"                                                                                  
## [2] "Narcissism is not dead ()"                                                                           
## [3] "has many members."                                                                                   
## [4] "said, \"As for elegance, R is refined, tasteful, and beautiful. When I grow up, I want to marry R.\""
## [5] "It is wrong to blame ANY tool for our own shortcomings ()."                                          
## [6] "Tidy Data should be out soon."                                                                       
## [7] "dissertation not so much."                                                                           
## [8] "I always consult xkcd comics for guidance (; )."                                                     
## [9] "says, \"RAM is cheap and thinking hurts\""     

                                                      
## > rm_citation(x, extract=TRUE)
## [[1]]
## [1] "V. Raptor, 1986"
## 
## [[2]]
## [1] "Rinker, 2014"
## 
## [[3]]
## [1] "The R Core Team (2014)"
## 
## [[4]]
## [1] "Bunn (2005)"
## 
## [[5]]
## [1] "Baer, 2005"
## 
## [[6]]
## [1] "Wickham's (in press)"
## 
## [[7]]
## [1] "Rinker's (n.d.)"
## 
## [[8]]
## [1] "Foo, 2012" "Bar, 2014"
## 
## [[9]]
## [1] "Uwe Ligges (2007)"

Combining Regular Expressions

A user may wish to combine regular expressions. For example, one may want to extract all URLs and Twitter short URLs. The verb pastex (paste + regex) pastes together regular expressions. It will also search the regex dictionaries for named regular expressions prefixed with @. So…

pastex("@rm_twitter_url", "@rm_url")

yields…

## [1] "(https?://t\\.co[^ ]*)|(t\\.co[^ ]*)|(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

If we combine this ability with qdapRegex‘s function generator, rm_, we can make our own function that removes both standard URLs and Twitter Short URLs.

rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))

Let’s use it…

rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))

x <- c("download file from http://example.com",
         "this is the link to my website http://example.com",
         "go to http://example.com from more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_twitter_n_url(x)
rm_twitter_n_url(x, extract=TRUE)
## > rm_twitter_n_url(x)
## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"                   
## [5] "And"                            "twitter type:"                 
## [7] "still another one :-)"     

    
## > rm_twitter_n_url(x, extract=TRUE)
## [[1]]
## [1] "http://example.com"
## 
## [[2]]
## [1] "http://example.com"
## 
## [[3]]
## [1] "http://example.com"
## 
## [[4]]
## [1] "ftp://www.example.com"
## 
## [[5]]
## [1] "https://www.example.net"
## 
## [[6]]
## [1] "t.co/N1kq0F26tG"
## 
## [[7]]
## [1] "https://t.co/N1kq0F26tG"

*Note that there is a binary operator version of pastex, %|%, that may be more useful to some folks.

"@rm_twitter_url" %|% "@rm_url"

yields…

## > "@rm_twitter_url" %|% "@rm_url"
## [1] "(https?://t\\.co[^ ]*)|(t\\.co[^ ]*)|(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

Educational

Regular expressions can be extremely powerful but were difficult for me to grasp at first.

The qdapRegex package serves a dual purpose of being both functional and educational. While the canned regular expressions are useful in and of themselves, they also serve as a platform for understanding regular expressions in the context of meaningful, purposeful usage. In the same way I learned guitar by trying to mimic Eric Clapton rather than by learning scales and theory, some folks may enjoy learning regular expressions in a more pragmatic, experiential way. Users are encouraged to look at the regular expressions being used (?regex_usa and ?regex_supplement are the default regular expression dictionaries used by qdapRegex) and unpack how they work. I have found that slow, repeated exposures to information in a purposeful context result in acquired knowledge.

The following regular expression sites were very helpful to my own regular expression education:

  1. Regular-Expression.info
  2. Rex Egg
  3. Regular Expressions as used in R

Being able to discuss and ask questions is also important to learning…in this case regular expressions. I have found the following forums extremely helpful to learning about regular expressions:

  1. Talk Stats + Posting Guidelines
  2. stackoverflow + Posting Guidelines

Acknowledgements

Thank you to the folks who have developed stringi (maintainer: Marek Gagolewski). The stringi package provides fast, consistent regular expression manipulation tools. qdapRegex uses the stringi package as a back end in most functions prefixed with rm_XXX.

We would also like to thank the many folks at http://www.talkstats.com and http://www.stackoverflow.com that freely give of their time to answer questions around many topics, including regular expressions.


Spell Checker for R…qdap::check_spelling

I have often had requests for a spell checker for R character vectors. The utils::aspell function can be used to check spelling, but many Windows users have reported difficulty with the function.

I came across an article on spelling in R entitled “Watch Your Spelling!” by Kurt Hornik and Duncan Murdoch. The paper walks us through definitions of spell checking, history, and a suggested spell checker implementation for R. A terrific read. Hornik & Murdoch (2010) end with the following call:

Clearly, more work will be needed: modern statistics needs better lexical resources, and a dictionary based on the most frequent spell check false alarms can only be a start. We hope that this article will foster community interest in contributing to the development of such resources, and that refined domain specific dictionaries can be made available and used for improved text analysis with R in the near future (p. 28).

I answered a question on stackoverflow.com a few months back that led to creating a suite of spell checking functions. The original functions used an agrep approach that was slow and inaccurate. I then discovered Mark van der Loo’s terrific stringdist package to do the heavy lifting; it calculates string distances very quickly with a variety of methods.
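
As a quick sketch of what stringdist offers (the method codes are from the stringdist documentation):

library(stringdist)

## Distance between a misspelling and a candidate correction under two
## different methods
stringdist("creatres", "creatures", method = "osa")   ## optimal string alignment
stringdist("creatres", "creatures", method = "jw")    ## Jaro-Winkler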

The rest of this blog post is meant as a minimal introduction to qdap‘s spell checking functions. A video will lead you through most of the process and accompanying scripts are provided.

Primitive Spell Checking Function

The which_misspelled function is a low level function that basically determines if each word of a single string is in a dictionary. It optionally gives suggested corrections.

library(qdap)
x <- "Robots are evl creatres and deserv exterimanitation."
which_misspelled(x, suggest=FALSE)
which_misspelled(x, suggest=TRUE)

Interactive Spell Checking

Typically a user will want to use the interactive spell checker (check_spelling_interactive) as it is more flexible and accurate.

dat <- DATA$state
dat[1] <- "Jasperita I likedd the cokie icekream"
dat
##  [1] "Jasperita I likedd the cokie icekream"
##  [2] "No it's not, it's dumb."              
##  [3] "What should we do?"                   
##  [4] "You liar, it stinks!"                 
##  [5] "I am telling the truth!"              
##  [6] "How can we be certain?"               
##  [7] "There is no way."                     
##  [8] "I distrust you."                      
##  [9] "What are you talking about?"          
## [10] "Shall we move on?  Good then."        
## [11] "I'm hungry.  Let's eat.  You already?"
(o <- check_spelling_interactive(dat))
preprocessed(o)
fixit <- attributes(o)$correct
fixit(dat)

A More Realistic Usage

m <- check_spelling_interactive(mraja1spl$dialogue[1:75])
preprocessed(m)
fixit <- attributes(m)$correct
fixit(mraja1spl$dialogue[1:75])

References

Hornik, K., & Murdoch, D. (2010). Watch Your Spelling! The R Journal, 3(2), 22-28.


Hijacking R Functions: Changing Default Arguments

I am working on a package to collect common regular expressions into a canned collection that users can easily use without having to know regexes. The package, qdapRegex, has a bunch of functions of the form rm_xxx. The only difference between the functions is one default parameter: the regular expression pattern. I had a default template function, so what I really needed was to copy that template many times and change one parameter. It seemed wasteful of code and electronic space to cut and paste the body of the template function over and over again…I needed to hijack the template.
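
To make the problem concrete, here is a toy sketch (hypothetical names and patterns, not the actual qdapRegex internals): every rm_xxx function is essentially the same template with a different default pattern.

## Toy illustration only: a shared template whose wrappers differ solely in
## the default `pattern`
rm_template <- function(text.var, pattern, extract = FALSE, ...) {
    if (extract) {
        return(regmatches(text.var, gregexpr(pattern, text.var, perl = TRUE)))
    }
    gsub(pattern, "", text.var, perl = TRUE)
}

rm_digits <- function(text.var, pattern = "\\d+", ...) {
    rm_template(text.var, pattern = pattern, ...)
}

rm_digits("I ate 3 cookies and 12 donuts")
rm_digits("I ate 3 cookies and 12 donuts", extract = TRUE)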

Come on, admit it: you’ve all wished you could hijack a function before. Who hasn’t wished the default for data.frame was stringsAsFactors = FALSE? Or that sum used na.rm = TRUE (OK, maybe the latter is just me)? So for the task of efficiently hijacking a function and changing the defaults in a manageable, modular way, my mind immediately went to Hadley’s pryr package (Wickham, 2014). I remember him hijacking functions in his Advanced R book, as seen HERE, with the partial function.

It worked, except I couldn’t then change the newly set defaults back. In my case, for package writing, this was not a good thing (maybe there was a way and I missed it).


A Function Worth Hijacking

Here’s an example where we attempt to hijack data.frame.

dat <- data.frame(x1 = 1:3, x2 = c("a", "b", "c"))
str(dat)  # yuck a string as a factor
## 'data.frame':    3 obs. of  2 variables:
##  $ x1: int  1 2 3
##  $ x2: Factor w/ 3 levels "a","b","c": 1 2 3

Typically we’d do something like:

.data.frame <- function(..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
    stringsAsFactors = FALSE) {

    data.frame(..., row.names = row.names, check.rows = check.rows,
        check.names = check.names, stringsAsFactors = stringsAsFactors)

}

dat <- .data.frame(x1 = 1:3, x2 = c("a", "b", "c"))
str(dat)  # yay!  strings are character
## 'data.frame':    3 obs. of  2 variables:
##  $ x1: int  1 2 3
##  $ x2: chr  "a" "b" "c"

But for my qdapRegex needs this required a ton of cut and paste. That means lots of extra code in the .R files.


The First Attempt to Hijack a Function

pryr to the rescue

library(pryr)

## The hijack
.data.frame <- pryr::partial(data.frame, stringsAsFactors = FALSE)

dat <- .data.frame(x1 = 1:3, x2 = c("a", "b", "c"))
str(dat)  # yay! strings are character
## 'data.frame':    3 obs. of  2 variables:
##  $ x1: int  1 2 3
##  $ x2: chr  "a" "b" "c"

But I can’t change the default back…

.data.frame(x1 = 1:3, x2 = c("a", "b", "c"), stringsAsFactors = TRUE)
## Error: formal argument "stringsAsFactors" matched by multiple actual
## arguments

Hijacking In Style (formals)

Doomed…

After tinkering with many not-so-reasonable solutions, I asked on stackoverflow.com. In a short time MrFlick responded most helpfully (as he often does) with an answer that used formals to change the formal arguments of a function. I should have thought of it myself, as I’d seen its use in Advanced R as well.

Here I use the answer to make a hijack function. It does exactly what I want: take a function and reset its formal arguments as desired.

hijack <- function (FUN, ...) {
    .FUN <- FUN
    args <- list(...)
    invisible(lapply(seq_along(args), function(i) {
        formals(.FUN)[[names(args)[i]]] <<- args[[i]]
    }))
    .FUN
}

Let’s see it in action as it changes the defaults but still allows the user to set these arguments…

.data.frame <- hijack(data.frame, stringsAsFactors = FALSE)

dat <- .data.frame(x1 = 1:3, x2 = c("a", "b", "c"))
str(dat)  # yay! strings are character
## 'data.frame':    3 obs. of  2 variables:
##  $ x1: int  1 2 3
##  $ x2: chr  "a" "b" "c"
.data.frame(x1 = 1:3, x2 = c("a", "b", "c"), stringsAsFactors = TRUE)
##   x1 x2
## 1  1  a
## 2  2  b
## 3  3  c

Note that for some purposes Dason suggested an alternative solution that is similar to the first approach I described above but requires less copying, as it uses dots (the ellipsis, ...) to cover the parameters that we don’t want to change. This approach would look something like this:

.data.frame <- function(..., stringsAsFactors = FALSE) {

    data.frame(..., stringsAsFactors = stringsAsFactors)

}

dat <- .data.frame(x1 = 1:3, x2 = c("a", "b", "c"))
str(dat)  # yay!  strings are character
## 'data.frame':    3 obs. of  2 variables:
##  $ x1: int  1 2 3
##  $ x2: chr  "a" "b" "c"
.data.frame(x1 = 1:3, x2 = c("a", "b", "c"), stringsAsFactors = TRUE)
##   x1 x2
## 1  1  a
## 2  2  b
## 3  3  c

Less verbose than the first approach I had. This solution was not the best for me in that I wanted to document all of the arguments to the function for the package. I believe using this approach would limit me to the arguments …, stringsAsFactors in the documentation (though I didn’t try it with CRAN checks). Depending on the situation this approach may be ideal.

References


*Created using the reports package
