Visualizing APA 6 Citations: qdapRegex 0.2.0 & qdapTools 1.1.0

qdapRegex 0.2.0 & qdapTools 1.1.0 have been released to CRAN.  This post will highlight some of the packages’ updates/features and provide an integrated demonstration of extracting and viewing in-text APA 6 style citations from an MS Word (.docx) document.

qdapRegex 0.2.0

The qdapRegex package is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R’s own regular expression functions, or with add-on string manipulation packages such as stringr and stringi.  The qdapRegex package serves a dual purpose of being both functional and educational.

New Features/Changes

Here are a select few new features.  For a complete list of changes CLICK HERE:

  • is.regex added as a logical check of a regular expression’s validity (conforms to R’s regular expression rules).
  • Case wrapper functions, TC (title case), U (upper case), and L (lower case) added for convenient case manipulation.
  • rm_citation_tex added to remove/extract/replace bibkey citations from a .tex (LaTeX) file.
  • regex_cheat data set and cheat function added to act as a quick reference for common regex task operations such as lookaheads.
  • explain added to view a visual representation of a regular expression using http://www.regexper.com and http://rick.measham.id.au/paste/explain. Also takes named regular expressions from the regex_usa or other supplied dictionary.

The last two, cheat and explain, are educational regex tools: the regex_cheat data set provides a cheat sheet of common regex elements, and explain interfaces with http://www.regexper.com & http://rick.measham.id.au/paste/explain.
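To make the idea behind a validity check like is.regex concrete, here is a minimal base R sketch. It is not qdapRegex's implementation; is_valid_regex is a hypothetical helper that simply tries to use the pattern and reports whether R accepted it.

```r
# Hypothetical helper (an assumption for illustration, not qdapRegex code):
# returns TRUE if R can compile `pattern`, FALSE if it raises an error
# or a warning while trying.
is_valid_regex <- function(pattern, perl = FALSE) {
    tryCatch({
        regexpr(pattern, "", perl = perl)  # compile and run against an empty string
        TRUE
    },
    warning = function(w) FALSE,
    error = function(e) FALSE)
}

is_valid_regex("a+b")        # a well-formed pattern
is_valid_regex("(unclosed")  # missing closing parenthesis
```

With the packages loaded, is.regex would be the function to reach for; the sketch only shows the kind of check involved.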

qdapTools 1.1.0

qdapTools is an R package that contains tools associated with the qdap package that may be useful outside of the context of text analysis.

New Features/Changes

  • loc_split added to split data forms (list, vector, data.frame, matrix) on a vector of integer locations.
  • matrix2long makes a long format data.frame. It takes a matrix object, stacks all columns and adds identifying columns by repeating row and column names accordingly.
  • read_docx added to read in .docx documents.
  • split_vector picks up a regex argument to allow for regular expression search of break location.
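The reshaping that the matrix2long bullet describes (stack all columns, repeat row and column names alongside the values) can be sketched in base R. This is only an illustration of the idea on a toy matrix; the column names cols, rows, and vals are chosen here for the example and the code is not taken from qdapTools.

```r
# Toy matrix with row and column names (made-up data)
m <- matrix(1:6, nrow = 2,
    dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Stack the columns and repeat the identifying names to match
long <- data.frame(
    cols = rep(colnames(m), each = nrow(m)),   # column name for each value
    rows = rep(rownames(m), times = ncol(m)),  # row name for each value
    vals = as.vector(m),                       # values in column-major order
    stringsAsFactors = FALSE
)

long
```

Each cell of the matrix becomes one row of the data.frame, which is the shape ggplot2 generally expects.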

Integrated Demonstration

In this demonstration we will use url_dl to grab a .docx file from the Internet. We’ll then read this document in with read_docx. We’ll use split_vector to split the text from the .docx into a main body and a references section. rm_citation will be used to extract in-text APA 6 style citations. Last, we will view frequencies and visualize the distribution of the citations using ggplot2. For a complete script of the R code used in this blog post CLICK HERE.

First we’ll make sure we have the correct versions of the packages, install them if necessary, and load the required packages for the demonstration:

Map(function(x, y) {
    if (!x %in% list.files(.libPaths())){
        install.packages(x)   
    } else {
        if (packageVersion(x) < y) {
            install.packages(x)   
        } else {
            message(sprintf("Version of %s is suitable for demonstration", x))
        }
    }
}, c("qdapRegex", "qdapTools"), c("0.2.0", "1.1.0"))

lapply(c("qdapRegex", "qdapTools", "ggplot2", "qdap"), require, character.only=TRUE)
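A side note on the version check above: packageVersion returns a version object, so the comparison against "0.2.0" is done component by component rather than as text. The sketch below shows why that matters; needs_install is a hypothetical helper written for this example, and it uses requireNamespace rather than the list.files approach above.

```r
# Plain strings compare character by character, which misorders versions
"0.10.0" > "0.2.0"

# Version objects compare component by component (10 > 2)
package_version("0.10.0") > package_version("0.2.0")

# Hypothetical helper: TRUE if `pkg` is missing or older than `min_version`
needs_install <- function(pkg, min_version) {
    !requireNamespace(pkg, quietly = TRUE) ||
        packageVersion(pkg) < min_version
}

needs_install("stats", "1.0.0")
```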

Now let’s grab the .docx document, read it in, and split into body/references sections:

## Download .docx
url_dl("http://umlreading.weebly.com/uploads/2/5/2/5/25253346/whole_language_timeline-updated.docx")

## Read in .docx
txt <- read_docx("whole_language_timeline-updated.docx")

## Remove non ascii characters
txt <- rm_non_ascii(txt) 

## Split into body/references sections
parts <- split_vector(txt, split = "References", include = TRUE, regex=TRUE)

## View body
parts[[1]]

## View references
parts[[2]]
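If the packages are not at hand, the body/references split can be imitated in base R. This is a rough sketch on made-up text, assuming the heading sits on its own line and should stay with the references; it is not how split_vector is implemented.

```r
# Toy stand-in for the paragraphs of a document (made-up text)
txt <- c(
    "Intro paragraph.",
    "Body paragraph (Smith, 2001).",
    "References",
    "Smith, J. (2001). A title."
)

# Position of the first element that is exactly the heading
brk <- grep("^References$", txt)[1]

# Everything before the heading is body; the heading onward is references
parts <- list(
    body = txt[seq_len(brk - 1)],
    references = txt[brk:length(txt)]
)

parts
```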

Now we can extract the in-text APA 6 citations and view frequencies:

## Extract citations in order of appearance
rm_citation(unbag(parts[[1]]), extract=TRUE)[[1]]

## Extract citations by section 
rm_citation(parts[[1]], extract=TRUE)

## Frequency
left_just(cites <- list2df(sort(table(rm_citation(unbag(parts[[1]]),
    extract=TRUE)), TRUE), "freq", "citation")[2:1])

##    citation                                                   freq
## 1  Walker, 2008                                                 14
## 2  Flesch (1955)                                                 2
## 3  Adams (1990)                                                  1
## 4  Anderson, Hiebert, Scott, and Wilkinson (1985)                1
## 5  Baumann & Hoffman, 1998                                       1
## 6  Baumann, 1998                                                 1
## 7  Bond and Dykstra (1967)                                       1
## 8  Chall (1967)                                                  1
## 9  Clay (1979)                                                   1
## 10 Goodman and Goodman (1979)                                    1
## 11 McCormick & Braithwaite, 2008                                 1
## 12 Read Adams (1990)                                             1
## 13 Stahl and Miller (1989)                                       1
## 14 Stahl and Millers (1989)                                      1
## 15 Word Perception Intrinsic Phonics Instruction Gates (1951)    1
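To see what extraction plus counting amounts to, here is a base R sketch on a made-up sentence. The pattern is a deliberately simplified assumption (one or two capitalized surnames followed by a year), not the pattern rm_citation uses, and like any simple pattern it would over-match on real text, much as rows 12 and 15 above show.

```r
# Made-up text with parenthetical and narrative citations
txt <- paste(
    "Reading wars began early (Walker, 2008).",
    "Chall (1967) reviewed the evidence, and Bond and Dykstra (1967)",
    "compared programs (Walker, 2008)."
)

# Simplified pattern (an assumption for this example):
# one surname, optionally "and"/"&" plus a second surname,
# then either " (YYYY)" or ", YYYY"
authors <- "[A-Z][a-z]+(?: (?:and|&) [A-Z][a-z]+)?"
pattern <- paste0(authors, "(?: \\(\\d{4}\\)|, \\d{4})")

# Extract matches in order of appearance
found <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# Tabulate and sort by frequency, most frequent first
freq <- sort(table(found), decreasing = TRUE)
cites <- data.frame(
    citation = names(freq),
    freq = as.vector(freq),
    stringsAsFactors = FALSE
)

cites
```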

Now we can find the locations of the citations in the text and plot a distribution of the in-text citations throughout the text:

## Distribution of citations (find locations)
cite_locs <- do.call(rbind, lapply(cites[[1]], function(x){
    m <- gregexpr(x, unbag(parts[[1]]), fixed=TRUE)
    data.frame(
        citation=x,
        start = m[[1]] -5,
        end =  m[[1]] + 5 + attributes(m[[1]])[["match.length"]]
    )
}))

## Plot the distribution
ggplot(cite_locs) +
    geom_segment(aes(x=start, xend=end, y=citation, yend=citation), size=3,
        color="yellow") +
    xlab("Duration") +
    scale_x_continuous(expand = c(0,0),
        limits = c(0, nchar(unbag(parts[[1]])) + 25)) +
    theme_grey() +
    theme(
        panel.grid.major=element_line(color="grey20"),
        panel.grid.minor=element_line(color="grey20"),
        plot.background = element_rect(fill="black"),
        panel.background = element_rect(fill="black"),
        panel.border = element_rect(colour = "grey50", fill=NA, size=1),
        axis.text=element_text(color="grey50"),    
        axis.title=element_text(color="grey50")  
    )
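The location step is easier to follow on a small example. The sketch below uses a made-up sentence and reports the exact start and end of each match; the code above additionally pads each side by 5 characters, presumably so that short segments stay visible in the plot.

```r
# Made-up text containing the same citation twice
body <- "Early work (Walker, 2008) differs from later work (Walker, 2008)."

# fixed = TRUE treats the citation as a literal string, so the
# parentheses and commas in citations are not read as regex syntax
m <- gregexpr("Walker, 2008", body, fixed = TRUE)[[1]]

locs <- data.frame(
    citation = "Walker, 2008",
    start = as.vector(m),                               # first character of each match
    end = as.vector(m) + attr(m, "match.length") - 1    # last character of each match
)

locs

# Sanity check: the located spans reproduce the citation text
substring(body, locs$start, locs$end)
```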



About tylerrinker

I am a Literacy PhD student with a bent for the quantitative and a passion for R.

2 Responses to Visualizing APA 6 Citations: qdapRegex 0.2.0 & qdapTools 1.1.0

  1. msharp2013 says:

    In the documentation to TC, there is an incomplete sentence. It is as follows:
    TC utilizes additional rules for capitalization beyond stri_trans_totitle that includes
