rmarkdown: Alter Action Depending on Document

Can I see a show of hands for those who love rmarkdown? Yeah, me too. One nifty feature is the ability to specify various output formats and their prettifications in the YAML of a .Rmd document and then render them all at once with:

rmarkdown::render("foo.Rmd", "all")



The Problem

Have you ever said, “I wish I could do X for document type A and Y for document type B”? I have, as seen in this SO question from late August. But my plea went unanswered until today…


The Solution

Baptiste Auguie answered a similar question on SO. The key to Baptiste’s answer is this:

```{r, echo=FALSE}
out_type <- knitr::opts_knit$get("rmarkdown.pandoc.to")
```

This basically says, “Document, figure out what type you are.” You can then feed this information to if () {} else {}, switch(), etc. and act differently depending on the type of document being rendered. If Baptiste is correct, the options and flexibility are endless.
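For instance, a chunk like this (a hypothetical example of the branching, not taken from Baptiste’s answer) would insert a page break only when rendering to LaTeX/PDF:

```{r, results='asis', echo=FALSE}
## hypothetical chunk: add a page break only when rendering to LaTeX/PDF
if (out_type == "latex") cat("\\newpage")
```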

I decided to put Baptiste’s answer to the test on more complex scenarios. Here it is as a GitHub repo that you can fork and/or download and try at home.



Simple Example

To get a sense of how this works, let’s start with a simple example. I will assume some familiarity with rmarkdown and the YAML system. Here we grab the info from knitr::opts_knit$get("rmarkdown.pandoc.to") and feed it to a switch() statement to act differently for a latex, docx, or html document.

---
title: "For Fun"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: true
    theme: journal
    number_sections: true
  pdf_document:
    toc: true
    number_sections: true
  word_document:
    fig_width: 5
    fig_height: 5
    fig_caption: true
---

```{r, echo=FALSE}
out_type <- knitr::opts_knit$get("rmarkdown.pandoc.to")
```

## Out Type

```{r, echo=FALSE}
print(out_type)
```

## Good times

```{r, results='asis', echo=FALSE}
switch(out_type,
    html = "I'm HTML",
    docx = "I'm MS Word",
    latex = "I'm LaTeX"
)
```

Here is the result for each document type, rendered using rmarkdown::render("simple.Rmd", "all"):

 

[Figure: simple.Rmd rendered as an HTML document]

[Figure: simple.Rmd rendered as a LaTeX (PDF) document]

[Figure: simple.Rmd rendered as a docx (MS Word) document]

 


Extended Example

That’s great, but my boss ain’t gonna be impressed with printing different statements. Let’s put this to the test. I want to embed a video into the HTML and PDF (LaTeX) documents, and just put a URL in the MS Word (docx) document. By the way, if someone has a way to programmatically embed the video in the docx file, please share.

For this setup we can use a standard iframe for HTML and the media9 package for the LaTeX version to add a YouTube video. Note that not all PDF viewers can render the video (Adobe worked for me; PDF XChange Viewer did not). We also have to include the media9 package via a .sty file (a quasi preamble) using these lines in the YAML:

    includes:
      in_header: preambleish.sty

And then create a separate .sty file that includes the LaTeX package calls and other typical preamble actions.
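The .sty can be minimal; for the video example below, all it really needs to do is load media9 (a sketch of what preambleish.sty might contain, since the actual file lives in the repo):

    \usepackage{media9}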

---
title: "For Fun"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: true
    theme: journal
    number_sections: true
  pdf_document:
    toc: true
    number_sections: true
    includes:
      in_header: preambleish.sty
  word_document:
    fig_width: 5
    fig_height: 5
    fig_caption: true
---

```{r, echo=FALSE}
out_type <- knitr::opts_knit$get("rmarkdown.pandoc.to")
```

## Out Type

```{r, echo=FALSE}
print(out_type)
```

## Good times

```{r, results='asis', echo=FALSE}
switch(out_type,
    html = {cat('<a href="https://www.youtube.com/embed/FnblmZdTbYs?feature=player_detailpage">https://www.youtube.com/embed/FnblmZdTbYs?feature=player_detailpage</a>')},
    docx = cat("https://www.youtube.com/watch?v=ekBJgsfKnlw"),
	latex = cat("\\begin{figure}[!ht]
  \\centering
\\includemedia[
  width=0.6\\linewidth,height=0.45\\linewidth,
  activate=pageopen,
  flashvars={
    modestbranding=1 % no YT logo in control bar
   &autohide=1       % controlbar autohide
   &showinfo=0       % no title and other info before start
  }
]{}{http://www.youtube.com/v/ekBJgsfKnlw?rel=0}   % Flash file
  \\caption{Important Video.}
\\end{figure}" )
)
```
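Note that the html branch above just prints a link; if you would rather have the player embedded directly in the HTML output (the “standard iframe” mentioned earlier), that branch could instead cat an iframe along these lines (a sketch, not the repo’s code):

html = cat('<iframe width="560" height="315" src="https://www.youtube.com/embed/ekBJgsfKnlw" frameborder="0" allowfullscreen></iframe>')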

Here is the result for each document type, rendered using rmarkdown::render("extended.Rmd", "all"):

[Figure: extended.Rmd rendered as an HTML document]

[Figure: extended.Rmd rendered as a LaTeX (PDF) document]

[Figure: extended.Rmd rendered as a docx (MS Word) document]

 

I hope this post extends flexibility and speeds up workflow for folks. Thanks to @Baptiste for a terrific solution.


Exploration of Letter Make Up of English Words

This blog post is a quick exploration of the grapheme makeup of words in the English language. Specifically, we will use R and the qdap package to answer 3 questions:

  1. What is the distribution of word lengths (number of graphemes)?
  2. What is the frequency of letter (grapheme) use in English words?
  3. What is the distribution of letters positioned within words?

Click HERE for a script with all of the code for this post.


We will begin by loading the necessary packages and data (note you will need qdap 2.2.0 or higher):

if (!packageVersion("qdap") >= "2.2.0") {
    install.packages("qdap")	
}
library(qdap); library(qdapDictionaries); library(ggplot2); library(dplyr)
data(GradyAugmented)

The Dictionary: Augmented Grady

We will be using qdapDictionaries::GradyAugmented to conduct the mini-analysis. The GradyAugmented list is an augmented version of Grady Ward’s English words with additions from various other sources, including Mark Kantrowitz’s names list. The result is a character vector of 122,806 English words and proper nouns.

GradyAugmented
?GradyAugmented

Question 1

What is the distribution of word lengths (number of graphemes)?

To answer this we will use base R’s summary, qdap’s dist_tab function, and a ggplot2 histogram.

summary(nchar(GradyAugmented))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    6.00    8.00    7.87    9.00   21.00 
dist_tab(nchar(GradyAugmented))
   interval  freq cum.freq percent cum.percent
1         1    26       26    0.02        0.02
2         2   116      142    0.09        0.12
3         3  1085     1227    0.88        1.00
4         4  4371     5598    3.56        4.56
5         5  9830    15428    8.00       12.56
6         6 16246    31674   13.23       25.79
7         7 23198    54872   18.89       44.68
8         8 27328    82200   22.25       66.93
9         9 17662    99862   14.38       81.32
10       10  9777   109639    7.96       89.28
11       11  5640   115279    4.59       93.87
12       12  3348   118627    2.73       96.60
13       13  2052   120679    1.67       98.27
14       14  1066   121745    0.87       99.14
15       15   582   122327    0.47       99.61
16       16   268   122595    0.22       99.83
17       17   136   122731    0.11       99.94
18       18    50   122781    0.04       99.98
19       19    17   122798    0.01       99.99
20       20     5   122803    0.00      100.00
21       21     3   122806    0.00      100.00
ggplot(data.frame(nletters = nchar(GradyAugmented)), aes(x=nletters)) + 
    geom_histogram(binwidth=1, colour="grey70", fill="grey60") +
    theme_minimal() + 
    geom_vline(xintercept = mean(nchar(GradyAugmented)), size=1, 
        colour="blue", alpha=.7) + 
    xlab("Number of Letters")

[Figure: histogram of word lengths (number of letters)]

Here we can see that the average word length is 7.87 letters, with a minimum of 1 (expected) and a maximum of 21 letters. The histogram indicates the distribution is skewed slightly right.

Question 2

What is the frequency of letter (grapheme) use in English words?

Now we will view the overall letter use in the augmented Grady word list. Wheel of Fortune lovers…how will r, s, t, l, n, e fare? Here we will double loop through each word and each letter of the alphabet and grab the positions of the letters in the words using gregexpr. gregexpr is a nifty function that gives the starting locations of regular expression matches. At this point the positioning isn’t necessary for answering the 2nd question, but we’re setting ourselves up to answer the 3rd question. We’ll then use a frequency table and ordered bar chart to see the frequency of letters in the word list.
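To see the building block of that loop, here is what gregexpr hands back for a single letter/word pair:

gregexpr("t", "statistics", fixed = TRUE)[[1]]
## match start positions 2, 4, and 7 (with match-length attributes)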

Be patient with the double loop (lapply/sapply); it is 122,806 words and takes ~1 minute to run.

position <- lapply(GradyAugmented, function(x){

    z <- unlist(sapply(letters, function(y){
        gregexpr(y, x, fixed = TRUE)
    }))
    z <- z[z != -1] 
    setNames(z, gsub("\\d", "", names(z)))
})


position2 <- unlist(position)

freqdat <- dist_tab(names(position2))
freqdat[["Letter"]] <- factor(toupper(freqdat[["interval"]]), 
    levels=toupper((freqdat %>% arrange(freq))[[1]] %>% as.character))

ggplot(freqdat, aes(Letter, weight=percent)) + 
  geom_bar() + coord_flip() +
  scale_y_continuous(breaks=seq(0, 12, 2), labels=function(x) paste0(x, "%"), 
      expand = c(0,0), limits = c(0,12)) +
  theme_minimal()

[Figure: ordered bar chart of letter frequency (percent)]

The output is given as the percentage of letter uses. Let’s see if that jibes with the points one gets in a Scrabble game for various tiles:
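The tile-point chart isn’t reproduced here, so below is a quick sketch (my addition, not part of the original script) that hard-codes the standard English Scrabble values and lines them up against the percentages in freqdat:

## Standard English Scrabble tile values (hard coded here; assumption, not qdap data)
scrabble <- c(e=1, a=1, i=1, o=1, n=1, r=1, t=1, l=1, s=1, u=1,
    d=2, g=2, b=3, c=3, m=3, p=3, f=4, h=4, v=4, w=4, y=4,
    k=5, j=8, x=8, q=10, z=10)

## Line the points up with the letter percentages computed above
compare <- data.frame(
    Letter = toupper(as.character(freqdat[["interval"]])),
    percent = freqdat[["percent"]],
    points = unname(scrabble[as.character(freqdat[["interval"]])])
)
compare[order(-compare$percent), ]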

Overall, yeah I suppose the Scrabble point system makes sense. However, it makes me question why the “K” is worth 5 and why “Y” is only worth 3. I’m sure more thought went into the creation of Scrabble than this simple analysis**.

**EDIT: I came across THIS BLOG POST indicating that perhaps the point values of Scrabble tiles are antiquated.  

Question 3

What is the distribution of letters positioned within words?

Now we will use a heat map to tackle the question of which letters are found in which positions. I like the blue = high / yellow = low configuration of heat maps. For me it is a good contrast, but you may not agree. Please switch the high/low colors if they don’t suit.

dat <- data.frame(letter=toupper(names(position2)), position=unname(position2))

dat2 <- table(dat)
dat3 <- t(round(apply(dat2, 1, function(x) x/sum(x)), digits=3) * 100)
qheat(apply(dat2, 1, function(x) x/length(position2)), high="blue", 
    low="yellow", by.column=NULL, values=TRUE, digits=3, plot=FALSE) +
    ylab("Letter") + xlab("Position") + 
    guides(fill=guide_legend(title="Proportion"))

[Figure: heat map of letter proportions by position within words]

The letters “S” and “C” dominate the first position. Interestingly, vowels and the consonants “R” and “N” lead the second spot. I’m guessing the latter is due to consonant blends. The letter “S” likes most spots except the second spot. This appears to be similar, though less pronounced, for other popular consonants. The letter “R”, if this were a baseball team, would be the utility player, able to do well in multiple positions. One last observation…don’t put “H” in the third position.



*Created using the reports package


Canned Regular Expressions: qdapRegex 0.1.2 on CRAN

We’re pleased to announce the first CRAN release of qdapRegex! You can read about qdapRegex below or skip right to the examples.

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis. The package uses a dictionary system to uniformly perform extraction, removal, and replacement.  Tools include removal/extraction/replacement of abbreviations, dates, dollar amounts, email addresses, hash tags, numbers, percentages, person tags, phone numbers, times, and zip codes.
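For a peek at how the dictionary system works, here is a small sketch (grab and the regex_usa dictionary are how I understand the package’s exports):

library(qdapRegex)

## Pull a canned pattern out of the default dictionary
grab("@rm_url")

## The dictionaries themselves are just named lists of regular expressions
names(regex_usa)[1:6]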

The qdapRegex package does not aim to compete with string manipulation packages such as stringr or stringi, but is meant to provide access to canned, common regular expression patterns that can be used within qdapRegex, with R’s own regular expression functions, or with add-on string manipulation packages such as stringr and stringi.

You can download it from CRAN or from GitHub.

 

 

Examples

Let’s see qdapRegex in action. Functions starting with rm_ generally remove the canned regular expression pattern they name; with extract = TRUE the matches are extracted instead. A replacement argument also allows for optional replacements.

URLs

library(qdapRegex)
x <- "I like www.talkstats.com and http://stackoverflow.com"

## Removal
rm_url(x)

## Extraction
rm_url(x, extract=TRUE)

## Replacement
rm_url(x, replacement = '<a href="\\1" target="_blank">\\1</a>')
## Removal
## [1] "I like and"
## Extraction
## [[1]]
## [1] "www.talkstats.com"        "http://stackoverflow.com"
## 
## Replacement
## [1] "I like <a href=\"\" target=\"_blank\"></a> and <a href=\"http://stackoverflow.com\" target=\"_blank\">http://stackoverflow.com</a>"

Twitter Hash Tags

x <- c("@hadley I like #rstats for #ggplot2 work.",
    "Difference between #magrittr and #pipeR, both implement pipeline operators for #rstats:
        http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio",
    "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization
        presentation #user2014. http://ramnathv.github.io/user2014-rcharts/#1"
)

rm_hash(x)
rm_hash(x, extract=TRUE)
## > rm_hash(x)
## [1] "@hadley I like for work."                                                                                                                                  
## [2] "Difference between and , both implement pipeline operators for : http://renkun.me/r/2014/07/26/difference-between-magrittr-and-pipeR.html @timelyportfolio"
## [3] "Slides from great talk: @ramnath_vaidya: Interactive slides from Interactive Visualization presentation . http://ramnathv.github.io/user2014-rcharts/#1" 

  
## > rm_hash(x, extract=TRUE)
## [[1]]
## [1] "#rstats"  "#ggplot2"
## 
## [[2]]
## [1] "#magrittr" "#pipeR"    "#rstats"  
## 
## [[3]]
## [1] "#user2014"

Emoticons

x <- c("are :-)) it 8-D he XD on =-D they :D of :-) is :> for :o) that :-/",
  "as :-D I xD with :^) a =D to =) the 8D and :3 in =3 you 8) his B^D was")
rm_emoticon(x)
rm_emoticon(x, extract=TRUE)
## > rm_emoticon(x)
## [1] "are it he on they of is for that"     
## [2] "as I with a to the and in you his was"


## > rm_emoticon(x, extract=TRUE)
## [[1]]
## [1] ":-))" "8-D"  "XD"   "=-D"  ":D"   ":-)"  ":>"   ":o)"  ":-/" 
## 
## [[2]]
##  [1] ":-D" "xD"  ":^)" "=D"  "=)"  "8D"  ":3"  "=3"  "8)"  "B^D"

Academic, APA 6 Style, Citations

x <- c("Hello World (V. Raptor, 1986) bye",
    "Narcissism is not dead (Rinker, 2014)",
    "The R Core Team (2014) has many members.",
    paste("Bunn (2005) said, \"As for elegance, R is refined, tasteful, and",
        "beautiful. When I grow up, I want to marry R.\""),
    "It is wrong to blame ANY tool for our own shortcomings (Baer, 2005).",
    "Wickham's (in press) Tidy Data should be out soon.",
    "Rinker's (n.d.) dissertation not so much.",
    "I always consult xkcd comics for guidance (Foo, 2012; Bar, 2014).",
    "Uwe Ligges (2007) says, \"RAM is cheap and thinking hurts\""
)

rm_citation(x)
rm_citation(x, extract=TRUE)
## > rm_citation(x)
## [1] "Hello World () bye"                                                                                  
## [2] "Narcissism is not dead ()"                                                                           
## [3] "has many members."                                                                                   
## [4] "said, \"As for elegance, R is refined, tasteful, and beautiful. When I grow up, I want to marry R.\""
## [5] "It is wrong to blame ANY tool for our own shortcomings ()."                                          
## [6] "Tidy Data should be out soon."                                                                       
## [7] "dissertation not so much."                                                                           
## [8] "I always consult xkcd comics for guidance (; )."                                                     
## [9] "says, \"RAM is cheap and thinking hurts\""     

                                                      
## > rm_citation(x, extract=TRUE)
## [[1]]
## [1] "V. Raptor, 1986"
## 
## [[2]]
## [1] "Rinker, 2014"
## 
## [[3]]
## [1] "The R Core Team (2014)"
## 
## [[4]]
## [1] "Bunn (2005)"
## 
## [[5]]
## [1] "Baer, 2005"
## 
## [[6]]
## [1] "Wickham's (in press)"
## 
## [[7]]
## [1] "Rinker's (n.d.)"
## 
## [[8]]
## [1] "Foo, 2012" "Bar, 2014"
## 
## [[9]]
## [1] "Uwe Ligges (2007)"

Combining Regular Expressions

A user may wish to combine regular expressions. For example, one may want to extract all URLs and Twitter short URLs. The verb pastex (paste + regex) pastes together regular expressions. It will also search the regex dictionaries for named regular expressions prefixed with @. So…

pastex("@rm_twitter_url", "@rm_url")

yields…

## [1] "(https?://t\\.co[^ ]*)|(t\\.co[^ ]*)|(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

If we combine this ability with qdapRegex’s function generator, rm_, we can make our own function that removes both standard URLs and Twitter short URLs.

rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))

Let’s use it…

rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))

x <- c("download file from http://example.com",
         "this is the link to my website http://example.com",
         "go to http://example.com from more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_twitter_n_url(x)
rm_twitter_n_url(x, extract=TRUE)
## > rm_twitter_n_url(x)
## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"                   
## [5] "And"                            "twitter type:"                 
## [7] "still another one :-)"     

    
## > rm_twitter_n_url(x, extract=TRUE)
## [[1]]
## [1] "http://example.com"
## 
## [[2]]
## [1] "http://example.com"
## 
## [[3]]
## [1] "http://example.com"
## 
## [[4]]
## [1] "ftp://www.example.com"
## 
## [[5]]
## [1] "https://www.example.net"
## 
## [[6]]
## [1] "t.co/N1kq0F26tG"
## 
## [[7]]
## [1] "https://t.co/N1kq0F26tG"

*Note that there is a binary operator version of pastex, %|%, that may be more useful to some folks.

"@rm_twitter_url" %|% "@rm_url"

yields…

## > "@rm_twitter_url" %|% "@rm_url"
## [1] "(https?://t\\.co[^ ]*)|(t\\.co[^ ]*)|(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

Educational

Regular expressions can be extremely powerful but were difficult for me to grasp at first.

The qdapRegex package serves a dual purpose of being both functional and educational. While the canned regular expressions are useful in and of themselves, they also serve as a platform for understanding regular expressions in the context of meaningful, purposeful usage. In the same way that I learned guitar by trying to mimic Eric Clapton, not by learning scales and theory, some folks may enjoy learning regular expressions through a more pragmatic, experiential interaction. Users are encouraged to look at the regular expressions being used (?regex_usa and ?regex_supplement are the default regular expression dictionaries used by qdapRegex) and unpack how they work. I have found that slow, repeated exposures to information in a purposeful context result in acquired knowledge.

The following regular expressions sites were very helpful to my own regular expression education:

  1. Regular-Expression.info
  2. Rex Egg
  3. Regular Expressions as used in R

Being able to discuss and ask questions is also important to learning…in this case regular expressions. I have found the following forums extremely helpful to learning about regular expressions:

  1. Talk Stats + Posting Guidelines
  2. stackoverflow + Posting Guidelines

Acknowledgements

Thank you to the folks who have developed stringi (maintainer: Marek Gagolewski). The stringi package provides fast, consistent regular expression manipulation tools. qdapRegex uses the stringi package as a back end in most of the functions prefixed with rm_.

We would also like to thank the many folks at http://www.talkstats.com and http://www.stackoverflow.com that freely give of their time to answer questions around many topics, including regular expressions.


Spell Checker for R…qdap::check_spelling

I have often had requests for a spell checker for R character vectors. The utils::aspell function can be used to check spelling, but many Windows users have reported difficulty with the function.

I came across an article on spelling in R entitled “Watch Your Spelling!” by Kurt Hornik and Duncan Murdoch. The paper walks us through definitions of spell checking, history, and a suggested spell checker implementation for R. A terrific read. Hornik & Murdoch (2010) end with the following call:

Clearly, more work will be needed: modern statistics needs better lexical resources, and a dictionary based on the most frequent spell check false alarms can only be a start. We hope that this article will foster community interest in contributing to the development of such resources, and that refined domain specific dictionaries can be made available and used for improved text analysis with R in the near future (p. 28).

I answered a question on stackoverflow.com a few months back that led to creating a suite of spell checking functions. The original functions used an agrep approach that was slow and inaccurate. I then discovered Mark van der Loo’s terrific stringdist package to do the heavy lifting. It calculates string distances very quickly with various methods.
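As a tiny illustration of what stringdist computes under the hood (a sketch of my own, not qdap internals):

library(stringdist)

## Distance between some misspellings and their intended words
stringdist(c("evl", "creatres", "deserv"),
    c("evil", "creatures", "deserve"), method = "dl")
## 1 1 1 -- each misspelling is a single edit away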

The rest of this blog post is meant as a minimal introduction to qdap‘s spell checking functions. A video will lead you through most of the process and accompanying scripts are provided.

Primitive Spell Checking Function

The which_misspelled function is a low-level function that determines whether each word of a single string is in a dictionary. It optionally gives suggested corrections.

library(qdap)
x <- "Robots are evl creatres and deserv exterimanitation."
which_misspelled(x, suggest=FALSE)
which_misspelled(x, suggest=TRUE)

Interactive Spell Checking

Typically a user will want to use the interactive spell checker (check_spelling_interactive) as it is more flexible and accurate.

dat <- DATA$state
dat[1] <- "Jasperita I likedd the cokie icekream"
dat
##  [1] "Jasperita I likedd the cokie icekream"
##  [2] "No it's not, it's dumb."              
##  [3] "What should we do?"                   
##  [4] "You liar, it stinks!"                 
##  [5] "I am telling the truth!"              
##  [6] "How can we be certain?"               
##  [7] "There is no way."                     
##  [8] "I distrust you."                      
##  [9] "What are you talking about?"          
## [10] "Shall we move on?  Good then."        
## [11] "I'm hungry.  Let's eat.  You already?"
(o <- check_spelling_interactive(dat))
preprocessed(o)
fixit <- attributes(o)$correct
fixit(dat)

A More Realistic Usage

m <- check_spelling_interactive(mraja1spl$dialogue[1:75])
preprocessed(m)
fixit <- attributes(m)$correct
fixit(mraja1spl$dialogue[1:75])

References

Hornik, K., & Murdoch, D. (2010). Watch Your Spelling! The R Journal, 3(2), 22-28.


Hijacking R Functions: Changing Default Arguments

I am working on a package to collect common regular expressions into a canned collection that users can easily use without having to know regexes. The package, qdapRegex, has a bunch of functions of the form rm_xxx. The only difference between the functions is one default parameter: the regular expression pattern. I had a default template function, so what I really needed was to copy that template many times and change one parameter. It seems wasteful of code and electronic space to cut and paste the body of the template function over and over again…I needed to hijack the template.

Come on, admit it, you’ve all wished you could hijack a function before. Who hasn’t wished the default to data.frame was stringsAsFactors = FALSE? Or that sum defaulted to na.rm = TRUE (OK, maybe the latter is just me)? So for the task of efficiently hijacking a function and changing the defaults in a manageable, modular way, my mind immediately went to Hadley’s pryr package (Wickham, 2014). I remember him hijacking functions in his Advanced R book, as seen HERE, with the partial function.

It worked, except I couldn’t then change the newly set defaults back. For my package-writing use case this was not a good thing (maybe there was a way and I missed it).


A Function Worth Hijacking

Here’s an example where we attempt to hijack data.frame.

dat <- data.frame(x1 = 1:3, x2 = c("a", "b", "c"))
str(dat)  # yuck a string as a factor
## 'data.frame':    3 obs. of  2 variables:
##  $ x1: int  1 2 3
##  $ x2: Factor w/ 3 levels "a","b","c": 1 2 3

Typically we’d do something like:

.data.frame <- function(..., row.names = NULL, check.rows = FALSE, check.names = TRUE,
    stringsAsFactors = FALSE) {

    data.frame(..., row.names = row.names, check.rows = check.rows,
        check.names = check.names, stringsAsFactors = stringsAsFactors)

}

dat <- .data.frame(x1 = 1:3, x2 = c("a", "b", "c"))
str(dat)  # yay!  strings are character
## 'data.frame':    3 obs. of  2 variables:
##  $ x1: int  1 2 3
##  $ x2: chr  "a" "b" "c"

But for my qdapRegex needs this required a ton of cut and paste. That means lots of extra code in the .R files.


The First Attempt to Hijack a Function

pryr to the rescue

library(pryr)

## The hijack
.data.frame <- pryr::partial(data.frame, stringsAsFactors = FALSE)

dat <- .data.frame(x1 = 1:3, x2 = c("a", "b", "c"))
str(dat)  # yay! strings are character
## 'data.frame':    3 obs. of  2 variables:
##  $ x1: int  1 2 3
##  $ x2: chr  "a" "b" "c"

But I can’t change the default back…

.data.frame(x1 = 1:3, x2 = c("a", "b", "c"), stringsAsFactors = TRUE)
## Error: formal argument "stringsAsFactors" matched by multiple actual
## arguments

Hijacking In Style (formals)

Doomed…

After tinkering with many not-so-reasonable solutions I asked on stackoverflow.com. In a short time MrFlick responded most helpfully (as he often does) with an answer that used formals to change the formal arguments of a function. I should have thought of it myself as I’d seen its use in Advanced R as well.

Here I use the answer to make a hijack function. It does exactly what I want: take a function and reset its formal arguments as desired.

hijack <- function (FUN, ...) {
    .FUN <- FUN
    args <- list(...)
    invisible(lapply(seq_along(args), function(i) {
        formals(.FUN)[[names(args)[i]]] <<- args[[i]]
    }))
    .FUN
}

Let’s see it in action as it changes the defaults but allows the user to still set these arguments…

.data.frame <- hijack(data.frame, stringsAsFactors = FALSE)

dat <- .data.frame(x1 = 1:3, x2 = c("a", "b", "c"))
str(dat)  # yay! strings are character
## 'data.frame':    3 obs. of  2 variables:
##  $ x1: int  1 2 3
##  $ x2: chr  "a" "b" "c"
.data.frame(x1 = 1:3, x2 = c("a", "b", "c"), stringsAsFactors = TRUE)
##   x1 x2
## 1  1  a
## 2  2  b
## 3  3  c

Note that for some purposes Dason suggested an alternative solution that is similar to the first approach I described above but requires less copying, as it uses the ellipsis (...) to cover the parameters that we don’t want to change. This approach would look something like this:

.data.frame <- function(..., stringsAsFactors = FALSE) {

    data.frame(..., stringsAsFactors = stringsAsFactors)

}

dat <- .data.frame(x1 = 1:3, x2 = c("a", "b", "c"))
str(dat)  # yay!  strings are character
## 'data.frame':    3 obs. of  2 variables:
##  $ x1: int  1 2 3
##  $ x2: chr  "a" "b" "c"
.data.frame(x1 = 1:3, x2 = c("a", "b", "c"), stringsAsFactors = TRUE)
##   x1 x2
## 1  1  a
## 2  2  b
## 3  3  c

This is less verbose than my first approach. However, this solution was not the best for me in that I wanted to document all of the arguments to the function for the package. I believe using this approach would limit me to documenting the arguments ... and stringsAsFactors (though I didn’t try it with CRAN checks). Depending on the situation, though, this approach may be ideal.


*Created using the reports package


qdap 2.1.1 Released

We’re very pleased to announce the release of qdap 2.1.1

What is qdap?

qdap (Quantitative Discourse Analysis Package) is an R package designed to assist in quantitative discourse analysis. The package stands as a bridge between qualitative transcripts of dialogue and statistical analysis & visualization. qdap is designed for transcript analysis, however, many functions are applicable to other areas of Text Mining/Natural Language Processing.

qdap Version 2.1.1


This is the latest installment of the qdap package available at CRAN. Several important updates have occurred since we last blogged about the 1.3.1 release (demoing Version 1.3.1), most notably:

  1. Text checking that simplifies and standardizes (includes spell checking and a Cleaning Text & Debugging vignette)
  2. Added many plot methods including animated and network plotting
  3. Improved tm package compatibility
  4. The Introduction to qdap .Rmd vignette has been moved to an internal directory to save CRAN space and time checking the package source. The user may use the build_qdap_vignette function directly to build the vignette
  5. qdap officially begins utilizing the testthat package for unit testing
  6. qdapTools splits off of qdap with non-text specific tools
  7. New text analysis and processing functions

Installation

install.packages("qdap")

Changes up to qdap Version 2.1.1

A complete list of changes since Version 1.3.1 can be found in the NEWS.md file HERE

Future Focus

As the focus of my own research and dissertation has moved toward the graphical analysis of discourse, qdap will reflect this shift. We plan to demonstrate some of the features of qdap that have been added since our last blog article (demoing Version 1.3.1) via a series of small blog articles over the next month.


*Created using the reports package


What Would Cohen Have Titled “The Earth is Round (p < .05)” in 2014?

The area of bibliometrics is not my area of expertise, but it is still of interest to me as a researcher. I sometimes think about how Google has impacted the way we title articles. Gone are the days of witty, snappy titles. Title selection is still an art form, but of a different kind. Generally, researchers try to construct titles from the most searchable keywords. While trying to title an article today, I came upon an Internet article entitled Heading for Success: Or How Not to Title Your Paper.

According to the article, to increase citation rates, a title should:

  1. Contain no ? or !
  2. May contain a :
  3. Should be between 31 and 40 characters
  4. Avoid humor/pun

In seeing:

…some authors are tempted to spice them up with a touch of humour, which may be a pun, a play on words, or an amusing metaphor. This, however, is a risky strategy.

my mind went to the classic Jacob Cohen (1994) paper entitled The Earth is Round (p < .05). In 1994 the world was different; Google didn't exist yet. I ask, “What if Cohen had to title his classic paper in 2014?” What would it look like?


Keywords: Mining “The Earth is Round (p < .05)”

I set to work by grabbing the paper's content and converting to plain text. Then I decided to tease out the most frequent terms after stemming and removing stopwords. Here's the script I used:

library(qdap); library(RCurl); library(wordcloud); library(ggplot2)

cohen_url <- "https://raw.githubusercontent.com/trinker/cohen_title/master/data/Cohen1994.txt"
cohen <- getURL(cohen_url, ssl.verifypeer = FALSE)

## remove reference section and title
cohen <- substring(strsplit(cohen, "REFERENCES")[[c(1, 1)]], 34)

## convert format so we can eliminate strange characters
cohen <- iconv(cohen, "", "ASCII", "byte")

## replacement parts
bads <- c("-", "<e2><80><9c>", "<e2><80><9d>", "<e2><80><98>", 
    "<e2><80><99>", "<e2><80><9b>", "<ef><bc><87>", "<e2><80><a6>", 
    "<e2><80><93>", "<e2><80><94>", "<c3><a1>", "<c3><a9>", 
    "<c2><bd>", "<ef><ac><81>", "<c2><a7>", "<ef><ac><82>", 
    "<ef><ac><81>", "<c2><a2>", "/j")

goods <- c(" ", " ", " ", "'", "'", "'", "'", "...", " ", 
    " ", "a", "e", "half", "fi", " | ", "ff", "ff", " ", "ff")

## sub the bad for the good
cohen <- mgsub(bads, goods, clean(cohen))

## Stem it
cohen_stem <- stemmer(cohen)

## Find top words
(cohen_top_20 <- freq_terms(cohen_stem, top = 20, stopwords = Top200Words))
plot(cohen_top_20)
##    WORD         FREQ
## 1  test           21
## 2  signiffc       19
## 3  research       18
## 4  probabl        17
## 5  size           17
## 6  data           15
## 7  h              15
## 8  effect         14
## 9  p              14
## 10 statist        14
## 11 given          13
## 12 hypothesi      13
## 13 analysi        11
## 14 articl         11
## 15 nhst           11
## 16 null           11
## 17 psycholog      11
## 18 conffdenc      10
## 19 correl         10
## 20 psychologist   10
## 21 result         10
## 22 theori         10

[Figure: bar plot of the top stemmed terms]

library(wordcloud)
with(cohen_top_20, wordcloud(WORD, FREQ))
mtext("Content Cloud: The Earth is Round (p < .05)", col="blue")

[Figure: word cloud, "Content Cloud: The Earth is Round (p < .05)"]


What Would Cohen Have Titled “The Earth is Round (p < .05)”?

So what would Cohen have titled “The Earth is Round (p < .05)” in 2014? Looking at the results… I don't know. It's fun to speculate. Maybe some of you could suggest titles in the comments, but as for me, I still like “The Earth is Round (p < .05)”.


Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997-1003. doi:10.1037/0003-066X.49.12.997


Handling @S3method’s Death in roxygen2 Version 4.0.0

This is a quickie post and specific to package maintainers who use roxygen2.

Legal Disclaimer: This worked for me but make sure there’s not a corner case that would make it not work for you.  In other words back your stuff up, think it through, and tread lightly.

Welp, I updated to the latest version of roxygen2, 4.0.0. It works well, with some new niceties and handling. Anyway, after you use it for the first time, if you have @S3method in your code it throws a warning along the lines of:

Warning messages:
@S3method is deprecated. Please use @export instead

It's inconvenient to make the change if you have a small number of functions in your package, but a pain in the tush if you have tons. Obviously you’ll want to do this as @S3method is deprecated, but that doesn’t make it hurt any less. It’s kinda like a root canal: it’s for the good of your dental well-being but it’s still painful. But then a thought occurred to me. Why not be lazy efficient? Read in the files, grepl to find "#' @S3", and then replace with "#' @export". I tried it with the following code. You’ll have to supply your own path to the location of the package’s R directory.

Now, this may not be the best approach, hey, it may even be wrong, but I’ll rely on Cunningham’s Law to sort it out:

pth <- "C:/Users/trinker/qdap/R"

fls <- file.path(pth, dir(pth))

FUN <- function(x) {
    cont <- readLines(x)
    ## swap any roxygen lines containing #' @S3 for #' @export
    cont[grepl("#' @S3", cont)] <- "#' @export"
    ## add a trailing newline and overwrite the original file
    cont[length(cont) + 1] <- ""
    cat(paste(cont, collapse="\n"), file=x)
}

lapply(fls, FUN)

Shape File Selfies in ggplot2

In this post you will learn how to:

  1. Create your own quasi-shape file
  2. Plot your homemade quasi-shape file in ggplot2
  3. Add an external svg/ps graphic to a plot
  4. Change a grid grob's color and alpha

*Note: get the simple .md version here


Background (See just code if you don't care much about the process)

I started my journey wanting to replicate a graphic called a space manikin by McNeil (2005) and fill areas in that graphic like a choropleth. I won't share the image from McNeil's book as it's his intellectual property, but know that the graphic is from a gesturing book that divides the body up into zones (p. 275). To get a sense of what the manikin looks like, here is the ggplot2 version of it:

Figure 1: ggplot2 Version of McNeil’s (2005) Space Manikin

While this is a map of areas of a body, you can see how this could be extended to any number of spatial tasks, such as mapping the layout of a room.


1. Creating a Quasi-Shape File

So I figured “zones”, that's about like states on a map. I have toyed with choropleth maps of the US in the past and figured I'd generalize that learning. The difference is that I'd have to make the shape file myself, as the maps package doesn't seem to have McNeil’s space manikin.

Let's look at what ggplot2 needs from the maps package:

library(maps); library(ggplot2)
head(map_data("state"))
##     long   lat group order  region subregion
## 1 -87.46 30.39     1     1 alabama      <NA>
## 2 -87.48 30.37     1     2 alabama      <NA>
## 3 -87.53 30.37     1     3 alabama      <NA>
## 4 -87.53 30.33     1     4 alabama      <NA>
## 5 -87.57 30.33     1     5 alabama      <NA>
## 6 -87.59 30.33     1     6 alabama      <NA>

Hmm, coordinates, names of regions, and the order to connect the coordinates. I figured I could handle that. I don't 100% know what a shape file is, mostly that it’s a file that makes shapes. What we're making may or may not technically be a shape file, but know that we're going to map shapes in ggplot2 (I use the quasi to avoid the wrath of those who do know precisely what a shape file is).

I needed to make the zones around an image of a person, so I first grabbed a free png silhouette from: http://www.flaticon.com/free-icon/standing-frontal-man-silhouette_10633. I then knew I'd need to add some lines and figure out the coordinates of the outlines of each cell. So I read the raster image into R, plotted it in ggplot2, and added lots of grid lines for good measure. Here's what I wound up with:

library(png); library(grid); library(qdap)
url_dl(url="http://i.imgur.com/eZ76jcu.png")
file.rename("eZ76jcu.png", "body.png")
img <- rasterGrob(readPNG("body.png"), 0, 0, 1, 1, just=c("left","bottom"))
ggplot(data.frame(x=c(0, 1), y=c(0, 1)), aes(x=x, y=y)) + 
    geom_point() +
    annotation_custom(img, 0, 1, 0, 1) + 
    scale_x_continuous(breaks=seq(0, 1, by=.05))+ 
    scale_y_continuous(breaks=seq(0, 1, by=.05)) + theme_bw() +
    theme(axis.text.x=element_text(angle = 90, hjust = 0, vjust=0))


Figure 2: Silhouette from ggplot2 With Grid Lines


1b. Dirty Deeds Done Cheap

I needed to get reference lines on the plot so I could begin recording coordinates. Likely there's a better process, but this is how I approached it and it worked. I exported the ggplot in Figure 2 into (GASP) Microsoft Word (I may have just lost a few die-hard command line folks). I added lines there and figured out the coordinates of the lines. It looked something like this:

Figure 3: Silhouette from ggplot2 with MS Word Augmented Border Lines

After that I began the tedious task of figuring out the corners of each of the shapes (“zones”) in the space manikin. Using Figure 3 and a list structure in R, I mapped each of the corners, the approximate shape centers, and the order to plot the coordinates in for each shape. This is the code for the corners:

library(qdap)
dat <- list(
    `01`=data.frame(x=c(.4, .4, .6, .6), y=c(.67, .525, .525, .67)),
    `02`=data.frame(x=c(.35, .4, .6, .65), y=c(.75, .67, .67, .75)),
    `03`=data.frame(x=c(.6, .65, .65, .6), y=c(.525, .475, .75, .67)),
    `04`=data.frame(x=c(.4, .35, .65, .6), y=c(.525, .475, .475, .525)),
    `05`=data.frame(x=c(.35, .35, .4, .4), y=c(.75, .475, .525, .67)),
    `06`=data.frame(x=c(.4, .4, .6, .6), y=c(.87, .75, .75, .87)),
    `07`=data.frame(x=c(.6, .6, .65, .65, .73, .73), y=c(.87, .75, .75, .67, .67, .87)),
    `08`=data.frame(x=c(.65, .65, .73, .73), y=c(.67, .525, .525, .67)),
    `09`=data.frame(x=c(.6, .6, .73, .73, .65, .65), y=c(.475, .28, .28, .525, .525, .475)),
    `10`=data.frame(x=c(.4, .4, .6, .6), y=c(.475, .28, .28, .475)),
    `11`=data.frame(x=c(.27, .27, .4, .4, .35, .35), y=c(.525, .28, .28, .475, .475, .525)),
    `12`=data.frame(x=c(.27, .27, .35, .35), y=c(.67, .525, .525, .67)),
    `13`=data.frame(x=c(.27, .27, .35, .35, .4, .4), y=c(.87, .67, .67, .75, .75, .87)),
    `14`=data.frame(x=c(.35, .35, .65, .65), y=c(1, .87, .87, 1)),
    `15`=data.frame(x=c(.65, .65, .73, .73, 1, 1), y=c(1, .87, .87, .75, .75, 1)),
    `16`=data.frame(x=c(.73, .73, 1, 1), y=c(.75, .475, .475, .75)),
    `17`=data.frame(x=c(.65, .65, 1, 1, .73, .73), y=c(.28, 0, 0, .475, .475, .28)),
    `18`=data.frame(x=c(.35, .35, .65, .65), y=c(.28, 0, 0, .28)),
    `19`=data.frame(x=c(0, 0, .35, .35, .27, .27), y=c(.475, 0, 0, .28, .28, .475)),
    `20`=data.frame(x=c(0, 0, .27, .27), y=c(.75, .475, .475, .75)),
    `21`=data.frame(x=c(0, 0, .27, .27, .35, .35), y=c(1, .75, .75, .87, .87, 1))
)

dat <- lapply(dat, function(x) {
    x$order <- 1:nrow(x)
    x
})

space.manikin.shape <- list_df2df(dat, "id")[, c(2, 3, 1, 4)]

And the code for the centers:

centers <- data.frame(
    id = unique(space.manikin.shape$id),
    center.x=c(.5, .5, .625, .5, .375, .5, .66, .69, .66, .5, .34, .31, 
        .34, .5, .79, .815, .79, .5, .16, .135, .16),
    center.y=c(.597, .71, .5975, .5, .5975, .82, .81, .5975, .39, .3775, .39, 
        .5975, .81, .935, .89, .6025, .19, .14, .19, .6025, .89)
)

There you have it, folks: your very own quasi-shape file. Celebrate the fruits of your labor by plotting that bad Oscar.


2. Plot Your Homemade Quasi-Shape File

 ggplot(centers) + annotation_custom(img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, label = id), color="grey60") 


Figure 4: Plotting the Quasi-Shape File and a Raster Image

Then I decided I might want to tone down the color of the silhouette a bit so I could plot geoms atop it without distraction. Here's that attempt.

img[["raster"]][img[["raster"]] == "#0E0F0FFF"] <- "#E7E7E7"

ggplot(centers) + annotation_custom(img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, label = id), color="grey60") 


Figure 5: Altered Raster Image Color


3. Add an External svg/ps

I quickly realized a raster was messy. I read up a bit on rasters in the R Journal (click here). In the process of reading and fooling around with Picasa, I turned my original silhouette (body.png) blue and couldn't fix him. I headed back to http://www.flaticon.com/free-icon/standing-frontal-man-silhouette_10633 to download another. In doing so I saw that you could download an svg file of the silhouette. I thought maybe this would be less messy and easier to change colors. This led me to a Google search and to finding the grImport package after seeing this listserve post. And then I saw an article from Paul Murrell (2009) and figured I could turn the svg (I didn't realize what svg was until I opened it in Notepad++) into a ps file, read it into R, and convert it to a flexible grid grob.

Probably there are numerous ways to convert an svg to a ps file, but I chose a cloud convert service. After that I read the file in with grImport per the Paul Murrell (2009) article. You're going to have to download the ps file HERE and move it to your working directory.

browseURL("https://github.com/trinker/space_manikin/raw/master/images/being.ps")
## Move that file from your downloads to your working directory.
## Sorry I don't know how to automate this.
library(grImport)

## Convert to xml
PostScriptTrace("being.ps")

## Read back in and convert to a grob
being_img <- pictureGrob(readPicture("being.ps.xml"))

## Plot it
ggplot(centers) + annotation_custom(being_img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, 
        colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="grey60") 


Figure 6: Quasi-Shape File with Grob Image Rather than Raster


4. Change a grid Grob's Color and Alpha

Now that we have a flexible grob, we can mess around with colors and alpha to our heart's content.

str is our friend for figuring out where and how to mess with the grob (str(being_img)). That leads me to the following changes to the image to adjust color and/or alpha (transparency).

being_img[["children"]][[1]][[c("gp", "fill")]] <- 
  being_img[["children"]][[2]][[c("gp", "fill")]] <- "black"

being_img[["children"]][[1]][[c("gp", "alpha")]] <- 
  being_img[["children"]][[2]][[c("gp", "alpha")]] <- .2

## Plot it
ggplot(centers) + annotation_custom(being_img,0,1,0,1) +
    geom_map(aes(map_id = id), map = space.manikin.shape, 
        colour="black", fill=NA) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="grey60") 


Figure 7: Quasi-Shape File with Grob Image Alpha = .2


Let's Have Some Fun

Let's make it into a choropleth and a density plot. We'll make some fake fill values to fill with.

set.seed(10)
centers[, "Frequency"] <- rnorm(nrow(centers))

being_img[["children"]][[1]][[c("gp", "alpha")]] <- 
  being_img[["children"]][[2]][[c("gp", "alpha")]] <- .25

ggplot(centers, aes(fill=Frequency)) +
    geom_map(aes(map_id = id), map = space.manikin.shape, 
        colour="black") +
    scale_fill_gradient2(high="red", low="blue") +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="black") + 
    annotation_custom(being_img,0,1,0,1) 


Figure 8: Quasi-Shape File as a Choropleth

set.seed(10)
centers[, "Frequency2"] <- sample(seq(10, 150, by=20, ), nrow(centers), TRUE)

centers2 <- centers[rep(1:nrow(centers), centers[, "Frequency2"]), ]

ggplot(centers2) +
#       geom_map(aes(map_id = id), map = space.manikin.shape, 
#       colour="grey65", fill="white") +
    stat_density2d(data = centers2, 
        aes(x=center.x, y=center.y, alpha=..level.., 
        fill=..level..), size=2, bins=12, geom="polygon") + 
    scale_fill_gradient(low = "yellow", high = "red") +
    scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
    theme_bw()+ 
    expand_limits(space.manikin.shape) +
    geom_text(data=centers, aes(center.x, center.y, 
        label = id), color="black") + 
    annotation_custom(being_img,0,1,0,1) +
    geom_density2d(data = centers2, aes(x=center.x, 
        y=center.y), colour="black", bins=8, show_guide=FALSE) 


Figure 9: Quasi-Shape File as a Density Plot

Good times were had by all.


Created using the reports (Rinker, 2013) package

Get the .Rmd file here




qdap 1.3.1 Release: Demoing Dispersion Plots, Sentiment Analysis, Easy Hash Lookups, Boolean Searches and More…

We’re very pleased to announce the release of qdap 1.3.1


This is the latest installment of the qdap package available at CRAN. Several important updates have occurred since the 1.1.0 release, most notably the addition of two vignettes and some generic view methods.

The new vignettes include:

  1. An Introduction to qdap
  2. qdap-tm Package Compatibility

The former is a detailed HTML-based guide overviewing the intended use of qdap functions. The second vignette is an explanation of how to move between qdap and tm package forms, as qdap moves to be more compatible with this seminal R text mining package.

To install use:

install.packages("qdap")

Some of the changes in versions 1.2.0-1.3.1 include:


Generic Methods

  • scores generic method added to view scores from select qdap objects.
  • counts generic method added to view counts from select qdap objects.
  • proportions generic method added to view proportions from select qdap objects.
  • preprocessed generic method added to view preprocessed data from select qdap objects.

These methods allow the user to grab particular parts of qdap objects in a consistent fashion. The majority of these methods also pick up a corresponding plot method. This adds to the qdap philosophy that data results should be easy to grab and easy to visualize. For instance:

(x <- question_type(DATA.SPLIT$state, DATA.SPLIT$person))

## methods
scores(x)
plot(scores(x))
counts(x)
plot(counts(x))
proportions(x)
plot(proportions(x))
truncdf(preprocessed(x), 15)
plot(preprocessed(x))

Demoing Some of the New Features

We’d like to take the time to highlight some of the development that has happened in qdap in the past several months:

Dispersion Plots

 wrds <- freq_terms(pres_debates2012$dialogue, stopwords = Top200Words)

## Add leading/trailing spaces if desired
wrds2 <- spaste(wrds)

## Use `~~` to maintain spaces
wrds2 <- c(" governor~~romney ", wrds2[-c(3, 12)])

## Plot
with(pres_debates2012 , dispersion_plot(dialogue, wrds2, rm.vars = time, 
    color="black", bg.color="white")) 

 with(rajSPLIT, dispersion_plot(dialogue, c("love", "night"),
    bg.color = "black", grouping.var = list(fam.aff, sex),
    color = "yellow", total.color = "white", horiz.color="grey20")) 

Word Correlation

 library(tm)
data("crude")
oil_cor1 <- apply_as_df(crude, word_cor, word = "oil", r=.7)
plot(oil_cor1) 

 oil_cor2 <- apply_as_df(crude, word_cor, word = qcv(texas, oil, money), r=.7)
plot(oil_cor2, ncol=2)
 

Easy Hash Table

A Small Example

 lookup(1:5, data.frame(1:4, 11:14))

## [1] 11 12 13 14 NA

## Leave alone elements w/o a match
lookup(1:5, data.frame(1:4, 11:14), missing = NULL) 

## [1] 11 12 13 14  5

Scaled Up 3 Million Records

key <- data.frame(x=1:2, y=c("A", "B"))

##   x y
## 1 1 A
## 2 2 B

big.vec <- sample(1:2, 3000000, T)
out <- lookup(big.vec, key)
out[1:20]

## On my system 3 million records in:
## Time difference of 24.5534 secs

Binary Operator Version

 codes <- list(
    A = c(1, 2, 4), 
    B = c(3, 5),
    C = 7,
    D = c(6, 8:10)
)

1:12 %l% codes

##  [1] "A" "A" "B" "A" "B" "D" "C" "D" "D" "D" NA  NA 

1:12 %l+% codes

##  [1] "A"  "A"  "B"  "A"  "B"  "D"  "C"  "D"  "D"  "D"  "11" "12" 

Simple-Quick Boolean Searches

We’ll be demoing this capability on the qdap data set DATA:

 ##        person                                 state
## 1         sam         Computer is fun. Not too fun.
## 2        greg               No it's not, it's dumb.
## 3     teacher                    What should we do?
## 4         sam                  You liar, it stinks!
## 5        greg               I am telling the truth!
## 6       sally                How can we be certain?
## 7        greg                      There is no way.
## 8         sam                       I distrust you.
## 9       sally           What are you talking about?
## 10 researcher         Shall we move on?  Good then.
## 11       greg I'm hungry.  Let's eat.  You already? 

First a brief explanation from the documentation:

terms – A character string(s) to search for. The terms are arranged in a single string with AND (use AND or && to connect terms together) and OR (use OR or || to allow for searches of either set of terms). Spaces may be used to control what is searched for. For example, using " I " on c("I'm", "I want", "in") will result in FALSE TRUE FALSE, whereas "I" will match all three (if case is ignored).

Let’s see how it works. We’ll start with " I ORliar&&stinks". This will find sentences that contain " I " or that contain "liar" and the word "stinks".

 boolean_search(DATA$state, " I ORliar&&stinks")

## The following elements meet the criteria:
## [1] 4 5 8

boolean_search(DATA$state, " I &&.", values=TRUE)

## The following elements meet the criteria:
## [1] "I distrust you."

boolean_search(DATA$state, " I OR.", values=TRUE)

## The following elements meet the criteria:
## [1] "Computer is fun. Not too fun."        
## [2] "No it's not, it's dumb."              
## [3] "I am telling the truth!"              
## [4] "There is no way."                     
## [5] "I distrust you."                      
## [6] "Shall we move on?  Good then."        
## [7] "I'm hungry.  Let's eat.  You already?"

boolean_search(DATA$state, " I &&.")

## The following elements meet the criteria:
## [1] 8 

Exclusion as Well

boolean_search(DATA$state, " I ||.", values=TRUE)

## The following elements meet the criteria:
## [1] "Computer is fun. Not too fun."        
## [2] "No it's not, it's dumb."              
## [3] "I am telling the truth!"              
## [4] "There is no way."                     
## [5] "I distrust you."                      
## [6] "Shall we move on?  Good then."        
## [7] "I'm hungry.  Let's eat.  You already?"

boolean_search(DATA$state, " I ||.", exclude = c("way", "truth"), values=TRUE)

## The following elements meet the criteria:
## [1] "Computer is fun. Not too fun."        
## [2] "No it's not, it's dumb."              
## [3] "I distrust you."                      
## [4] "Shall we move on?  Good then."        
## [5] "I'm hungry.  Let's eat.  You already?"  

Binary Operator Version

 dat <- data.frame(x = c("Doggy", "Hello", "Hi Dog", "Zebra"), y = 1:4)

##        x y
## 1  Doggy 1
## 2  Hello 2
## 3 Hi Dog 3
## 4  Zebra 4

z <- data.frame(z =c("Hello", "Dog"))

##       z
## 1 Hello
## 2   Dog

dat[dat$x %bs% paste(z$z, collapse = "OR"), ]  

##        x y
## 1  Doggy 1
## 2  Hello 2
## 3 Hi Dog 3

Polarity (Sentiment)

The polarity function is an extension of the work originally done by Jeffrey Breen, with some accompanying plotting methods. For more information see the Introduction to qdap Vignette.

 poldat2 <- with(mraja1spl, polarity(dialogue,
    list(sex, fam.aff, died)))
colsplit2df(scores(poldat2))[, 1:7] 
    sex fam.aff  died total.sentences total.words ave.polarity sd.polarity
1     f     cap FALSE             158        1810  0.076422846   0.2620359
2     f     cap  TRUE              24         221  0.042477906   0.2087159
3     f    mont  TRUE               4          29  0.079056942   0.3979112
4     m     cap FALSE              73         717  0.026496626   0.2558656
5     m     cap  TRUE              17         185 -0.159815603   0.3133931
6     m   escal FALSE               9         195 -0.152764808   0.3131176
7     m   escal  TRUE              27         646 -0.069421082   0.2556493
8     m    mont FALSE              70         952 -0.043809741   0.3837170
9     m    mont  TRUE             114        1273 -0.003653114   0.4090405
10    m    none FALSE               7          78  0.062243180   0.1067989
11 none    none FALSE               5          18 -0.281649658   0.4387579

The Accompanying Plotting Methods

plot(poldat2)

 plot(scores(poldat2))   

Question Type

 dat <- c("Kate's got no appetite doesn't she?",
    "Wanna tell Daddy what you did today?",
    "You helped getting out a book?", "umm hum?",
    "Do you know what it is?", "What do you want?",
    "Who's there?", "Whose?", "Why do you want it?",
    "Want some?", "Where did it go?", "Was it fun?")

left_just(preprocessed(question_type(dat))[, c(2, 6)])  
   raw.text                             q.type             
1  Kate's got no appetite doesn't she?  doesnt             
2  Wanna tell Daddy what you did today? what               
3  You helped getting out a book?       implied_do/does/did
4  Umm hum?                             unknown            
5  Do you know what it is?              do                 
6  What do you want?                    what               
7  Who's there?                         who                
8  Whose?                               whose              
9  Why do you want it?                  why                
10 Want some?                           unknown            
11 Where did it go?                     where              
12 Was it fun?                          was                
 x <- question_type(DATA.SPLIT$state, DATA.SPLIT$person)

scores(x)
      person tot.quest    what    how   shall implied_do/does/did
1       greg         1       0      0       0             1(100%)
2 researcher         1       0      0 1(100%)                   0
3      sally         2  1(50%) 1(50%)       0                   0
4    teacher         1 1(100%)      0       0                   0
5        sam         0       0      0       0                   0
plot(scores(x), high="orange")

 


These are a few of the more recent developments in qdap. We would encourage readers to dig into the new vignettes and start using qdap for various Natural Language Processing tasks. If you have suggestions or find a bug you are welcome to:

  • submit suggestions and bug-reports at: https://github.com/trinker/qdap/issues
  • send a pull request on: https://github.com/trinker/qdap

  • For a complete list of changes see qdap’s NEWS.md

