Spell Checker for R…qdap::check_spelling

I have often had requests for a spell checker for R character vectors. The utils::aspell function can be used to check spelling, but many Windows users have reported difficulty with it.

I came across an article on spelling in R entitled “Watch Your Spelling!” by Kurt Hornik and Duncan Murdoch. The paper walks us through definitions of spell checking, its history, and a suggested spell checker implementation for R. A terrific read. Hornik & Murdoch (2011) end with the following call:

Clearly, more work will be needed: modern statistics needs better lexical resources, and a dictionary based on the most frequent spell check false alarms can only be a start. We hope that this article will foster community interest in contributing to the development of such resources, and that refined domain specific dictionaries can be made available and used for improved text analysis with R in the near future (p. 28).

I answered a question on stackoverflow.com a few months back that led to a suite of spell checking functions. The original functions used an agrep approach that was slow and inaccurate. I then discovered Mark van der Loo’s terrific stringdist package to do the heavy lifting. It calculates string distances very quickly using a variety of methods.
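
To give a feel for what stringdist offers, here is a minimal sketch, assuming the stringdist package is installed from CRAN. The toy dictionary and the test words are invented for illustration; this is not how qdap calls the package internally.

```r
library(stringdist)  # CRAN package by Mark van der Loo

# Distance between one misspelling and one candidate under a few methods
stringdist("evl", "evil", method = "lv")   # Levenshtein
stringdist("evl", "evil", method = "osa")  # optimal string alignment
stringdist("evl", "evil", method = "jw")   # Jaro distance (Jaro-Winkler when p > 0)

# Toy dictionary (hypothetical; a real word list would be far larger)
toy_dict <- c("robots", "are", "evil", "creatures")

# amatch() returns the index of the closest entry within maxDist, or NA
idx <- amatch(c("evl", "creatres", "zzzzzz"), toy_dict, maxDist = 2)
toy_dict[idx]
```

The maxDist cutoff is the main design choice here: a small value keeps nonsense strings from being matched to an unrelated word, at the cost of missing badly mangled spellings.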

The rest of this blog post is meant as a minimal introduction to qdap’s spell checking functions. A video will lead you through most of the process and accompanying scripts are provided.

Primitive Spell Checking Function

The which_misspelled function is a low-level function that determines whether each word of a single string is in a dictionary. It optionally gives suggested corrections.

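library(qdap)
x <- "Robots are evl creatres and deserv exterimanitation."
# Flag the words that are not found in the dictionary
which_misspelled(x, suggest=FALSE)
# Same check, this time with suggested replacements
which_misspelled(x, suggest=TRUE)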
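
The idea behind a function like this (look each word up in a dictionary, then suggest the nearest dictionary entry) can be sketched in base R. This is only an illustration under loud assumptions: toy_dict and toy_misspelled are hypothetical names, the dictionary is tiny, and utils::adist stands in for the faster stringdist routines. It is not qdap's implementation.

```r
# Hypothetical toy dictionary; a real word list is far larger
toy_dict <- c("robots", "are", "evil", "creatures", "and", "deserve",
              "extermination")

# Hypothetical helper, not part of qdap
toy_misspelled <- function(x, dict = toy_dict) {
  # Lower-case, drop punctuation, split the single string into words
  words <- strsplit(gsub("[^a-z' ]", "", tolower(x)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  bad <- which(!words %in% dict)
  if (length(bad) == 0) return(NULL)
  # Edit distance from each unknown word to every dictionary entry
  d <- utils::adist(words[bad], dict)
  data.frame(
    position = bad,
    word = words[bad],
    suggestion = dict[apply(d, 1, which.min)],  # nearest entry, first on ties
    stringsAsFactors = FALSE
  )
}

res <- toy_misspelled("Robots are evl creatres and deserv exterimanitation.")
res
```

Because the sketch always returns the nearest entry, it will happily "correct" a proper noun to some unrelated word, which is one reason an interactive checker is usually the better choice.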

Interactive Spell Checking

Typically a user will want to use the interactive spell checker (check_spelling_interactive) as it is more flexible and accurate.

dat <- DATA$state
dat[1] <- "Jasperita I likedd the cokie icekream"
dat
##  [1] "Jasperita I likedd the cokie icekream"
##  [2] "No it's not, it's dumb."              
##  [3] "What should we do?"                   
##  [4] "You liar, it stinks!"                 
##  [5] "I am telling the truth!"              
##  [6] "How can we be certain?"               
##  [7] "There is no way."                     
##  [8] "I distrust you."                      
##  [9] "What are you talking about?"          
## [10] "Shall we move on?  Good then."        
## [11] "I'm hungry.  Let's eat.  You already?"
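# Step through each flagged word interactively; the outer parentheses print the result
(o <- check_spelling_interactive(dat))
preprocessed(o)
# The result carries a correction function as an attribute
fixit <- attributes(o)$correct
# Apply the chosen corrections to the text
fixit(dat)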

A More Realistic Usage

# Same workflow on the first 75 lines of dialogue from qdap's mraja1spl data set
m <- check_spelling_interactive(mraja1spl$dialogue[1:75])
preprocessed(m)
fixit <- attributes(m)$correct
fixit(mraja1spl$dialogue[1:75])

References

Hornik, K., & Murdoch, D. (2011). Watch Your Spelling! The R Journal, 3(2), 22-28.

About tylerrinker

I am a Literacy PhD student with a bent for the quantitative and a passion for R.

8 Responses to Spell Checker for R…qdap::check_spelling

  1. ayudub says:

    Got stuck; found your post pretty helpful, thanks 🙂

  2. Thanks, I’m sure this will come in handy.

    But maybe you can help me out? Do you know how to enable spell checking of comments in RStudio? I’m used to working with Emacs in many other languages, where I would just select the comment and do a M-x ispell-region or M-x ispell-buffer. I think the latter was bound to a much shorter keystroke, but it’s been a while and I can’t remember. Now I could switch to Emacs for R, but I really don’t want to because I’m otherwise extremely happy with RStudio.

    Googling my problem didn’t help, I just got a whole lot of articles about using R for spell checking, as opposed to spelling R, if you catch my drift…

  3. Rob says:

    Really helpful!

  4. Diego Gaona says:

    Hi Tyler!
    Qdap is a very nice package! Any chance of using another dictionary (in Portuguese) for check_text or check_spelling_interactive? I’m trying to use this to clean Facebook comments and analyse the sentiment.

    • tylerrinker says:

      @Diego thank you for your feedback. Currently all my focus is on data science work @ Campus Labs and finishing up my dissertation. That leaves little time for features, so the answer is no for the immediate future. Since this post the hunspell package has been released for R. It’s awesome and I believe it can handle many languages. Additionally, it’s easy to set up. I recommend you have a look at that resource.

  5. Kavya says:

    Hi, this was very useful. I am doing the same for my sentiment analysis project, but I guess this one runs only on a vector/list of words. What if I want to run a spell checker on my complete dataframe/corpus? Please suggest. Thanks

    • Atul Sehgal says:

      Hi Kavya,

      To do spell check on complete corpus, please do the following:

      1) Create a document term matrix (DTM) of the corpus
      2) Take the header of the DTM and convert it into a vector. Now you have all the words in your corpus as a vector
      3) Run the qdap spell check on that vector and store the suggested words in another vector
      4) Replace the vector of incorrect words with the vector of correct words in your corpus (there are find and replace functions available for tm_map).

      Thanks
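
The four steps above can be sketched in base R. This is a rough illustration only: the corpus and dictionary are invented, unique tokens stand in for the DTM header, and utils::adist stands in for the qdap spell checker, so none of the names below come from tm or qdap.

```r
# Toy corpus and dictionary, both invented for illustration
docs <- c("I likedd the cokie", "the cokie was good", "I liked it")
dict <- c("i", "liked", "the", "cookie", "was", "good", "it")

# Steps 1-2: collect the unique vocabulary (what the DTM header would give)
tokens <- strsplit(tolower(docs), "[^a-z']+")
vocab <- unique(unlist(tokens))
vocab <- vocab[nzchar(vocab)]

# Step 3: check each distinct word once; nearest dictionary entry by edit distance
# (assumes at least one unknown word, as in this toy corpus)
unknown <- setdiff(vocab, dict)
fixes <- dict[apply(utils::adist(unknown, dict), 1, which.min)]
names(fixes) <- unknown

# Step 4: substitute the corrected words back into every document
corrected <- vapply(tokens, function(w) {
  w <- w[nzchar(w)]
  hit <- w %in% names(fixes)
  w[hit] <- fixes[w[hit]]
  paste(w, collapse = " ")
}, character(1))
corrected
```

Checking the vocabulary once, rather than every document separately, is the point of the approach: each distinct misspelling is looked up a single time no matter how often it occurs. Note that this sketch lower-cases the text and drops punctuation along the way.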
