Well, I bought a new computer a month back (i7, 8 GB of memory). Finally more than one core and a chance to try parallelization. I saw this blog post a while back and was intrigued, and was further intrigued when I saw that plyr/reshape2 have some parallelization capabilities (LINK). Let me say up front that this is my first experience with parallel code, so there may be better ways, but it sped up my code by over four times.
Let me warn you now: when I first read A No BS Guide to the Basics of Parallelization in R, I tried to see how many cores I had on my computer (this shows my ignorance, which may be of comfort to some of you; others will stop reading this blog post immediately). 1 is the loneliest number, especially if you’re attempting to run on multiple cores.
Suggestion: if you type detectCores() and see 1, you can’t run code in parallel, at least not by running it on different cores of your machine.
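For example, a quick sanity check looks like this (the parallel package ships with base R but still has to be loaded before detectCores() is available):

```r
library(parallel)  # part of base R, but must be loaded explicitly
detectCores()      # if this prints 1, running on multiple cores is out
```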
Background (skip this if you are short on time)
I’m working on a package (qdap) that has a function (pos) that takes a long time to run. It basically finds parts of speech by sentence (each sentence is a cell, and there are thousands of them). I rely on openNLP for the POS tagging, but the whole process is time consuming. I figured this was the perfect time to try out parallelization.
I skimmed the Task View for parallel computing, knew I was out of my league, and decided to just focus on my problem rather than the whole parallelization concept. Back at wrathematics’ blog post I discovered my silly Windows machine was not compatible with mclapply, but saw hope with clusterApply(). Reading ?clusterApply, I saw that parLapply was described as a parallel version of lapply. I like lapply and decided that was what I’d go with.
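As a quick taste, here’s a toy comparison (a made-up example, not qdap code): the two functions are called the same way apart from the cluster argument in front.

```r
library(parallel)
cl <- makeCluster(2)  # a small two-worker cluster

lapply(1:4, sqrt)         # the familiar serial version
parLapply(cl, 1:4, sqrt)  # the parallel version: same call, cluster first

stopCluster(cl)
```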
Working with parallel coding in functions (skip to here)
These are the three major problems/differences I encountered with parLapply, compared to lapply, inside a function:
- You need to pass/export the functions and variables you’ll need inside parLapply using makeCluster and clusterExport. See Andy Garcia’s helpful response to my question about this (LINK)
- You have to specify the envir argument of clusterExport as envir=environment(). See GSee’s helpful response to my question about this (LINK)
- You have to explicitly stop the cluster when you’re finished using it, much like closing a connection you opened. You stop the cluster using the stopCluster function (see the stopCluster(cl) call near the end of the code below).
EDIT: Martin Morgan of stackoverflow.com gives a solution that addresses both the first and second problems: he suggests passing all objects directly to parLapply (LINK). A minimal sketch of his approach appears below.
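In this sketch the toy text and the pos function ride along as extra arguments through parLapply’s `...`, so no clusterExport() (and no envir fiddling) is needed; the data here is just a placeholder, not the qdap code:

```r
library(parallel)

cl <- makeCluster(2)

pos <- function(i) {
    paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
}

text.var <- rep("I wish I ran in parallel.", 4)

# text.var and pos travel with the call itself; nothing to export
x <- parLapply(cl, seq_along(text.var), function(i, text.var, pos) {
    pos(text.var[i])
}, text.var = text.var, pos = pos)

stopCluster(cl)
```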
Below is an example of taking a non-parallel function and making it run in parallel:
```r
library(parallel)
detectCores()  # make sure you have > 1 core

nonpar.test <- function(text.var, gc.rate=10){
    ntv <- length(text.var)
    require(parallel)
    pos <- function(i) {
        paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
    }
    x <- lapply(seq_len(ntv), function(i) {
        x <- pos(text.var[i])
        if (i %% gc.rate == 0) gc()
        return(x)
    })
    return(x)
}

nonpar.test(rep("I wish I ran in parallel.", 20))

par.test <- function(text.var, gc.rate=10){
    ntv <- length(text.var)
    require(parallel)
    pos <- function(i) {
        paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
    }
    #======================================
    cl <- makeCluster(mc <- getOption("cl.cores", 4))
    clusterExport(cl=cl, varlist=c("text.var", "ntv", "gc.rate", "pos"),
        envir=environment())
    x <- parLapply(cl, seq_len(ntv), function(i) {
    #======================================
        x <- pos(text.var[i])
        if (i %% gc.rate == 0) gc()
        return(x)
    })
    stopCluster(cl)  # stop the cluster
    return(x)
}

par.test(rep("I wish I ran in parallel.", 20))
```
Notice that the only changes are between the #====== markers (making the cluster, exporting the needed objects, and swapping lapply for parLapply) plus the stopCluster(cl) call. Once you get that down, working with parLapply is pretty easy.
Note:
It doesn’t always make sense to run in parallel, as it takes time to make the cluster. In pos I added parallel as an argument, because for smaller text vectors running in parallel doesn’t make sense (it’s slower). The sketch below shows the idea.
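Here is a minimal sketch of that pattern (not the actual qdap code; `pos.demo` and the `min.len` cutoff are made up for illustration):

```r
library(parallel)

# Only pay the cluster start-up cost when the input is long enough
# to be worth it (min.len is an arbitrary cutoff for illustration)
pos.demo <- function(text.var, parallel = FALSE, min.len = 500) {
    pos <- function(i) {
        paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
    }
    if (!parallel || length(text.var) < min.len) {
        return(lapply(text.var, pos))  # serial fallback for short inputs
    }
    cl <- makeCluster(getOption("cl.cores", 4))
    on.exit(stopCluster(cl))  # cluster is stopped even if an error occurs
    parLapply(cl, text.var, pos)
}
```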
Wonderings and future direction:
The pos function I have in qdap uses a progress bar. So far I couldn’t make a progress bar work with parLapply, but it’s less of a need because the parallel version is so much faster.
Benchmarking (1 run)
```r
> system.time(pos(rajSPLIT$dialogue, parallel=T))
   user  system elapsed
   2.35    0.08  199.53

> system.time(pos(rajSPLIT$dialogue, progress.bar=F))
   user  system elapsed
 816.61   16.74  833.47
```
This is benchmarked using rajSPLIT$dialogue, the text from Romeo and Juliet, a data set in qdap. It consists of 2,151 rows and 23,943 words.
Hopefully this blog post is useful to those learning some parallelization. Check out the Task View, the documentation for the parallel package, and the vignette for the parallel package.
If you have suggestions for improvement, links, or help on getting a progress bar with parLapply please leave a comment.
I find parallel R frustrating … It is a lot of work, and while the speed gain is visible, using Rcpp and inline would definitely be the fastest option.
I haven’t played with Rcpp yet; it’s on my to-do list, but it sounded scary to me. My first experience with `parLapply` was that it wasn’t too much different from `lapply`. In my particular circumstances (I rely on openNLP’s `tagPOS` function, which is written in Java) I don’t think I could have used Rcpp, or the speed gain would have been minimal. That being said, I’ve heard great things about Rcpp, and for many jobs this may be the way to go. You’ve inspired me to learn a bit more about Rcpp.
“Suggestion: if you type detectCores() and see 1, you can’t run code in parallel, at least not by running it on different cores of your machine.”
Hmmm, this doesn’t work on my Windows 7 machine running Revolution R Community version 6.0 (64-bit). Instead try:
`Sys.getenv('NUMBER_OF_PROCESSORS')`
Courtesy of Gavin Simpson on SO: http://stackoverflow.com/questions/6389334/detect-the-number-of-cores-on-windows
@JNFoo, Thanks for your feedback. This was a mistake on my part. While the parallel package is part of a base install, you have to explicitly load it first with `library(parallel)`. I’ve corrected this in the code above as well.
Have you seen any issues with the tagging functions when running on very long string vectors? I was parallelizing the tagging of a 7000-sentence vector in almost exactly the way you describe, but midway through, openNLP crashed with a “too many connections” error. It appears that somewhere gzcon() was leaving file connections open. Didn’t know if you’d seen this or not.
Mark, did you use garbage collection every so many iterations? Not sure why, but calling gc() every 10 sentences or so helps with this. I think it may have something to do with Java.
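Something along these lines is what I mean (a sketch; `sents` stands in for your sentence vector, and `tagPOS` is openNLP’s tagger mentioned above):

```r
# Force a garbage collection every 10th sentence so stale Java/file
# connections get released as the loop runs
tagged <- lapply(seq_along(sents), function(i) {
    out <- tagPOS(sents[i])  # openNLP's tagger
    if (i %% 10 == 0) gc()
    out
})
```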
Hello,
I have a question: I’m interested in integrating a progress bar into parLapply. Could you describe how to do that?
Thanks for the post.
As far as I know this is not possible.
Thanks for the note on `envir` in `clusterExport()`, and thanks for linking back here from SO.