The Need for paste2 (part I)

This is Part I of a multi-part blog series on the paste2 function…

I recently wrote a new paste function that takes an unspecified list of equal length variables (columns) or multiple columns of a data frame and pastes them together. First let me thank Dason of Talk Stats for his help in the thread that led to the creation of the paste2 function. Next let me convince you of the need for a paste2 function by showing you where the original paste falls short. Then I’ll introduce you to the function and some basics of what it can do. In Part II of this paste2 blog series I’ll show you a few practical applications I’ve already encountered.

The main idea behind this function is the need to pass an unknown number of columns from a data frame or list and paste them together to generate an uber column that contains all the information of the original columns. You may say, well, I think paste already does that. Not in its home grown state it doesn’t. What’s that? Prove it? OK. Try the following:

paste(CO2[, 1:3], sep=".")                            #1
paste(CO2[, 1:3], collapse=".")                       #2
paste(CO2[,1], CO2[, 2], CO2[, 3], sep=".")           #3
paste(list(CO2[,1], CO2[, 2], CO2[, 3]), sep=".")     #4

What do you get? Well, the third use of paste is the only one that results in pasting the columns together. Why? Because we specified the columns being passed to paste, and paste is our friend. If we try a sneak attack with an index of columns, paste becomes scared and returns gobbledygook. If you try to be nice and give paste a list of columns, he again gives gobbledygook. So what’s the need for pasting an unknown number of columns (either as an indexed data frame or as a list) together? Often in functions the number of columns passed to paste can’t be specified in advance, hence our problem (I’ll show you more of those specific applications in Part II).
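To make the difference concrete, here is a small sketch that compares the length of each result with the number of rows in CO2. The names r1 through r4 and out_lengths are just labels chosen for this illustration; only the third call returns one pasted string per row.

```r
# CO2 ships with R in the datasets package: 84 rows, first three columns are factors
r1 <- paste(CO2[, 1:3], sep = ".")                          #1
r2 <- paste(CO2[, 1:3], collapse = ".")                     #2
r3 <- paste(CO2[, 1], CO2[, 2], CO2[, 3], sep = ".")        #3
r4 <- paste(list(CO2[, 1], CO2[, 2], CO2[, 3]), sep = ".")  #4

# one string per *column* for #1 and #4, a single string for #2,
# and one string per *row* only for #3
out_lengths <- c(length(r1), length(r2), length(r3), length(r4))
print(out_lengths)   # 3 1 84 3
print(nrow(CO2))     # 84
print(head(r3, 3))   # the row-wise result we actually wanted
```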

paste2 <- function(multi.columns, sep=".", handle.na=TRUE, trim=TRUE){
    # strip leading/trailing white space (gsub also coerces each column to character)
    if (trim) multi.columns <- lapply(multi.columns, function(x) {
            gsub("^\\s+|\\s+$", "", x)
        }
    )
    # a plain list (including the result of lapply above) is bound into a matrix
    if (!is.data.frame(multi.columns) && is.list(multi.columns)) {
        multi.columns <- do.call('cbind', multi.columns)
    }
    m <- if (handle.na) {
        # a row containing any NA returns NA instead of a pasted "NA" string
        apply(multi.columns, 1, function(x) {
            if (any(is.na(x))) {
                NA
            } else {
                paste(x, collapse = sep)
            }
        })
    } else {
        apply(multi.columns, 1, paste, collapse = sep)
    }
    names(m) <- NULL
    return(m)
}

Now let’s see it in action:

paste2(CO2[, 1:3], sep=".")
paste2(CO2[, 1:3], sep=":")
paste2(list(CO2[,1], CO2[, 2], CO2[, 3]))
#shoot we can paste the whole data set if we want
paste2(CO2)
paste2(mtcars)
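The handle.na and trim arguments are easier to see on a tiny example. The sketch below repeats the paste2 definition from above (only so the block runs on its own) and applies it to a made-up three-row data frame; the names toy, with_na, no_na and from_list are invented for this illustration.

```r
# paste2 repeated from the post so this block is self-contained
paste2 <- function(multi.columns, sep = ".", handle.na = TRUE, trim = TRUE) {
    if (trim) multi.columns <- lapply(multi.columns, function(x) {
        gsub("^\\s+|\\s+$", "", x)
    })
    if (!is.data.frame(multi.columns) && is.list(multi.columns)) {
        multi.columns <- do.call('cbind', multi.columns)
    }
    m <- if (handle.na) {
        apply(multi.columns, 1, function(x) {
            if (any(is.na(x))) NA else paste(x, collapse = sep)
        })
    } else {
        apply(multi.columns, 1, paste, collapse = sep)
    }
    names(m) <- NULL
    return(m)
}

# hypothetical toy data: trailing white space in row 1, a missing value in row 3
toy <- data.frame(a = c("x ", "y", NA), b = c(1, 10, 3))

with_na   <- paste2(toy)                     # rows with an NA become NA
no_na     <- paste2(toy, handle.na = FALSE)  # NA is pasted as the string "NA"
from_list <- paste2(list(toy$a, toy$b))      # a list of columns works the same way

print(with_na)    # "x.1"  "y.10" NA
print(no_na)      # "x.1"  "y.10" "NA.3"
print(from_list)  # same as with_na
```

With the default handle.na=TRUE a missing value in any column makes the whole pasted row NA, which is usually what you want if the uber column is going to be used as a key.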

In Part II we’ll explore some practical uses of this new function!

Click HERE for a link to a .txt version of paste2

About tylerrinker

Data Scientist, open-source developer, #rstats enthusiast, #dataviz geek, and #nlp buff


1 Response to The Need for paste2 (part I)

dwinsemius says:

May 6, 2012 at 2:02 pm

`paste` takes an arbitrary collection of vectors. The examples that failed occurred when you gave it lists. `CO2[ , 1:3]` is a list and it’s the same list (minus some attributes) as `list(CO2[,1], CO2[, 2], CO2[, 3])`. I do agree that the way paste responds to lists is bizarre. It first passes to `as.character`, which then passes to `deparse`. In your example those were factors in `CO2[ , 1:3]`, so the list had integer vectors and you saw the attribute-free version of `dput(CO2[ , 1:3])`. Try:
paste(list(1:10)) ## and see that you get:
[1] "1:10" # strange!