Function to Generate a Random Data Set

I often need data sets to try functions and code out on, or for teaching purposes.  I have a few stand-bys, such as the mtcars and CO2 data sets in the base packages of R, but sometimes I need a long format data set, or a bunch of categorical or numeric variables, or repeated measures, or missing values to test a function, and I spend valuable time searching for the right data set.  About a year ago my answer was to keep a file with several data sets I knew could fit various situations.  Eventually I grew tired of loading a data set each time I needed to test something, so I wrote a function that randomly generates a data set with categorical, numeric, interval, and repeated measures data.  I recently extended the function to offer optional missing values, long or wide format, and proportion data, and attempted to give it some speed boosts for creating larger data sets.  It generally suits my needs and I think it can probably serve others too.

The main function, DFgen, relies on two helper functions, props and NAins.  I do not place these helper functions inside of DFgen itself as they have useful properties in and of themselves.  I’ll briefly explain each function, provide the code, and give a few tests to try it out.

The props Function

The props function generates a data frame of proportions whose rows sum to 1.  It takes two required arguments, ncol and nrow, which give the dimensions of the data frame, and an optional var.names argument that names the columns; otherwise they are named X1..Xn.  One note on this function: it is a poorer choice for many columns, because each proportion is drawn from whatever the earlier columns in the row left over, so later columns tend toward 0.  For a props function that is slower but better for numerous columns, Dason of talkstats.com provides an alternative (LINK).

#############################################################
# function to generate random proportions whose rowSums = 1 #
#############################################################
props <- function(ncol, nrow, var.names=NULL){
    if (ncol < 2) stop("ncol must be greater than 1")
    p <- function(n){
        y <- 0
        z <- sapply(seq_len(n-1), function(i) {
                x <- sample(seq(0, 1-y, by=.01), 1)
                y <<- y + x
                return(x)
            }
        )
        w <- c(z , 1-sum(z))
        return(w)
    }
    DF <- data.frame(t(replicate(nrow, p(n=ncol))))
    if (!is.null(var.names)) colnames(DF) <- var.names
    return(DF)
}
##############
# TRY IT OUT #
##############
props(ncol=5, nrow=5)                                      
props(ncol=3, nrow=25)                                     
props(ncol=3, nrow=5, var.names=c("red", "blue", "green"))
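
If the way props splits what is left of each row is a problem for wide data frames, one common alternative is to draw every cell independently and then rescale each row.  The sketch below is only my illustration of that idea: props_norm is a hypothetical name, it is not Dason's code, and unlike props it does not restrict values to steps of .01.

```r
# Hypothetical alternative to props (not the author's or Dason's code):
# draw uniform values, then rescale each row so it sums to 1.
props_norm <- function(ncol, nrow, var.names = NULL) {
    if (ncol < 2) stop("ncol must be greater than 1")
    m <- matrix(runif(ncol * nrow), nrow = nrow, ncol = ncol)
    # rowSums(m) has one entry per row and is recycled down each column,
    # so every cell is divided by the total of its own row
    DF <- data.frame(m / rowSums(m))
    if (!is.null(var.names)) colnames(DF) <- var.names
    return(DF)
}

set.seed(1)
p <- props_norm(ncol = 10, nrow = 1000)
all.equal(unname(rowSums(p)), rep(1, 1000))  # rows sum to 1 up to floating-point error
round(colMeans(p), 2)                        # column means should all be close to 1/ncol
```

Because no column depends on the ones before it, the column means come out roughly equal, at the cost of proportions that are no longer tidy two-decimal values.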

The NAins Function

The NAins function takes a data frame and randomly inserts a certain proportion of missing (NA) values.  The function has two arguments: df, which is the data frame, and prop, which is the proportion of NA values to be inserted into the data frame (default is .1).

Special thanks again to Dason of talkstats.com for helping speed up this function.  This function consumes considerable time in DFgen, and he provided the code to really gain some speed.

################################################################
# RANDOMLY INSERT A CERTAIN PROPORTION OF NAs INTO A DATAFRAME #
################################################################
NAins <-  NAinsert <- function(df, prop = .1){
    n <- nrow(df)
    m <- ncol(df)
    num.to.na <- ceiling(prop*n*m)
    id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
    rows <- id %/% m + 1
    cols <- id %% m + 1
    sapply(seq_len(num.to.na), function(x){   # seq_len() is safe when prop = 0
            df[rows[x], cols[x]] <<- NA
        }
    )
    return(df)
}
##############
# TRY IT OUT #
##############
NAins(mtcars, .1)
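
The index arithmetic in NAins is easier to see on a small grid.  This is an illustration only, with made-up dimensions (n = 3 and m = 4); it shows how each zero-based cell id maps to exactly one row and column, which is why sampling ids without replacement never blanks the same cell twice.

```r
# Illustration only: how NAins turns a zero-based cell id into a row and column.
n <- 3                       # rows (made-up size)
m <- 4                       # columns (made-up size)
id <- 0:(n * m - 1)          # one id per cell
rows <- id %/% m + 1         # integer division picks the row
cols <- id %% m + 1          # remainder picks the column
cells <- cbind(id, rows, cols)
cells

# Every (row, column) pair appears exactly once.
nrow(unique(cells[, c("rows", "cols")])) == n * m

# Number of cells NAins sets to NA in mtcars (32 rows, 11 columns) at prop = .1
ceiling(.1 * 32 * 11)
```

Note that ceiling rounds the count up, so the realized share of NA cells can be slightly above prop.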

The DFgen Function

The DFgen function randomly generates a data set of n rows with predefined variables.  The default DFgen() with no arguments specified produces an n=10 data set such as the following:

> set.seed(10)
> DFgen()
      id   group hs.grad  race gender age m.status   political n.kids income score time1 time2 time3
1   ID.1   treat     yes white   male  19    never  republican      1 111000 -1.24 51.39 52.15 53.76
2   ID.2 control     yes black   male  30 divorced independent      0 122000 -0.46 32.21 35.07 33.10
3   ID.3 control     yes white   male  32  married  republican      1   2000 -0.83 43.36 45.46 46.22
4   ID.4   treat      no white   male  30 divorced  republican      1  65000  0.34 71.63 72.06 74.49
5   ID.5 control     yes white female  18  married  republican      3  96000  1.07  9.26 12.24 11.02
6   ID.6   treat     yes asian female  30  married independent      3 135000  1.22 24.10 26.45 24.74
7   ID.7   treat     yes white female  26    never    democrat      5  16000  0.74 28.76 31.72 31.39
8   ID.8   treat     yes white   male  40  married  republican      1 113000 -0.48 28.24 29.10 37.12
9   ID.9   treat     yes white   male  23  married independent      2  80000  0.56 62.99 65.09 67.72
10 ID.10   treat      no asian   male  22  married    democrat      1  96000 -1.25 43.74 46.79 44.04

The function also takes optional:

  • type argument (“wide”, the default, or “long”)
  • na.rate (a decimal value between 0 and 1; default is 0) that randomly inserts missing data (great for teaching demos and testing corner cases)
  • proportion argument (TRUE or FALSE, the default; may be abbreviated to prop) that adds food, housing, and other budget expenditure columns as proportions
  • digits that controls the number of digits numeric columns are rounded to (default is 2)

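
To make the wide versus long distinction concrete, here is a small sketch of the reshape step on a two-row data frame.  The data values are made up for illustration; the reshape arguments mirror the ones DFgen passes for the long format.

```r
# Tiny made-up wide data set with the same repeated-measures columns DFgen uses.
wide <- data.frame(id = paste0("ID.", 1:2),
    time1 = c(10, 20), time2 = c(11, 21), time3 = c(12, 22))

# Stack time1..time3 into a single "value" column, keyed by id and time.
long <- reshape(wide, direction = "long", idvar = "id",
    varying = c("time1", "time2", "time3"),
    v.names = "value",
    timevar = "time", times = c("time1", "time2", "time3"))
rownames(long) <- NULL         # drop the id.time row names reshape adds
long
nrow(long) == 3 * nrow(wide)   # one row per id per time point
```

In the long format each id appears three times, once per time point, so a long DFgen data set has 3n rows.
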
############################################################
# GENERATE A RANDOM DATA SET.  CAN BE SET TO LONG OR WIDE. #
# DATA SET HAS FACTORS AND NUMERIC VARIABLES AND CAN       #
# OPTIONALLY GIVE BUDGET EXPENDITURES AS A PROPORTION.     #
# CAN ALSO TELL A PROPORTION OF CELLS TO BE MISSING VALUES #
############################################################
# NOTE RELIES ON THE props FUNCTION AND THE NAins FUNCTION #
############################################################
DFgen <- DFmaker <- function(n=10, type=wide, digits=2, 
    proportion=FALSE, na.rate=0) {

    rownamer <- function(dataframe){
        x <- as.data.frame(dataframe)
        rownames(x) <- NULL
        return(x)
    }

    dfround <- function(dataframe, digits = 0){
      df <- dataframe
      df[, sapply(df, is.numeric)] <- round(df[, sapply(df, is.numeric)], digits)
      return(df)
    }

    TYPE <- as.character(substitute(type))
    time1 <- sample(1:100, n, replace = TRUE) + abs(rnorm(n))
    DF <- data.frame(id = paste0("ID.", 1:n), 
        group= sample(c("control", "treat"), n, replace = TRUE),
        hs.grad = sample(c("yes", "no"), n, replace = TRUE), 
        race = sample(c("black", "white", "asian"), n, 
            replace = TRUE, prob=c(.25, .5, .25)), 
        gender = sample(c("male", "female"), n, replace = TRUE), 
        age = sample(18:40, n, replace = TRUE),
        m.status = sample(c("never", "married", "divorced", "widowed"), 
            n, replace = TRUE, prob=c(.25, .4, .3, .05)), 
        political = sample(c("democrat", "republican", 
            "independent", "other"), n, replace= TRUE, 
            prob=c(.35, .35, .20, .1)),
        n.kids = rpois(n, 1.5), 
        income = sample(c(seq(0, 30000, by=1000), 
            seq(0, 150000, by=1000)), n, replace=TRUE),
        score = rnorm(n), 
        time1, 
        time2 = c(time1 + 2 * abs(rnorm(n))), 
        time3 = c(time1 + (4 * abs(rnorm(n)))))
    if (proportion) {
        DF <- cbind (DF[, 1:10], 
            props(ncol=3, nrow=n, var.names=c("food", 
                "housing", "other")),
            DF[, 11:14])
    }
    if (na.rate!=0) {  
        DF <- cbind(DF[, 1, drop=FALSE], NAins(DF[, -1], 
            prop=na.rate))
    }
    DF <- switch(TYPE, 
        wide = DF, 
        long = {DF <- reshape(DF, direction = "long", idvar = "id",
                varying = c("time1","time2", "time3"),
                v.names = c("value"),
                timevar = "time", times = c("time1", "time2", "time3"))
            rownamer(DF)}, 
        stop("Invalid Data \"type\""))
    return(dfround(DF, digits=digits))
}
##############
# TRY IT OUT #
##############
DFgen()            
DFgen(type="long") 
DFmaker(20000)     
DFgen(proportion=TRUE)
DFgen(na.rate=.3)

NOTE: This function relies on paste0, which requires R 2.15 or later.  If you don’t want to update R you must include a paste0 function found in the link below.
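
For readers on an older R, a stand-in for paste0 can be as short as the sketch below.  This is my own assumption of what such a function looks like, not necessarily the definition behind the link; it only defines paste0 when R does not already provide one.

```r
# Hypothetical stand-in for paste0 on R versions older than 2.15;
# not necessarily the definition behind the author's link.
if (!exists("paste0")) {
    paste0 <- function(..., collapse = NULL) {
        paste(..., sep = "", collapse = collapse)
    }
}
paste0("ID.", 1:3)   # builds the id labels the way DFgen does
```

Run it before defining DFgen so the id column can be built.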



Click here for a .txt version of this demonstration

About tylerrinker

I am a Literacy PhD student with a bent for the quantitative and a passion for R.

8 Responses to Function to Generate a Random Data Set

  1. mrdwab says:

    Tyler, these seem like pretty interesting and useful functions. Some aesthetic suggestions though: I find the hash sign boxes a bit overpowering and the ALL CAPS DESCRIPTIONS are not easily readable. For my functions, while I’m working on them, I generally follow a style where the function description, arguments, some sample data, and any credits come as comments immediately after the first line of the function, but before the actual function. See an example here.

    Also, I’m sure you’ve already explored Knitr, but if you haven’t you should consider doing so. I use it with RStudio and write my documentation in markdown so that the documentation can easily be posted to github and converted to a nicely formatted PDF, while retaining the ability to run and test my code while I’m writing the documentation. You can see some example source and output files at my docs page on github. The Rmd file is the source file, the html and md are automatically created by Knitr through RStudio, and the PDF is converted from the markdown file using the Pandoc converter.

  2. mrdwab says:

    Tyler, your DFgen function actually isn’t working for me. It throws an error looking for a “rownamer” function and a “dfround” function. Am I missing something or did you forget to include some other functions that this function depends on?

  3. tylerrinker says:

    @mrdwab you’re totally correct. Thanks for the feedback. They’re really just niceties that I keep in my .Rprofile (I know, done in by adding functions to my .Rprofile and then sharing code that depends on them). In particular, rownamer is unnecessary, but I got tired of typing rownames(df) <- NULL, not that rownamer is less typing. I put the missing functions into the code. Try it now and tell me what you think.

    • mrdwab says:

      @tylerrinker, works fine now. Nice function and great idea. I could see that it could be useful in teaching situations. The only thing that I would consider adding is at least one more series of repeated measures since that is very common in datasets, at least the ones I encounter.

  4. Thank you, this just came in very handy on stackexchange to show a problem I was having!

    Is this in one of your packages? If not, I would recommend they be included

  5. tylerrinker says:

    Hi Stephanie. No it’s not in one of my packages. I kicked around the idea but then recently, a few members of Stackoverflow created the overflow package in a git repository. And it looks like a similar function will be included in this package. Stay tuned for their work.
