I often need data sets for trying out functions and code, or for teaching. I have a few stand-bys, such as the mtcars and CO2 data sets that ship with R, but sometimes I need something more specific: a long format data set, a bunch of categorical or numeric variables, repeated measures, or missing values to test a function against. I then spend valuable time searching for the right data set. About a year ago my answer was to keep a file of several data sets that I knew fit various situations. Eventually I grew tired of loading a data set each time I needed to test something, so I wrote a function that randomly generates a data set with categorical, numeric, interval, and repeated measures data. I recently extended it to offer optional missing values, long or wide format, and proportion data, and attempted to give it some speed boosts for creating larger data sets. It generally suits my needs, and I think it can probably serve others too.
The main function, DFgen, relies on two helper functions, props and NAins. I do not place these helper functions inside of DFgen itself as they have useful properties in and of themselves. I’ll briefly explain each function, provide the code, and give a few tests to try it out.
The props Function
The props function generates a data frame of proportions whose rows sum to 1. It takes two required arguments, ncol and nrow (in that order), which set the dimensions of the data frame, plus an optional var.names argument that names the columns; otherwise they are named X1..Xn. One caveat: the function is a poor choice when you need many columns, because each proportion is drawn from whatever is left over after the earlier columns, so the later columns tend toward zero. For a slower props function that handles numerous columns better, Dason of talkstats.com provides an alternative (LINK).
#############################################################
# function to generate random proportions whose rowSums = 1 #
#############################################################
props <- function(ncol, nrow, var.names=NULL){
    if (ncol < 2) stop("ncol must be greater than 1")
    p <- function(n){
        y <- 0
        z <- sapply(seq_len(n-1), function(i) {
                x <- sample(seq(0, 1-y, by=.01), 1)
                y <<- y + x
                return(x)
            }
        )
        w <- c(z , 1-sum(z))
        return(w)
    }
    DF <- data.frame(t(replicate(nrow, p(n=ncol))))
    if (!is.null(var.names)) colnames(DF) <- var.names
    return(DF)
}
##############
# TRY IT OUT #
##############
props(ncol=5, nrow=5)
props(ncol=3, nrow=25)
props(ncol=3, nrow=5, var.names=c("red", "blue", "green"))
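For the many-columns case, one possible approach is to draw uniform values and rescale each row so that it sums to 1. The sketch below is not Dason's alternative (that code is not reproduced in this post); props_norm is a hypothetical name, and simple normalization is only an assumption about what a better-behaved version might look like.

```r
# Hypothetical alternative to props(): draw uniform values, then divide each
# row by its own sum so that every row adds up to 1.
props_norm <- function(ncol, nrow, var.names = NULL) {
    if (ncol < 2) stop("ncol must be greater than 1")
    m <- matrix(runif(nrow * ncol), nrow = nrow, ncol = ncol)
    # a matrix divided by a vector of length nrow recycles down the columns,
    # so element [i, j] is divided by the sum of row i
    DF <- data.frame(m / rowSums(m))
    if (!is.null(var.names)) colnames(DF) <- var.names
    DF
}

set.seed(1)
p <- props_norm(ncol = 20, nrow = 5)
rowSums(p)   # each row sums to 1, up to floating point error
```

Because every column is drawn the same way, the mass is spread across all columns instead of being concentrated in the first few.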
The NAins Function
The NAins function takes a data frame and randomly inserts a given proportion of missing (NA) values. The function has two arguments: df, the data frame, and prop, the proportion of cells to set to NA (default is .1).
Special thanks again to Dason of talkstats.com for helping with a speed boost for this function. NAins accounts for a considerable share of DFgen's run time, and he provided the code that really sped it up.
################################################################
# RANDOMLY INSERT A CERTAIN PROPORTION OF NAs INTO A DATAFRAME #
################################################################
NAins <- NAinsert <- function(df, prop = .1){
    n <- nrow(df)
    m <- ncol(df)
    num.to.na <- ceiling(prop*n*m)
    # draw cell ids without replacement, then convert each to a row/column pair
    id <- sample(0:(m*n-1), num.to.na, replace = FALSE)
    rows <- id %/% m + 1
    cols <- id %% m + 1
    # seq_len() keeps the loop empty when num.to.na is 0
    sapply(seq_len(num.to.na), function(x){
            df[rows[x], cols[x]] <<- NA
        }
    )
    return(df)
}
##############
# TRY IT OUT #
##############
NAins(mtcars, .1)
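The same idea can also be written without a loop by flagging cells in a logical matrix and assigning NA to all of them at once. This is a sketch only, not the code Dason contributed; na_insert_mask is a hypothetical name. It also shows a quick way to check how many cells were actually blanked out.

```r
# Hypothetical loop-free variant of NAins(): mark the cells to blank out in a
# logical matrix with the same shape as the data frame, then assign NA once.
na_insert_mask <- function(df, prop = 0.1) {
    mask <- matrix(FALSE, nrow = nrow(df), ncol = ncol(df))
    n.na <- ceiling(prop * length(mask))
    mask[sample(length(mask), n.na)] <- TRUE
    df[mask] <- NA
    df
}

set.seed(42)
out <- na_insert_mask(mtcars, 0.1)
sum(is.na(out))    # 36, i.e. ceiling(0.1 * 32 * 11)
mean(is.na(out))   # realized proportion of NA cells
```

Because the count is rounded up with ceiling(), the realized proportion can be slightly above the requested one.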
The DFgen Function
The DFgen function randomly generates a data set with n rows and a predefined set of variables. With no arguments, DFgen() produces an n=10 data set such as the following:
> set.seed(10)
> DFgen()
id group hs.grad race gender age m.status political n.kids income score time1 time2 time3
1 ID.1 treat yes white male 19 never republican 1 111000 -1.24 51.39 52.15 53.76
2 ID.2 control yes black male 30 divorced independent 0 122000 -0.46 32.21 35.07 33.10
3 ID.3 control yes white male 32 married republican 1 2000 -0.83 43.36 45.46 46.22
4 ID.4 treat no white male 30 divorced republican 1 65000 0.34 71.63 72.06 74.49
5 ID.5 control yes white female 18 married republican 3 96000 1.07 9.26 12.24 11.02
6 ID.6 treat yes asian female 30 married independent 3 135000 1.22 24.10 26.45 24.74
7 ID.7 treat yes white female 26 never democrat 5 16000 0.74 28.76 31.72 31.39
8 ID.8 treat yes white male 40 married republican 1 113000 -0.48 28.24 29.10 37.12
9 ID.9 treat yes white male 23 married independent 2 80000 0.56 62.99 65.09 67.72
10 ID.10 treat no asian male 22 married democrat 1 96000 -1.25 43.74 46.79 44.04
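The function also takes these optional arguments:
- type, either “wide” (the default) or “long”
- na.rate, a decimal value between 0 and 1 (default is 0) that randomly inserts missing data (great for teaching demos and testing corner cases)
- proportion, either TRUE or FALSE (the default), which adds food, housing, and other columns of proportions that sum to 1 (it can be abbreviated to prop when calling DFgen)
- digits, which controls the number of decimal places that numeric columns are rounded to (default is 2)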
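To make the wide versus long distinction concrete, here is a small sketch on a hand-made data frame. The toy data and its values are invented for illustration; the reshape() call mirrors the style of call that DFgen uses, stacking the three time columns into a single value column.

```r
# Toy wide data: one row per subject, one column per time point
toy <- data.frame(id = c("ID.1", "ID.2"),
                  time1 = c(10, 20),
                  time2 = c(11, 21),
                  time3 = c(12, 22))

# Stack time1..time3 into a single 'value' column, labelled by 'time'
toy.long <- reshape(toy, direction = "long", idvar = "id",
                    varying = c("time1", "time2", "time3"),
                    v.names = "value", timevar = "time",
                    times = c("time1", "time2", "time3"))
rownames(toy.long) <- NULL
toy.long   # 6 rows: 2 subjects x 3 time points
```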
############################################################
# GENERATE A RANDOM DATA SET. CAN BE SET TO LONG OR WIDE. #
# DATA SET HAS FACTORS AND NUMERIC VARIABLES AND CAN #
# OPTIONALLY GIVE BUDGET EXPENDITURES AS A PROPORTION. #
# CAN ALSO TELL A PROPORTION OF CELLS TO BE MISSING VALUES #
############################################################
# NOTE RELIES ON THE props FUNCTION AND THE NAins FUNCTION #
############################################################
DFgen <- DFmaker <- function(n=10, type=wide, digits=2,
    proportion=FALSE, na.rate=0) {
    # helper: reset row names to 1..n
    rownamer <- function(dataframe){
        x <- as.data.frame(dataframe)
        rownames(x) <- NULL
        return(x)
    }
    # helper: round every numeric column
    dfround <- function(dataframe, digits = 0){
        df <- dataframe
        df[,sapply(df, is.numeric)] <-round(df[,sapply(df, is.numeric)], digits)
        return(df)
    }
    # capture type unevaluated so that both wide and "wide" work
    TYPE <- as.character(substitute(type))
    time1 <- sample(1:100, n, replace = TRUE) + abs(rnorm(n))
    DF <- data.frame(id = paste0("ID.", 1:n),
        group= sample(c("control", "treat"), n, replace = TRUE),
        hs.grad = sample(c("yes", "no"), n, replace = TRUE),
        race = sample(c("black", "white", "asian"), n,
            replace = TRUE, prob=c(.25, .5, .25)),
        gender = sample(c("male", "female"), n, replace = TRUE),
        age = sample(18:40, n, replace = TRUE),
        m.status = sample(c("never", "married", "divorced", "widowed"),
            n, replace = TRUE, prob=c(.25, .4, .3, .05)),
        political = sample(c("democrat", "republican",
            "independent", "other"), n, replace= TRUE,
            prob=c(.35, .35, .20, .1)),
        n.kids = rpois(n, 1.5),
        income = sample(c(seq(0, 30000, by=1000),
            seq(0, 150000, by=1000)), n, replace=TRUE),
        score = rnorm(n),
        time1,
        time2 = c(time1 + 2 * abs(rnorm(n))),
        time3 = c(time1 + (4 * abs(rnorm(n)))))
    # optionally splice three proportion columns in after income
    if (proportion) {
        DF <- cbind (DF[, 1:10],
            props(ncol=3, nrow=n, var.names=c("food",
                "housing", "other")),
            DF[, 11:14])
    }
    # optionally blank out cells everywhere except the id column
    if (na.rate!=0) {
        DF <- cbind(DF[, 1, drop=FALSE], NAins(DF[, -1],
            prop=na.rate))
    }
    DF <- switch(TYPE,
        wide = DF,
        long = {DF <- reshape(DF, direction = "long", idvar = "id",
                varying = c("time1","time2", "time3"),
                v.names = c("value"),
                timevar = "time", times = c("time1", "time2", "time3"))
            rownamer(DF)},
        stop("Invalid Data \"type\""))
    return(dfround(DF, digits=digits))
}
##############
# TRY IT OUT #
##############
DFgen()
DFgen(type="long")
DFmaker(20000)
DFgen(proportion=TRUE)
DFgen(na.rate=.3)
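DFgen accepts type either quoted or unquoted because it captures the argument with substitute() before it is evaluated. The sketch below isolates that idiom in a hypothetical helper, get_type, which is not part of the code above.

```r
# Hypothetical helper isolating the idiom on the TYPE line of DFgen:
# substitute() returns the unevaluated argument, so a bare name and a
# quoted string both end up as the same character value.
get_type <- function(type = wide) {
    as.character(substitute(type))
}

get_type()         # "wide"
get_type(long)     # "long"
get_type("long")   # "long"
```

One consequence of this design is that a variable holding the string is not looked up: after fmt <- "long", calling get_type(fmt) returns "fmt" rather than "long", which in DFgen would fall through to the invalid type error.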
Tyler, these seem like pretty interesting and useful functions. Some aesthetic suggestions though: I find the hash sign boxes a bit overpowering and the ALL CAPS DESCRIPTIONS are not easily readable. For my functions, while I’m working on them, I generally follow a style where the function description, arguments, some sample data, and any credits come as comments immediately after the first line of the function, but before the actual function body. See an example here.
Also, I’m sure you’ve already explored Knitr, but if you haven’t you should consider doing so. I use it with RStudio and write my documentation in markdown so that the documentation can easily be posted to github and converted to a nicely formatted PDF, while retaining the ability to run and test my code while I’m writing the documentation. You can see some example source and output files at my docs page on github. The Rmd file is the source file, the html and md are automatically created by Knitr through RStudio, and the PDF is converted from the markdown file using the Pandoc converter.
Tyler, your DFgen function actually isn’t working for me. It throws an error looking for a “rownamer” function and a “dfround” function. Am I missing something or did you forget to include some other functions that this function depends on?
@mrdwab you’re totally correct. Thanks for the feedback. They’re really just niceties that I have in my .Rprofile (I know, done in by adding functions to my .Rprofile and then sharing code that depends on them). rownamer in particular is unnecessary, but I got tired of typing rownames(df) <- NULL, not that rownamer is less typing. I put the missing functions into the code. Try it now and tell me what you think.
@tylerrinker, works fine now. Nice function and great idea. I could see that it could be useful in teaching situations. The only thing that I would consider adding is at least one more series of repeated measures since that is very common in datasets, at least the ones I encounter.