Thursday, May 28, 2015

On data generation ...


One thing I commonly come across is the need to run simulations under specific scenarios. The idea is to create a controlled data-generating environment in which we can test out various methods, approaches, and estimators. While this may initially appear straightforward, I feel that much of the complexity of "real world" data is lost when practitioners try to mimic their situations in this controlled environment. Take, for example, a scenario that I've recently been working with.

Scenario

We have a randomized controlled trial, in which cancer patients are randomized to either placebo or active treatment. These patients are followed regularly until the occurrence of some end event (e.g. death, disease progression, etc) to see how the active treatment compares with the placebo group. In the ideal scientific scenario, both treatment groups would stay on their assigned treatment during their entire follow-up.

In practice, however, many of the patients in the placebo group decline in health. As is commonly drawn up in study protocols, these patients are then allowed to cross over to active treatment. Of interest to the relevant scientists is how to draw inference in the face of this informative crossover design. 

Data

Our first task should be to clearly define our data and how we believe it to be generated. From the scenario I described above, it should be apparent that we have time-dependent confounding present: the treatment affects future values of health, which in turn affect the probability of crossing over to the active treatment group. Let's try to be more formal about this. For times $t = 1, 2, \ldots, \tau$, where $\tau$ is some final time point of interest, we define:

L(t) = a time-varying continuous measure of health. Examples include cell counts, tumor sizes, etc.
A(t) = an indicator of treatment group (0=control, 1=treatment)
Y(t) = an indicator of end event occurrence (0=alive, 1=failure)
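
To make the indexing concrete, here is a minimal sketch of the wide-format layout assumed by the simulation code at the end of the post: one row per patient and one column per variable-time pair.

## For illustration, tau = 3 time points: each patient contributes
## columns L.t, A.t, Y.t for t = 1, 2, 3.
tau = 3
var.names = as.vector(sapply(1:tau, function(t) paste(c("L", "A", "Y"), t, sep = ".")))
var.names
# [1] "L.1" "A.1" "Y.1" "L.2" "A.2" "Y.2" "L.3" "A.3" "Y.3"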

We can employ directed acyclic graphs (DAGs) to help illustrate the effects we believe the variables have on one another.


As shown in the figure, I believe that (among other things) L(t) directly affects Y(t) at each time point, as does A(t). Furthermore, I believe that A(t) affects L(t+1). With this in mind, we can move on to defining a statistical distribution that is consistent with these effects.
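
For readers who cannot see the figure, here is a minimal sketch of the arrows described above, written as a plain edge list in R (the L(t) -> A(t) arrow reflects the crossover mechanism from the scenario):

## Edges of the assumed DAG, written as "parent -> child";
## t indexes time, and the last edge links adjacent time points.
dag.edges = rbind(
  c("L(t)", "A(t)"),    # declining health drives crossover onto active treatment
  c("L(t)", "Y(t)"),    # current health affects the event probability
  c("A(t)", "Y(t)"),    # current treatment affects the event probability
  c("A(t)", "L(t+1)")   # treatment affects future health
)
colnames(dag.edges) = c("parent", "child")
dag.edges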

Distributions

Once we have settled on how we believe the data to be generated, we can move on to simulating it from a known distribution. Here is where I feel many practitioners oversimplify. A common approach I see is to draw samples from a known parametric distribution; examples include the exponential, Weibull, and gamma distributions.
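
For example, the parametric route is typically a one-liner; the parameter values here are purely illustrative:

## Draw 500 event times from a Weibull distribution (illustrative parameters)
T.param = rweibull(500, shape = 1.2, scale = 10)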

While this parametric approach is easy to do, I find it hard to endorse its use. Real data is messy. It is complex. And it is non-parametric. Instead, I believe it better to sample data from a nonparametric distribution. This, of course, raises the question of which distribution to use. What we should always consider when presented with questions like this is the scenario in which we are applying our work. As my scenario here involves cancer patients, a nonparametric survival or hazard curve estimated from cancer patients would be ideal. A quick online search turns up the US SEER cancer registry, which I can use for my research. Some subsequent data cleaning and analyses then yield the following one-year hazard rates for patients with Hodgkin's Lymphoma (a short sketch of how such a curve can be estimated follows the table):

T (months)   Survival
     0       1.000000
    12       0.956377
    24       0.926385
    36       0.900916
    48       0.880696
    60       0.860925
    72       0.841388
    84       0.825341
    96       0.810255
   108       0.797335
   120       0.784967
   132       0.772869
   144       0.759607
   156       0.747690
   168       0.734079
   180       0.723146
   192       0.711468
   204       0.699804
   216       0.686888
   228       0.675939
   240       0.663627
   252       0.650037
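
For reference, a curve like the one tabulated above can be estimated with the survival package without assuming any functional form. This is only a sketch: the seer data frame below is a stand-in for the cleaned registry extract, which is not shown here.

library(survival)

## Stand-in for the cleaned registry extract: one row per patient, with
## follow-up time in months and an event indicator (1 = event, 0 = censored).
seer = data.frame(time   = rexp(1000, rate = 1/150),
                  status = rbinom(1000, 1, 0.7))

## Kaplan-Meier estimate of the survival curve
fit = survfit(Surv(time, status) ~ 1, data = seer)

## survival at yearly (12-month) intervals, analogous to the table above
summary(fit, times = seq(0, 252, by = 12))
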
Using these hazard rates, we can now create discrete survival curves for our baseline population (a short sketch of this conversion follows). This approach has the advantage that it assumes no specific functional form for the hazards. I should note, however, that there is a bias-variance trade-off here in terms of the size of the intervals we use: smaller intervals in the estimation of the hazard rates can result in less bias in the estimates, at the cost of higher variance. Good judgement should be exercised in determining the appropriate interval length.
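
As a minimal sketch of that conversion: if h_j denotes the discrete hazard in interval j (the conditional probability of failing in interval j given survival to its start), the baseline survival curve is just the cumulative product of the per-interval survival probabilities.

## generic conversion from discrete interval hazards to a survival curve
## (h is a vector of per-interval conditional failure probabilities)
discrete.survival = function(h) cumprod(1 - h)

## illustrative example with three intervals
discrete.survival(c(0.04, 0.03, 0.03))
# [1] 0.960000 0.931200 0.903264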

With the baseline distribution established, all that remains is to impose causal relationships between the covariates, as indicated by the arrows in the DAG above. I've included code below that imposes these relationships under fixed parameter values. For ease of presentation, the relationships I've imposed are parametric. Don't be afraid, though, to make them as complex as needed to adequately represent the scenario you are working under.

"Stay hungry. Stay foolish." --Steve Jobs

##############################################
## Code to generate data as specified above ##
##############################################

## PARAMETERS ##
set.seed(1)       # for reproducibility
n = 500           # number of subjects
time = 6          # number of (yearly) time points to simulate
crossover = 0.25  # baseline crossover probability (illustrative value; adjust as needed)
## baseline (control-arm) rates, taken from the table above
lambda = c(0.956377, 0.926385, 0.900916, 0.880696, 0.860925, 0.841388, 0.825341, 0.810255, 0.797335, 0.784967, 0.772869, 0.759607, 0.747690, 0.734079, 0.723146, 0.711468, 0.699804, 0.686888, 0.675939, 0.663627, 0.650037)

## wide-format matrix with columns L.t, A.t, Y.t for t = 1, ..., time
var.names = NULL
for(i in 1:max(time)) {
 var.names = c(var.names, paste(c("L","A","Y"), i, sep="."))
}
LT = matrix(nrow=n, ncol=length(var.names)+1, dimnames=list(NULL, c(var.names, "delta")))

## BASELINE (t = 1): health measure, randomized treatment, event indicator ##
Leff = 0.2                                 # effect of L(t) on the hazard (logit scale)
atrisk = rep(TRUE, n)
LT[atrisk,"L.1"] = rnorm(n)                # baseline health measure
LT[atrisk,"A.1"] = rbinom(n, 1, prob=0.5)  # 1:1 randomization to active treatment
## treated-arm rates: the 1.5-fold treatment effect is applied on the odds (logit)
## scale so that the values remain valid probabilities for qlogis() below
lambda.tx = plogis(qlogis(lambda) + log(1.5))
## per-subject hazard trajectory, determined by the randomized arm
lambdaOBS = t(apply(LT[,"A.1", drop=FALSE], 1, function(x) {
     if(x==0) return(lambda)
     else return(lambda.tx)
    }))
colnames(lambdaOBS) = paste("lambda", c(0:(length(lambda)-1)), sep=".")
## baseline health shifts the time-1 hazard on the logit scale
lambdaOBS[,1] = plogis(qlogis(lambdaOBS[,1]) + Leff*LT[atrisk,"L.1"])
## event indicator at time 1
LT[,"Y.1"] = ifelse(atrisk, rbinom(n, 1, prob=1-exp(-lambdaOBS[,1])), NA)

if(max(time)>1) {
 for(i in 2:max(time)) {
  ## AT RISK INDICATOR ##
  atrisk = atrisk & LT[,paste("Y.",i-1, sep="")]==0
  
  ## TIME-DEPENDENT CONFOUNDER ##
  ibeta = 0.25   # intercept
  Lbeta = 0.25   # carry-over effect of the previous health measure L(t-1)
  e = rnorm(n)   # random noise
  ## L(t) depends on L(t-1) and on the previous treatment A(t-1)
  LT[,paste("L.",i, sep="")] = ifelse(atrisk, ibeta + Lbeta*LT[,paste("L.",i-1, sep="")] - 0.5*LT[,paste("A.",i-1, sep="")] + e, NA)
  
  ## TREATMENT ##
  if(crossover>0) {
   ## probability of crossing over depends on current health L(t);
   ## once a subject is on active treatment, they stay on it
   prob = plogis(qlogis(crossover) + 0.7*LT[atrisk,paste("L.",i, sep="")])
   LT[atrisk, paste("A.",i, sep="")] = ifelse(LT[atrisk, paste("A.",i-1, sep="")]==1, 1, rbinom(sum(atrisk), 1, prob=prob))
  } else {
   ## no crossover: everyone keeps their randomized assignment
   LT[atrisk, paste("A.",i, sep="")] = LT[atrisk, "A.1"]
  }
  
  ## HAZARD ##
  ## current health L(t) shifts the hazard at time t on the logit scale
  lambdaOBS[atrisk,i] = plogis(qlogis(lambdaOBS[atrisk,i]) + Leff*LT[atrisk,paste("L.",i, sep="")])
  
  ## DEATH INDICATOR ##
  ## subjects no longer at risk get NA (warnings from NA probabilities are suppressed)
  LT[, paste("Y.",i, sep="")] = suppressWarnings(ifelse(atrisk, rbinom(n, 1, prob=1-exp(-lambdaOBS[,i])), NA))
 }
}

## SURVIVAL TIMES ##
tmp = suppressWarnings(apply(LT[,paste("Y.", 1:max(time), sep=""), drop=FALSE], 1, function(x) max(x, na.rm=TRUE)))
LT[, "delta"] = ifelse(tmp==-Inf, 0, tmp)   # 1 = event observed during follow-up, 0 = censored
T = apply(LT[, paste("Y.", 1:max(time), sep=""), drop=FALSE], 1, function(x) sum(1-x, na.rm=TRUE)+1)   # interval of the event (max(time)+1 if none)
data = data.frame(LT, T)

## RESHAPE TO LONG FORMAT (one row per subject-interval) ##
data = reshape(data, varying=c(1:(ncol(data)-2)), direction="long")
data = subset(data, !(is.na(L) & is.na(A) & is.na(Y)))   # drop intervals after the event
data = data[order(data$id, data$time),]
rownames(data) = NULL
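
To get a feel for the output, here is a quick look at the generated long-format data. The pooled logistic fit is only an illustrative sanity check on the simulated relationships, not a recommended analysis for the crossover problem.

head(data)                  # one row per subject-interval: id, time, L, A, Y, delta, T
with(data, table(time, A))  # crossover shows up as treated counts growing over time

## crude pooled (discrete-time) logistic regression of the event on current treatment and health
summary(glm(Y ~ A + L, family = binomial, data = data))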