One thing I commonly come across is the need to run
simulations under specific scenarios. The idea behind this is to create a
controlled data generating environment so that we can test out various methods,
approaches, and estimators. While this may initially appear straightforward, I
feel much of the complexity in "real world" data is lost by many
practitioners hoping to mimic their situations in this controlled environment.
Take for example a scenario that I've been recently working with.
Scenario
We have a randomized controlled trial, in which cancer
patients are randomized to either placebo or active treatment. These patients
are followed regularly until the occurrence of some end event (e.g. death,
disease progression, etc) to see how the active treatment compares with the
placebo group. In the ideal scientific scenario, both treatment groups would
stay on their assigned treatment during their entire follow-up.
In practice, however, many of the patients in
the placebo group decline in health. As is commonly drawn up in study
protocols, these patients are then allowed to cross over to active treatment.
Of interest to the relevant scientists is how to draw inference in the face of
this informative crossover design.
Data
Our first task should be to clearly define our data and how
we believe it to generated. From the scenario I provided above, it should
become apparent that we have time-dependent confounding present. This is
because we have a treatment that will affect future values of health, which in
turn affects the probability of crossing over to the active treatment group.
Let’s try to be more formal with this. We define for times t=1,2,…,$\tau$ where $\tau$ is some final time point of interest:
L(t) = a time-varying continuous measure of health. Examples
include cell counts, tumor sizes, etc.
A(t) = an indicator of treatment group (0=control, 1=treatment)
Y(t) = an indicator of end event occurrence (0=alive, 1=failure)
A(t) = an indicator of treatment group (0=control, 1=treatment)
Y(t) = an indicator of end event occurrence (0=alive, 1=failure)
We can employ direct acyclic graphs (DAGs) to
help in demonstrating the effect that we believe each covariate to have on one
another.
As shown in the figure, I believe that (among other things) the L(t) directly affects the Y(t) at each time point, as does A(t). Furthermore, I believe that A(t) affects L(t+1). With this in mind, we can continue on towards defining the statistical distribution that is consistent with these effects.
Distributions
Once we confirm how we believe the data to be generated, we
can continue on to simulating this data under a known distribution. Here is
where I feel many practitioners oversimplify. A common approach I see is to
draw samples from a known parametric distribution. Some examples include the
exponential, weibull, or gamma distributions.
While this parametric approach is easy to do, I find
it hard to endorse its use. Real data is messy. It is complex. And it is non-parametric.
Instead, I believe it better to sample data from a nonparametric distribution.
This, of course, begs the question of which distribution to use. And what we
should always consider when presented with questions like this is the scenario
in which we are applying our work. As my scenario here is with cancer patients,
a non-parametric survival or hazard curve from patients with cancer would be
ideal. A quick online search turns up the US SEER cancer registry in which I
can use towards my research. Some subsequent data cleaning and analyses then
yield the following one year hazard rates for patients with Hodgkin’s Lymphoma:
T_months | SURVIVAL |
0 | 1 |
12 | 0.956377 |
24 | 0.926385 |
36 | 0.900916 |
48 | 0.880696 |
60 | 0.860925 |
72 | 0.841388 |
84 | 0.825341 |
96 | 0.810255 |
108 | 0.797335 |
120 | 0.784967 |
132 | 0.772869 |
144 | 0.759607 |
156 | 0.747690 |
168 | 0.734079 |
180 | 0.723146 |
192 | 0.711468 |
204 | 0.699804 |
216 | 0.686888 |
228 | 0.675939 |
240 | 0.663627 |
252 | 0.650037 |
Using these hazard rates, we can now create discrete survival curves for our baseline population. This approach has the advantage that it assumes no specific functional form for the hazards. I should note, however, that there is a bias-variance trade off here in terms of the size of the intervals we use. Smaller intervals in the estimation of the hazard rates can result in less bias in the estimates at the cost of higher variance. Good judgement should be exercised in determining the appropriate interval length.
With the baseline distribution established, all that remains is to impose causal relationships between the covariates, as indicated by the arrows in the DAG above. I've included code below that imposes this relationship under fixed parameter values. For ease of presentation, the relationships I've imposed are parametric. Don't be afraid though to make them as complex as needed to adequately represent the scenario that you are working under.
"Stay hungry. Stay foolish." --Steve Jobs
############################################## ## Code to generate data as specified above ## ############################################## n=500 time = 6 lambda = c(0.956377, 0.926385, 0.900916, 0.880696, 0.860925, 0.841388, 0.825341, 0.810255, 0.797335, 0.784967, 0.772869, 0.759607, 0.747690, 0.734079, 0.723146, 0.711468, 0.699804, 0.686888, 0.675939 ,0.663627 ,0.650037) var.names = NULL for(i in 1:length(time)) { var.names = c(var.names, paste(c("L","A","Y"), i, sep=".")) } LT = matrix(nrow=n, ncol=length(var.names)+1, dimnames=list(NULL, c(var.names, "delta"))) Leff = 0.2 atrisk = rep(TRUE,n) LT[atrisk,"L.1"] = rnorm(n) LT[atrisk,"A.1"] = rbinom(n, 1, prob=0.5) lambda.tx = lambda*1.5 lambdaOBS = t(apply(LT[,"A.1", drop=FALSE], 1, function(x) { if(x==0) return(lambda) else return(lambda.tx) })) colnames(lambdaOBS) = paste("lambda", c(0:(length(lambda)-1)), sep=".") lambdaOBS[,1] = plogis(qlogis(lambdaOBS[,1]) + Leff*LT[atrisk,"L.1"]) LT[,"Y.1"] = ifelse(atrisk, rbinom(n, 1, prob=1-exp(-lambdaOBS[,1])), NA) if(max(time)>1) { for(i in 2:max(time)) { ## AT RISK INDICATOR ## atrisk = atrisk & LT[,paste("Y.",i-1, sep="")]==0 ## TIME-DEPENDENT CONFOUNDER ibeta = 0.25 Lbeta = 0.25 e = rnorm(n) LT[,paste("L.",i, sep="")] = ifelse(atrisk, ibeta + Lbeta*LT[,paste("L.",i-1, sep="")] - 0.5*LT[,paste("A.",i-1, sep="")] + e, NA) ## TREATMENT ## if(crossover>0) { prob = plogis(qlogis(crossover) + 0.7*LT[atrisk,paste("L.",i, sep="")]) LT[atrisk, paste("A.",i, sep="")] = ifelse(LT[atrisk, paste("A.",i-1, sep="")]==1, 1, rbinom(rep(1,sum(atrisk)), 1, prob=prob)) } else { LT[atrisk, paste("A.",i, sep="")] = LT[atrisk, "A.1"] } ## HAZARD ## lambdaOBS[atrisk,i] = plogis(qlogis(lambdaOBS[atrisk,i]) + Leff*LT[atrisk,paste("L.",i, sep="")]) ## DEATH INDICATOR ## LT[, paste("Y.",i, sep="")] = suppressWarnings(ifelse(atrisk, rbinom(rep(1,n), 1, prob=1-exp(-lambdaOBS[,i])), NA)) } } ## SURVIVAL TIMES ## tmp = suppressWarnings(apply(LT[,paste("Y.", 1:max(time), sep=""), drop=FALSE], 1, function(x) max(x, na.rm=TRUE))) LT[, "delta"] = ifelse(tmp==-Inf, 0, tmp) T = apply(LT[, paste("Y.", 1:max(time), sep=""), drop=FALSE], 1, function(x) sum(1-x, na.rm=TRUE)+1) data = data.frame(LT, T) ## TRANSPOSES ## data = reshape(data, varying=c(1:(ncol(data)-2)), direction="long") data = subset(data, !(is.na(L) & is.na(A) & is.na(Y))) data = data[order(data$id, data$time),] rownames(data) = NULL