Thursday, August 6, 2015

On the Rank Preserving Structural Failure Time Model (RPSFTM)

Lately, I've had the opportunity to work with data from a randomized controlled trial. It's interesting: despite all the effort taken to ensure that the randomization of treatment assignment is valid, this is still the real world and things happen. When these imperfections occur, it's important to understand exactly what you're doing when you conduct statistical inference. Let me explain.

Scenario

We have a randomized controlled trial in which cancer patients are randomized to either placebo or active treatment. These patients are followed regularly until the occurrence of some end event (e.g., death, disease progression) to see how the active treatment compares with placebo. In the ideal scientific scenario, both treatment groups would stay on their assigned treatment during their entire follow-up.

In practice, however, many of the patients in the placebo group decline in health. For example, we could be interested in overall survival, but we notice that many patients are having disease progressions. As is commonly drawn up in study protocols, these patients are then allowed to cross over to active treatment. Of interest to the relevant scientists is how to draw inference in the face of this informative crossover design. 

One well-known approach to accounting for these crossovers when modeling survival is the rank preserving structural failure time model (RPSFTM).


RPSFTM

Before I can really go into this, I first need to review the idea of "counterfactuals." In reality, a person is normally exposed to only one treatment regime, and their outcome is observed under that treatment. For example, in the RCT you either randomize them to placebo or active treatment and then follow them to observe their outcome. Now, what if, "contrary to fact", you could observe the same person's outcome under both treatment assignments? We can think of it as duplicating the person into two copies, assigning each copy to a separate treatment arm, and following both to observe their outcomes. This is the idea of counterfactuals. Rather than just accepting the outcome under the realized treatment assignment, you also try to model the counterfactual outcomes under the other potential treatment assignments.

The RPSFTM makes use of this counterfactual concept to model the outcome under no treatment. It helps to first think of this in the simplified scenario with no crossovers. The model tries to estimate the survival time of every subject off treatment. Usually in an RCT, half the patients are randomized to placebo. For these subjects there is nothing "contrary to fact": their observed survival already is the survival off treatment.

For the patients on active treatment, we model their survival time off treatment with an accelerated failure time model, making use of the randomization that happened at the beginning of the study. That is, we assume that had these subjects not received active treatment, their survival would have been equivalent to that of the subjects off treatment. In practice, we take the survival curve from the subjects on treatment, fit the acceleration factor ($e^{\psi}$), and test how close the resulting counterfactual survival curve is to that of the placebo group. The figure below demonstrates this.
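To be concrete, in the usual formulation of the model (my notation here), each subject's observed survival time is split into the time spent off and on active treatment, and the counterfactual survival time off treatment is written as

$$U_i = T_i^{\text{off}} + e^{\psi}\, T_i^{\text{on}},$$

where $T_i^{\text{off}}$ and $T_i^{\text{on}}$ are the times subject $i$ spent off and on active treatment, and $e^{\psi}$ is the acceleration factor. For a placebo patient who never crosses over, $T_i^{\text{on}} = 0$, so $U_i$ is simply the observed survival time.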



To test how close the curves are, we can rely on common statistics such as the log-rank test or the Wilcoxon test. The Cox proportional hazards coefficient can also be used. Our estimate of the acceleration factor is then the value of $e^{\psi}$ that minimizes the test statistic, i.e. the value that makes the two survival curves as indistinguishable as possible.
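Here is a minimal sketch of that grid search in R, assuming a data frame d with one row per subject and columns arm (the randomized arm), t.off and t.on (the observed time spent off and on active treatment), and event (the event indicator). These names are mine for illustration, and recensoring, which a complete RPSFTM analysis needs in order to handle censoring properly, is skipped for brevity:

## Grid search for psi: find the value whose counterfactual off-treatment times
## make the two randomized arms look most alike (log-rank statistic closest to 0).
## 'd', 'arm', 't.off', 't.on', and 'event' are assumed names; recensoring is omitted.
library(survival)

psi.grid = seq(-2, 2, by=0.01)
stats = sapply(psi.grid, function(psi) {
 u = d$t.off + exp(psi)*d$t.on                # counterfactual time off treatment
 survdiff(Surv(u, d$event) ~ d$arm)$chisq     # log-rank statistic comparing the arms
})
psi.hat = psi.grid[which.min(stats)]          # psi making the arms most indistinguishable
exp(psi.hat)                                  # estimated acceleration factor

A test-based confidence interval can then be obtained by collecting the values of $\psi$ for which the corresponding test is not rejected at the 5% level.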

I've put together an R package implementing this model, available at www.github.com/tranlm/rpsftm. For anyone interested, you can install it directly with the following command in R:

devtools::install_github("tranlm/rpsftm")

As always: Stay hungry. Stay foolish.

Thursday, May 28, 2015

On data generation ...


One thing I commonly come across is the need to run simulations under specific scenarios. The idea is to create a controlled data generating environment in which we can test out various methods, approaches, and estimators. While this may initially appear straightforward, I feel that much of the complexity of "real world" data is lost when practitioners try to mimic their situations in this controlled environment. Take, for example, a scenario that I've recently been working with.

Scenario

We have a randomized controlled trial in which cancer patients are randomized to either placebo or active treatment. These patients are followed regularly until the occurrence of some end event (e.g., death, disease progression) to see how the active treatment compares with placebo. In the ideal scientific scenario, both treatment groups would stay on their assigned treatment during their entire follow-up.

In practice, however, many of the patients in the placebo group decline in health. As is commonly drawn up in study protocols, these patients are then allowed to cross over to active treatment. Of interest to the relevant scientists is how to draw inference in the face of this informative crossover design. 

Data

Our first task should be to clearly define our data and how we believe it to be generated. From the scenario above, it should be apparent that we have time-dependent confounding: treatment affects future values of health, which in turn affect the probability of crossing over to the active treatment group. Let's try to be more formal. For times $t=1,2,\ldots,\tau$, where $\tau$ is some final time point of interest, we define:

L(t) = a time-varying continuous measure of health. Examples include cell counts, tumor sizes, etc.
A(t) = an indicator of treatment group (0=control, 1=treatment)
Y(t) = an indicator of end event occurrence (0=alive, 1=failure)

We can employ directed acyclic graphs (DAGs) to help demonstrate the effects that we believe each covariate has on the others.


As shown in the figure, I believe that (among other things) L(t) directly affects Y(t) at each time point, as does A(t). Furthermore, I believe that A(t) affects L(t+1). With this in mind, we can continue on to defining a statistical distribution that is consistent with these effects.
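As a small illustration, here is one way to write down a two-time-point version of such a DAG in R. The dagitty package and the node names are my own choices here, and I've included an L(t) -> A(t) arrow to reflect that declining health drives crossover:

## Two-time-point DAG, written with the dagitty package (package choice,
## node names, and edge set are my reading of the scenario, for illustration only)
library(dagitty)

g = dagitty("dag {
 L1 -> Y1
 A1 -> Y1
 L1 -> L2
 A1 -> L2
 L2 -> A2
 A1 -> A2
 L2 -> Y2
 A2 -> Y2
}")
plot(graphLayout(g))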

Distributions

Once we have settled on how we believe the data to be generated, we can continue on to simulating the data from a known distribution. Here is where I feel many practitioners oversimplify. A common approach is to draw samples from a known parametric distribution; common examples include the exponential, Weibull, and gamma distributions.
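For instance, the parametric version of this exercise takes only a line or two of R (the particular rate and shape values below are arbitrary, purely for illustration):

## The common parametric shortcut: draw survival times directly from a named
## distribution (parameter values are arbitrary and purely illustrative)
n = 500
t.exp = rexp(n, rate=0.01)                    # exponential survival times (months)
t.weib = rweibull(n, shape=1.2, scale=80)     # Weibull survival times (months)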

While this parametric approach is easy, I find it hard to endorse. Real data is messy. It is complex. And it is non-parametric. Instead, I believe it is better to sample data from a nonparametric distribution. This, of course, raises the question of which distribution to use, and what we should always consider when facing questions like this is the scenario in which we are applying our work. As my scenario here involves cancer patients, a non-parametric survival or hazard curve estimated from cancer patients would be ideal. A quick online search turns up the US SEER cancer registry, which I can use for my research. Some subsequent data cleaning and analysis then yields the following survival estimates at yearly intervals for patients with Hodgkin's Lymphoma:

T_months SURVIVAL
0 1
12 0.956377
24 0.926385
36 0.900916
48 0.880696
60 0.860925
72 0.841388
84 0.825341
96 0.810255
108 0.797335
120 0.784967
132 0.772869
144 0.759607
156 0.747690
168 0.734079
180 0.723146
192 0.711468
204 0.699804
216 0.686888
228 0.675939
240 0.663627
252 0.650037

Using the discrete hazard rates implied by this survival curve, we can now generate discrete survival times for our baseline population. This approach has the advantage of assuming no specific functional form for the hazards. I should note, however, that there is a bias-variance trade-off in the size of the intervals we use: smaller intervals in the estimation of the hazard rates can give less bias in the estimates, at the cost of higher variance. Good judgement should be exercised in determining the appropriate interval length.
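As a minimal sketch of that step, assuming the table above is stored in a data frame named seer with columns T_months and SURVIVAL (names of my choosing), the per-interval hazards can be computed from consecutive survival values and then used to draw event times at 12-month resolution:

## Derive per-interval discrete hazards from the survival column and draw event
## times at 12-month resolution ('seer' and its column names are assumed here)
surv = seer$SURVIVAL
haz = 1 - surv[-1]/surv[-length(surv)]        # P(event in interval k | alive at its start)
draw.time = function() {
 for(k in seq_along(haz)) {
  if(rbinom(1, 1, haz[k]) == 1) return(seer$T_months[k+1])   # event during interval k
 }
 return(NA)                                   # survived past the last interval
}
t.event = replicate(500, draw.time())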

With the baseline distribution established, all that remains is to impose causal relationships between the covariates, as indicated by the arrows in the DAG above. I've included code below that imposes this relationship under fixed parameter values. For ease of presentation, the relationships I've imposed are parametric. Don't be afraid though to make them as complex as needed to adequately represent the scenario that you are working under. 

"Stay hungry. Stay foolish." --Steve Jobs

##############################################
## Code to generate data as specified above ##
##############################################

n = 500      # number of subjects
time = 6     # number of discrete time points to simulate
## Per-interval hazard parameters used in the simulation (values taken from the table above)
lambda = c(0.956377, 0.926385, 0.900916, 0.880696, 0.860925, 0.841388, 0.825341, 0.810255, 0.797335, 0.784967, 0.772869, 0.759607, 0.747690, 0.734079, 0.723146, 0.711468, 0.699804, 0.686888, 0.675939, 0.663627, 0.650037)

## Column names L.i, A.i, Y.i for each time point
var.names = NULL
for(i in 1:max(time)) {
 var.names = c(var.names, paste(c("L","A","Y"), i, sep="."))
}
LT = matrix(nrow=n, ncol=length(var.names)+1, dimnames=list(NULL, c(var.names, "delta")))

Leff = 0.2                                    # effect of L(t) on the hazard (log-odds scale)
crossover = 0.05                              # baseline per-interval crossover probability (assumed value; set to 0 for no crossover)
atrisk = rep(TRUE, n)
LT[atrisk,"L.1"] = rnorm(n)                   # baseline health measure
LT[atrisk,"A.1"] = rbinom(n, 1, prob=0.5)     # randomized treatment assignment
## Treatment effect on the hazard, applied on the log-odds scale so the values
## stay in (0,1) (a direct lambda*1.5 would exceed 1 and break qlogis below)
lambda.tx = plogis(qlogis(lambda) + log(1.5))
## Each subject's hazard trajectory depends on the arm they were randomized to
lambdaOBS = t(apply(LT[,"A.1", drop=FALSE], 1, function(x) {
     if(x==0) return(lambda)
     else return(lambda.tx)
    }))
colnames(lambdaOBS) = paste("lambda", c(0:(length(lambda)-1)), sep=".")
## Shift the first-interval hazard by the baseline health measure, then draw Y.1
lambdaOBS[,1] = plogis(qlogis(lambdaOBS[,1]) + Leff*LT[atrisk,"L.1"])
LT[,"Y.1"] = ifelse(atrisk, rbinom(n, 1, prob=1-exp(-lambdaOBS[,1])), NA)

if(max(time)>1) {
 for(i in 2:max(time)) {
  ## AT RISK INDICATOR ##
  atrisk = atrisk & LT[,paste("Y.",i-1, sep="")]==0
  
  ## TIME-DEPENDENT CONFOUNDER
  ibeta = 0.25
  Lbeta = 0.25
  e = rnorm(n)
  LT[,paste("L.",i, sep="")] = ifelse(atrisk, ibeta + Lbeta*LT[,paste("L.",i-1, sep="")] - 0.5*LT[,paste("A.",i-1, sep="")] + e, NA)
  
  ## TREATMENT ##
  if(crossover>0) {
   prob = plogis(qlogis(crossover) + 0.7*LT[atrisk,paste("L.",i, sep="")])
   LT[atrisk, paste("A.",i, sep="")] = ifelse(LT[atrisk, paste("A.",i-1, sep="")]==1, 1, rbinom(rep(1,sum(atrisk)), 1, prob=prob))
  } else {
   LT[atrisk, paste("A.",i, sep="")] = LT[atrisk, "A.1"]
  }
  
  ## HAZARD ##
  lambdaOBS[atrisk,i] = plogis(qlogis(lambdaOBS[atrisk,i]) + Leff*LT[atrisk,paste("L.",i, sep="")])
  
  ## DEATH INDICATOR ##
  LT[, paste("Y.",i, sep="")] = suppressWarnings(ifelse(atrisk, rbinom(rep(1,n), 1, prob=1-exp(-lambdaOBS[,i])), NA))
 }
}

## SURVIVAL TIMES ##
tmp = suppressWarnings(apply(LT[,paste("Y.", 1:max(time), sep=""), drop=FALSE], 1, function(x) max(x, na.rm=TRUE)))
LT[, "delta"] = ifelse(tmp==-Inf, 0, tmp)
T = apply(LT[, paste("Y.", 1:max(time), sep=""), drop=FALSE], 1, function(x) sum(1-x, na.rm=TRUE)+1)
data = data.frame(LT, T)

## TRANSPOSES ##
data = reshape(data, varying=c(1:(ncol(data)-2)), direction="long")
data = subset(data, !(is.na(L) & is.na(A) & is.na(Y)))
data = data[order(data$id, data$time),]
rownames(data) = NULL