Conformal inference for counterfactuals

conformalCf computes intervals for counterfactuals or outcomes with ignorable missing values in general. It supports both split conformal inference and CV+, including weighted Jackknife+ as a special case. For each type, it supports both conformalized quantile regression (CQR) and standard conformal inference based on conditional mean regression.

conformalCf(
  X,
  Y,
  estimand = c("unconditional", "nonmissing", "missing"),
  type = c("CQR", "mean"),
  side = c("two", "above", "below"),
  quantiles = NULL,
  outfun = NULL,
  outparams = list(),
  psfun = NULL,
  psparams = list(),
  useCV = FALSE,
  trainprop = 0.75,
  nfolds = 10
)

Arguments

X: covariates.
Y: outcome vector with missing values encoded as NA. See Details.
estimand: a string that takes values in {"unconditional", "nonmissing", "missing"}. See Details.
type: a string that takes values in {"CQR", "mean"}.
side: a string that takes values in {"two", "above", "below"}. See Details.
quantiles: a scalar or a vector of length 2 depending on side. Used only when type = "CQR". See Details.
outfun: a function that models the conditional mean or quantiles, or a valid string. The default is random forest when type = "mean" and quantile random forest when type = "CQR". See Details.
outparams: a list of other parameters to be passed into outfun.
psfun: a function that models the missing mechanism (probability of missing given X), or a valid string. The default is "Boosting". See Details.
psparams: a list of other parameters to be passed into psfun.
useCV: FALSE for split conformal inference and TRUE for CV+.
trainprop: proportion of units for training outfun. The default if 75%. Used only when useCV = FALSE.
nfolds: number of folds. The default is 10. Used only when useCV = TRUE.

Value

a conformalSplit object when useCV = FALSE or a conformalCV object

Details

The outcome Y must comprise both observed values and missing values encoded as NA. The missing values are used to estimate the propensity score \(P(missing | X)\).

estimand controls the type of coverage to be guaranteed:

(Default) when estimand = "unconditional", the interval has \(P(Y \in \hat{C}(X))\ge 1 - \alpha\).
When estimand = "nonmissing", the interval has \(P(Y \in \hat{C}(X) | nonmissing) \ge 1 - \alpha\).
When estimand = "missing", the interval has \(P(Y \in \hat{C}(X) | missing) \ge 1 - \alpha\).

When side = "above", intervals are of form [-Inf, a(x)] and when side = "below" the intervals are of form [a(x), Inf].

outfun can be a valid string, including

"RF" for random forest that predicts the conditional mean, a wrapper built on randomForest package. Used when type = "mean".
"quantRF" for quantile random forest that predicts the conditional quantiles, a wrapper built on grf package. Used when type = "CQR".
"Boosting" for gradient boosting that predicts the conditional mean, a wrapper built on gbm package. Used when type = "mean".
"quantBoosting" for quantile gradient boosting that predicts the conditional quantiles, a wrapper built on gbm package. Used when type = "CQR".
"BART" for gradient boosting that predicts the conditional mean, a wrapper built on bartMachine package. Used when type = "mean".
"quantBART" for quantile gradient boosting that predicts the conditional quantiles, a wrapper built on bartMachine package. Used when type = "CQR".

or a function object whose input must include, but not limited to

Y for outcome in the training data.
X for covariates in the training data.
Xtest for covariates in the testing data.

When type = "CQR", outfun should also include an argument quantiles that is either a vector of length 2 or a scalar, depending on the argument side. The output of outfun must be a matrix with two columns giving the conditional quantile estimates when quantiles is a vector of length 2; otherwise, it must be a vector giving the conditional quantile estimate or conditional mean estimate. Other optional arguments can be passed into outfun through outparams.

psfun can be a valid string, including

"RF" for random forest that predicts the propensity score, a wrapper built on randomForest package. Used when type = "mean".
"Boosting" for gradient boosting that predicts the propensity score, a wrapper built on gbm package. Used when type = "mean".

or a function object whose input must include, but not limited to

Y for treatment assignment, a binary vector, in the training data.
X for covariates in the training data.
Xtest for covariates in the testing data.

The output of psfun must be a vector of predicted probabilities. Other optional arguments can be passed into psfun through psparams.

Examples

# Generate data from a linear model
set.seed(1)
n <- 1000
d <- 5
X <- matrix(rnorm(n * d), nrow = n)
beta <- rep(1, 5)
Y <- X %*% beta + rnorm(n)

# Generate missing indicators
missing_prob <- pnorm(X[, 1])
if_missing <- missing_prob < runif(n)
Y[if_missing] <- NA

# Generate testing data
ntest <- 5
Xtest <- matrix(rnorm(ntest * d), nrow = ntest)

# Run weighted split CQR
obj <- conformalCf(X, Y, type = "CQR", quantiles = c(0.05, 0.95),
                   outfun = "quantRF", useCV = FALSE)
predict(obj, Xtest, alpha = 0.1)
#>       lower    upper
#> 1 -2.694169 2.928830
#> 2 -2.644680 2.680240
#> 3 -4.768709 1.537865
#> 4 -5.136779 3.604356
#> 5 -3.036140 2.662865

# Run weighted standard conformal inference
obj <- conformalCf(X, Y, type = "mean",
                   outfun = "RF", useCV = FALSE)
predict(obj, Xtest, alpha = 0.1)
#>       lower      upper
#> 1 -2.554873  2.0402082
#> 2 -2.904058  1.7288508
#> 3 -5.072992 -0.4400834
#> 4 -4.330321  1.2443807
#> 5 -2.483102  2.1119798

# Run one-sided weighted split CQR
obj1 <- conformalCf(X, Y, type = "CQR", side = "above",
                    quantiles = 0.95, outfun = "quantRF", useCV = FALSE)
predict(obj1, Xtest, alpha = 0.1)
#>   lower      upper
#> 1  -Inf  2.2160363
#> 2  -Inf  0.7140021
#> 3  -Inf -0.1788658
#> 4  -Inf  3.2922861
#> 5  -Inf  1.4599988
obj2 <- conformalCf(X, Y, type = "CQR", side = "below",
                    quantiles = 0.05, outfun = "quantRF", useCV = FALSE)
predict(obj2, Xtest, alpha = 0.1)
#>       lower upper
#> 1 -2.168817   Inf
#> 2 -1.705976   Inf
#> 3 -4.792848   Inf
#> 4 -5.133214   Inf
#> 5 -2.333960   Inf

# Run split CQR with a self-defined quantile random forest
# Y, X, Xtest, quantiles should be included in the inputs
quantRF <- function(Y, X, Xtest, quantiles, ...){
    fit <- grf::quantile_forest(X, Y, quantiles = quantiles, ...)
    res <- predict(fit, Xtest, quantiles = quantiles)
    if (is.list(res) && !is.data.frame(res)){
        res <- res$predictions # for the recent update of \code{grf} package that changes the output format
    }
    if (length(quantiles) == 1){
        res <- as.numeric(res)
    } else {
        res <- as.matrix(res)
    }
    return(res)
}
obj <- conformalCf(X, Y, type = "CQR", quantiles = c(0.05, 0.95),
                   outfun = quantRF, useCV = FALSE)
predict(obj, Xtest, alpha = 0.1)
#>       lower    upper
#> 1 -2.761585 3.086488
#> 2 -2.197869 2.702938
#> 3 -5.121073 1.504797
#> 4 -5.110667 2.836832
#> 5 -3.052736 2.679462

# Run standard split conformal inference with a self-defined linear regression
# Y, X, Xtest should be included in the inputs
linearReg <- function(Y, X, Xtest){
    X <- as.data.frame(X)
    Xtest <- as.data.frame(Xtest)
    data <- data.frame(Y = Y, X)
    fit <- lm(Y ~ ., data = data)
    as.numeric(predict(fit, Xtest))
}
obj <- conformalCf(X, Y, type = "mean",
                   outfun = linearReg, useCV = FALSE)
predict(obj, Xtest, alpha = 0.1)
#>       lower      upper
#> 1 -1.786284  1.3121660
#> 2 -3.318391 -0.1431223
#> 3 -5.035091 -1.9366408
#> 4 -4.353584 -0.8314229
#> 5 -2.067116  1.0313344

# Run split CQR with a built-in psfun
# Y, X, Xtest, should be included in the inputs
obj <- conformalCf(X, Y, type = "CQR", quantiles = c(0.05, 0.95),
                   outfun = "quantRF", psfun = "RF", useCV = FALSE)
predict(obj, Xtest, alpha = 0.1)
#>       lower    upper
#> 1 -2.368242 2.929165
#> 2 -2.247054 2.731080
#> 3 -4.193513 1.532939
#> 4 -4.795010 2.283152
#> 5 -2.895413 2.522138

# Run split CQR with a self-defined function to estimate propensity scores
# Y, X, Xtest, should be included in the inputs
logitReg <- function(Y, X, Xtest, ...){
    X <- as.data.frame(X)
    Xtest <- as.data.frame(Xtest)
    data <- data.frame(Y = Y, X)
    fit <- glm(Y ~ ., data = data, family = "binomial", ...)
    as.numeric(predict(fit, Xtest, type = "response"))
}
obj <- conformalCf(X, Y, type = "CQR", quantiles = c(0.05, 0.95),
                   outfun = "quantRF", psfun = logitReg, useCV = FALSE)
predict(obj, Xtest, alpha = 0.1)
#>       lower    upper
#> 1 -2.571811 3.460430
#> 2 -3.163423 3.629465
#> 3 -5.717934 2.079342
#> 4 -6.034055 4.501632
#> 5 -3.048383 2.546090

Arguments

Value

Details

See also

Examples