| Title: | Inference on Predicted Data |
|---|---|
| Description: | Performs valid statistical inference on predicted data (IPD) using recent methods, where for a subset of the data, the outcomes have been predicted by an algorithm. Provides a wrapper function with specified defaults for the type of model and method to be used for estimation and inference. Further provides methods for tidying and summarizing results. Salerno et al., (2025) <doi:10.1093/bioinformatics/btaf055>. |
| Authors: | Stephen Salerno [aut, cre, cph] (ORCID: <https://orcid.org/0000-0003-2763-0494>), Jiacheng Miao [aut], Awan Afiaz [aut], Kentaro Hoffman [aut], Jesse Gronsbell [aut], Jianhui Gao [aut], David Cheng [aut], Anna Neufeld [aut], Qiongshi Lu [aut], Tyler H McCormick [aut], Jeffrey T Leek [aut] |
| Maintainer: | Stephen Salerno <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.4.1.9000 |
| Built: | 2026-05-10 08:55:15 UTC |
| Source: | https://github.com/ipd-tools/ipd |
A function for the calculation of the matrix A based on a single dataset.
A(X, Y, quant = NA, theta, method = c("ols", "quantile", "mean", "logistic", "poisson"))

| Argument | Description |
|---|---|
| X | Array or data.frame containing covariates. |
| Y | Array or data.frame of outcomes. |
| quant | Quantile for quantile estimation. |
| theta | Parameter theta. |
| method | Indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |

Returns the matrix A based on a single dataset.
Augment data from an ipd fit
## S3 method for class 'ipd'
augment(x, data = x@data_u, ...)

| Argument | Description |
|---|---|
| x | An object of class ipd. |
| data | A data.frame to augment. Defaults to x@data_u. |
| ... | Ignored. |

The data.frame with columns .fitted and .resid.
dat <- simdat()
fit <- ipd(Y - f ~ X1, method = "pspa", model = "ols", data = dat, label = "set_label")
augmented_df <- augment(fit)
head(augmented_df)
Calculates the optimal value of lhat for the prediction-powered confidence interval for GLMs.
calc_lhat_glm(grads, grads_hat, grads_hat_unlabeled, inv_hessian, coord = NULL, clip = FALSE)

| Argument | Description |
|---|---|
| grads | (matrix): n x p matrix gradient of the loss function with respect to the parameter evaluated at the labeled data. |
| grads_hat | (matrix): n x p matrix gradient of the loss function with respect to the model parameter evaluated using predictions on the labeled data. |
| grads_hat_unlabeled | (matrix): N x p matrix gradient of the loss function with respect to the parameter evaluated using predictions on the unlabeled data. |
| inv_hessian | (matrix): p x p matrix inverse of the Hessian of the loss function with respect to the parameter. |
| coord | (int, optional): Coordinate for which to optimize. Defaults to NULL. |
| clip | (boolean, optional): Whether to clip the value of lhat to be non-negative. Defaults to FALSE. |

(float): Optimal value of lhat in [0,1].
Helper function for Chen & Chen logistic regression estimation.
chen_logistic(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| intercept | (Logical): Do the design matrices include intercept columns? Default is TRUE. |

Another look at statistical inference with machine learning-imputed data (Gronsbell et al., 2026) doi:10.48550/arXiv.2411.19908
(list): A list containing the following:
(vector): vector of Chen & Chen logistic regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
chen_logistic(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)
Helper function for Chen & Chen OLS estimation
chen_ols(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| intercept | (Logical): Do the design matrices include intercept columns? Default is TRUE. |

Another look at statistical inference with machine learning-imputed data (Gronsbell et al., 2026) doi:10.48550/arXiv.2411.19908
(list): A list containing the following:
(vector): vector of Chen & Chen OLS regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
chen_ols(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)
Helper function for Chen & Chen Poisson regression estimation.
chen_poisson(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| intercept | (Logical): Do the design matrices include intercept columns? Default is TRUE. |

Another look at statistical inference with machine learning-imputed data (Gronsbell et al., 2026) doi:10.48550/arXiv.2411.19908
(list): A list containing the following:
(vector): vector of Chen & Chen Poisson regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
dat <- simdat(model = "poisson")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
chen_poisson(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)
Computes the empirical CDF of the data.
compute_cdf(Y, grid, w = NULL)

| Argument | Description |
|---|---|
| Y | (matrix): n x 1 matrix of observed data. |
| grid | (matrix): Grid of values to compute the CDF at. |
| w | (vector, optional): n-vector of sample weights. |

(list): Empirical CDF and its standard deviation at the specified grid points.
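As a rough illustration of the computation described above, the following base R sketch evaluates an optionally weighted empirical CDF on a grid. The function name `ecdf_on_grid`, the returned element names, the way weights are applied, and the use of the sample standard deviation are all assumptions made for this sketch, not the package's implementation.

```r
# Hypothetical helper (not part of ipd): empirical CDF evaluated on a grid.
ecdf_on_grid <- function(Y, grid, w = NULL) {
  Y <- as.numeric(Y)
  grid <- as.numeric(grid)
  if (is.null(w)) w <- rep(1, length(Y))
  # n x length(grid) matrix; entry (i, j) is w_i * 1{Y_i <= grid_j}
  ind <- outer(Y, grid, "<=") * w
  list(cdf = colMeans(ind), sd = apply(ind, 2, sd))
}

res <- ecdf_on_grid(c(1, 2, 3, 4), grid = c(0, 2.5, 4))
res$cdf
res$sd
```

Each column of the indicator matrix corresponds to one grid point, so column means give the CDF values and column standard deviations give the per-point spread.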
Computes the difference between the empirical CDFs of the data and the predictions.
compute_cdf_diff(Y, f, grid, w = NULL)

| Argument | Description |
|---|---|
| Y | (matrix): n x 1 matrix of observed data. |
| f | (matrix): n x 1 matrix of predictions. |
| grid | (matrix): Grid of values to compute the CDF at. |
| w | (vector, optional): n-vector of sample weights. |

(list): Difference between the empirical CDFs of the data and the predictions and its standard deviation at the specified grid points.
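The difference of empirical CDFs can be sketched in base R by subtracting the two indicator matrices before averaging. The name `cdf_diff_on_grid`, the returned element names, and the treatment of weights are hypothetical choices for this sketch rather than the package's code.

```r
# Hypothetical helper (not part of ipd): difference between the empirical
# CDFs of the observed data Y and the predictions f on a common grid.
cdf_diff_on_grid <- function(Y, f, grid, w = NULL) {
  Y <- as.numeric(Y)
  f <- as.numeric(f)
  grid <- as.numeric(grid)
  if (is.null(w)) w <- rep(1, length(Y))
  # Entry (i, j) is w_i * (1{Y_i <= grid_j} - 1{f_i <= grid_j})
  diff_ind <- (outer(Y, grid, "<=") - outer(f, grid, "<=")) * w
  list(diff = colMeans(diff_ind), sd = apply(diff_ind, 2, sd))
}

out <- cdf_diff_on_grid(Y = c(1, 2, 3, 4), f = c(1, 3, 3, 5), grid = c(2, 4))
out$diff
```

Working with per-observation differences, instead of subtracting two separately averaged CDFs, keeps the pairing between each outcome and its prediction when the standard deviation is computed.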
est_ini function for initial estimation
est_ini(X, Y, quant = NA, method = c("ols", "quantile", "mean", "logistic", "poisson"))

| Argument | Description |
|---|---|
| X | Array or data.frame containing covariates. |
| Y | Array or data.frame of outcomes. |
| quant | Quantile for quantile estimation. |
| method | Indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |

Returns the initial estimator.
Glance at an ipd fit
## S3 method for class 'ipd'
glance(x, ...)

| Argument | Description |
|---|---|
| x | An object of class ipd. |
| ... | Ignored. |

A one-row tibble summarizing the fit.
dat <- simdat()
fit <- ipd(Y - f ~ X1, method = "pspa", model = "ols", data = dat, label = "set_label")
glance(fit)
The main wrapper function for conducting IPD using various methods and models. It returns an object containing the fitted model components.
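ipd(formula, method, model, data, label = NULL, unlabeled_data = NULL, intercept = TRUE, alpha = 0.05, alternative = "two-sided", na_action = "na.fail", ...)

| Argument | Description |
|---|---|
| formula | An object of class formula specifying the model (for example, Y - f ~ X1). |
| method | The IPD method to be used for fitting the model. Must be one of "chen", "pdc", "postpi_analytic", "postpi_boot", "ppi", "ppi_a", "ppi_plusplus", or "pspa". |
| model | The type of downstream inferential model to be fitted, or the parameter being estimated. Must be one of "mean", "quantile", "ols", "logistic", or "poisson". |
| data | A data.frame containing either the stacked labeled and unlabeled data (used together with label) or the labeled data only (used together with unlabeled_data). |
| label | A string naming the column in data that identifies labeled versus unlabeled observations (for example, "set_label"). Defaults to NULL. |
| unlabeled_data | (optional) A data.frame containing the unlabeled data. Defaults to NULL. |
| intercept | Logical. Whether an intercept is included. Defaults to TRUE. |
| alpha | The significance level for confidence intervals. Default is 0.05. |
| alternative | A string specifying the alternative hypothesis. Defaults to "two-sided". |
| na_action | (string, optional) How missing covariate data should be handled. Defaults to "na.fail". |
| ... | Additional arguments to be passed to the fitting function. See the Details section. |
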
1. Formula:
The ipd function uses one formula argument that specifies both the
calibrating model (e.g., PostPI "relationship model", PPI "rectifier" model)
and the inferential model. These separate models will be created internally
based on the specific method called.
2. Data:
The data can be specified in two ways:
(1) A single data argument (data) containing a stacked data.frame and a label identifier (label).
(2) Two data arguments, one for the labeled data (data) and one for the unlabeled data (unlabeled_data).
For option (1), provide one data argument (data) which contains a
stacked data.frame with both the unlabeled and labeled data and a
label argument that specifies the column identifying the labeled
versus the unlabeled observations in the stacked data.frame (e.g.,
label = "set_label" if the column "set_label" in the stacked data
denotes which set an observation belongs to).
NOTE: Labeled data identifiers can be any of "l", "lab", "label", "labeled", "labelled", "tst", "test", "true"; the logical TRUE; or the non-reference category (i.e., binary 1).
Unlabeled data identifiers can be any of "u", "unlab", "unlabeled", "unlabelled", "val", "validation", "false"; the logical FALSE; or the reference category (i.e., binary 0).
For option (2), provide separate data arguments for the labeled data set
(data) and the unlabeled data set (unlabeled_data). If the
second argument is provided, the function ignores the label
identifier and assumes the data provided are not stacked.
NOTE: Not all columns in data or unlabeled_data may be used
unless explicitly referenced in the formula argument or in the
label argument (if the data are passed as one stacked data frame).
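To make the stacked-data option concrete, here is a minimal base R sketch that maps the string and logical identifiers listed above to the two sets and then splits a stacked data.frame. The helper `classify_label` is hypothetical, its case-insensitive matching is an assumption, and it does not handle the binary 0/1 coding; it is not the package's internal logic.

```r
# Hypothetical helper (not part of ipd): map identifier values to set names.
classify_label <- function(x) {
  labeled_ids <- c("l", "lab", "label", "labeled", "labelled",
                   "tst", "test", "true")
  unlabeled_ids <- c("u", "unlab", "unlabeled", "unlabelled",
                     "val", "validation", "false")
  key <- tolower(as.character(x))  # TRUE/FALSE become "true"/"false"
  out <- rep(NA_character_, length(key))
  out[key %in% labeled_ids] <- "labeled"
  out[key %in% unlabeled_ids] <- "unlabeled"
  out
}

# Toy stacked data: outcomes are only observed in the labeled rows.
stacked <- data.frame(
  Y = c(1.2, 0.4, NA, NA),
  f = c(1.0, 0.5, 0.9, 0.1),
  X1 = c(0.3, -0.2, 0.8, -1.1),
  set_label = c("labeled", "lab", "unlabeled", "u")
)
sets <- classify_label(stacked$set_label)
data_l <- stacked[sets == "labeled", ]
data_u <- stacked[sets == "unlabeled", ]
```

The resulting `data_l` and `data_u` correspond to what option (2) expects as separate data and unlabeled_data arguments.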
3. Method:
Use the method argument to specify the fitting method:
"chen": Gronsbell et al. (2026) Chen and Chen Correction
"pdc": Gan et al. (2024) Prediction Decorrelated Inference
"postpi_analytic": Wang et al. (2020) Post-Prediction Inference (PostPI) Analytic Correction
"postpi_boot": Wang et al. (2020) Post-Prediction Inference (PostPI) Bootstrap Correction
"ppi": Angelopoulos et al. (2023) Prediction-Powered Inference (PPI)
"ppi_a": Gronsbell et al. (2025) PPI "All" Correction
"ppi_plusplus": Angelopoulos et al. (2023) PPI++
"pspa": Miao et al. (2023) Assumption-Lean and Data-Adaptive Post-Prediction Inference (PSPA)
4. Model:
Use the model argument to specify the type of downstream inferential model or parameter to be estimated:
"mean": Mean value of a continuous outcome
"quantile": qth quantile of a continuous outcome
"ols": Linear regression coefficients for a continuous outcome
"logistic": Logistic regression coefficients for a binary outcome
"poisson": Poisson regression coefficients for a count outcome
The ipd wrapper function will concatenate the method and
model arguments to identify the required helper function, following
the naming convention "method_model".
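The "method_model" convention can be sketched in a few lines of base R. The functions `demo_ols` and `resolve_helper` below are hypothetical stand-ins used only to show the name construction and lookup; how the package actually resolves its helpers is not stated here.

```r
# Hypothetical stand-in for a helper such as pspa_ols; it only echoes its name.
demo_ols <- function(...) "called demo_ols"

# Build the helper name from the two arguments and look the function up.
resolve_helper <- function(method, model, envir = parent.frame()) {
  helper_name <- paste(method, model, sep = "_")
  get(helper_name, mode = "function", envir = envir)
}

helper <- resolve_helper("demo", "ols")
helper()

# The same construction applied to names from this documentation:
paste("ppi_plusplus", "ols", sep = "_")
```

A lookup by constructed name means that a combination without a matching helper fails at lookup time with an error from `get()`.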
5. Auxiliary Arguments:
The wrapper function will take method-specific auxiliary arguments (e.g.,
q for the quantile estimation models) and pass them to the helper
function through the "..." with specified defaults for simplicity.
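Forwarding auxiliary arguments through "..." follows the usual R pattern shown below. Both `demo_quantile` and `demo_wrapper` are hypothetical functions written for this sketch, and the default of 0.5 for `q` is an assumption, not the package's documented default.

```r
# Hypothetical helper with a method-specific argument `q` and its own default.
demo_quantile <- function(Y, q = 0.5) {
  unname(quantile(Y, probs = q))
}

# Hypothetical wrapper: anything supplied in `...` reaches the helper;
# anything omitted falls back to the helper's default.
demo_wrapper <- function(Y, ...) {
  demo_quantile(Y, ...)
}

y <- c(1, 2, 3, 4, 5)
med <- demo_wrapper(y)            # uses the helper default for q
q25 <- demo_wrapper(y, q = 0.25)  # q is forwarded through `...`
```

Keeping the defaults in the helper rather than in the wrapper lets the wrapper signature stay the same across methods.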
6. Other Arguments:
All other arguments that relate to all methods (e.g., alpha, ci.type), or other method-specific arguments, will have defaults.
Returns an S4 object of class ipd summarizing the model output, with the following slots:
coefficients: Named numeric vector of estimated parameters.
se: Named numeric vector of standard errors.
ci: A matrix of confidence intervals, with columns lower and upper.
coefTable: A data.frame summarizing Estimate, Std. Error, z value, and Pr(>|z|) (glm-style).
fit: The raw output list returned by the method-specific helper function.
formula: The formula used for fitting the IPD model.
data_l: The labeled data.frame used in the analysis.
data_u: The unlabeled data.frame used in the analysis.
method: A character string indicating which IPD method was applied.
model: A character string indicating the downstream inferential model.
intercept: A logical indicating whether an intercept was included.
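For readers who want to see how quantities shaped like the ci and coefTable slots relate to estimates and standard errors, the sketch below builds them from made-up numbers. It assumes normal-approximation (Wald-type) intervals and a two-sided alternative; whether the package computes these slots in exactly this way is an assumption.

```r
# Illustrative values only (not output from ipd).
est <- c("(Intercept)" = 0.50, X1 = 1.20)
se <- c("(Intercept)" = 0.10, X1 = 0.40)
alpha <- 0.05

# Wald-type z statistics, two-sided p-values, and confidence limits.
z <- est / se
p <- 2 * pnorm(-abs(z))
crit <- qnorm(1 - alpha / 2)
ci <- cbind(lower = est - crit * se, upper = est + crit * se)

# glm-style coefficient table; check.names = FALSE keeps the column labels.
coef_table <- data.frame(
  Estimate = est,
  `Std. Error` = se,
  `z value` = z,
  `Pr(>|z|)` = p,
  check.names = FALSE
)
coef_table
ci
```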
#-- Generate Example Data
dat <- simdat(n = c(300, 300, 300), effect = 1, sigma_Y = 1)
head(dat)
formula <- Y - f ~ X1
#-- Chen and Chen Correction (Gronsbell et al., 2026)
ipd(formula, method = "chen", model = "ols", data = dat, label = "set_label")
#-- Prediction Decorrelated Inference (Gan et al., 2024)
ipd(formula, method = "pdc", model = "ols", data = dat, label = "set_label")
#-- PostPI Analytic Correction (Wang et al., 2020)
ipd(formula, method = "postpi_analytic", model = "ols", data = dat, label = "set_label")
#-- PostPI Bootstrap Correction (Wang et al., 2020)
nboot <- 200
ipd(formula, method = "postpi_boot", model = "ols", data = dat, label = "set_label", nboot = nboot)
#-- PPI (Angelopoulos et al., 2023)
ipd(formula, method = "ppi", model = "ols", data = dat, label = "set_label")
#-- PPI "All" (Gronsbell et al., 2025)
ipd(formula, method = "ppi_a", model = "ols", data = dat, label = "set_label")
#-- PPI++ (Angelopoulos et al., 2023)
ipd(formula, method = "ppi_plusplus", model = "ols", data = dat, label = "set_label")
#-- PSPA (Miao et al., 2023)
ipd(formula, method = "pspa", model = "ols", data = dat, label = "set_label")
ipd: S4 class for inference on predicted data results
coefficients: Numeric vector of parameter estimates.
se: Numeric vector of standard errors.
ci: Numeric matrix of confidence intervals.
coefTable: Data frame summarizing Estimate, Std. Error, z value, and Pr(>|z|).
fit: The raw list returned by the helper function.
formula: The formula used (class "formula").
data_l: The labeled data (data.frame).
data_u: The unlabeled data (data.frame).
method: Character; which IPD method was used.
model: Character; which downstream model was fitted.
intercept: Logical; was an intercept included?
link_grad function for gradient of the link function
link_grad(t, method = c("ols", "logistic", "poisson"))

| Argument | Description |
|---|---|
| t | t |
| method | Indicates the method to be used for M-estimation. Options include "ols", "logistic", and "poisson". |

gradient of the link function
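One plausible reading of this helper is the first derivative of the mean function used in each M-estimation problem: the identity for "ols", the logistic function for "logistic", and the exponential for "poisson". The sketch below implements that reading under the hypothetical name `mean_fn_grad`; the exact definition used by the package is an assumption here.

```r
# Hypothetical sketch (not the package source): derivative of the mean function.
mean_fn_grad <- function(t, method = c("ols", "logistic", "poisson")) {
  method <- match.arg(method)
  switch(method,
    ols = rep(1, length(t)),                 # d/dt of t
    logistic = plogis(t) * (1 - plogis(t)),  # d/dt of 1 / (1 + exp(-t))
    poisson = exp(t)                         # d/dt of exp(t)
  )
}

# Central finite difference as a sanity check for the logistic case.
t0 <- c(-1, 0, 2)
h <- 1e-6
numeric_grad <- (plogis(t0 + h) - plogis(t0 - h)) / (2 * h)
grad_err <- max(abs(mean_fn_grad(t0, "logistic") - numeric_grad))
grad_err
```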
link_Hessian function for Hessians of the link function
link_Hessian(t, method = c("logistic", "poisson"))

| Argument | Description |
|---|---|
| t | t |
| method | Indicates the method to be used for M-estimation. Options include "logistic" and "poisson". |

Hessians of the link function
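Under the assumption that the quantity of interest is the second derivative of the mean function (logistic function or exponential), a base R sketch looks as follows. The name `mean_fn_hess` is hypothetical and the package's exact definition may differ.

```r
# Hypothetical sketch (not the package source): second derivative of the
# mean function for the two nonlinear models.
mean_fn_hess <- function(t, method = c("logistic", "poisson")) {
  method <- match.arg(method)
  p <- plogis(t)
  switch(method,
    logistic = p * (1 - p) * (1 - 2 * p),  # derivative of p * (1 - p)
    poisson = exp(t)                       # second derivative of exp(t)
  )
}

# Check against a central finite difference of the first derivative p * (1 - p).
first_deriv <- function(t) plogis(t) * (1 - plogis(t))
t0 <- c(-1, 0.5, 2)
h <- 1e-5
numeric_hess <- (first_deriv(t0 + h) - first_deriv(t0 - h)) / (2 * h)
hess_err <- max(abs(mean_fn_hess(t0, "logistic") - numeric_hess))
hess_err
```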
Computes the natural logarithm of 1 plus the exponential of the input, to handle large inputs.
log1pexp(x)

| Argument | Description |
|---|---|
| x | (vector): A numeric vector of inputs. |

(vector): A numeric vector where each element is the result of log(1 + exp(x)).
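To show why a dedicated function is useful for large inputs, the sketch below compares the naive formula with one common numerically stable rewrite. The name `stable_log1pexp` is hypothetical and this is not necessarily how the package implements log1pexp.

```r
# Hypothetical re-implementation for illustration (not the package source).
stable_log1pexp <- function(x) {
  # For x > 0 use log(1 + exp(x)) = x + log(1 + exp(-x)), so the selected
  # branch never relies on exp() of a large positive number.
  ifelse(x > 0, x + log1p(exp(-x)), log1p(exp(x)))
}

x <- c(-800, 0, 800)
naive <- log(1 + exp(x))      # exp(800) overflows, so the last element is Inf
stable <- stable_log1pexp(x)  # stays finite for all three inputs
```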
Computes the statistics needed for the logistic regression-based prediction-powered inference.
logistic_get_stats(est, X_l, Y_l, f_l, X_u, f_u, w_l = NULL, w_u = NULL, use_u = TRUE)

| Argument | Description |
|---|---|
| est | (vector): Point estimates of the coefficients. |
| X_l | (matrix): Covariates for the labeled data set. |
| Y_l | (vector): Labels for the labeled data set. |
| f_l | (vector): Predictions for the labeled data set. |
| X_u | (matrix): Covariates for the unlabeled data set. |
| f_u | (vector): Predictions for the unlabeled data set. |
| w_l | (vector, optional): Sample weights for the labeled data set. |
| w_u | (vector, optional): Sample weights for the unlabeled data set. |
| use_u | (boolean, optional): Whether to use the unlabeled data set. |

(list): A list containing the following:
(matrix): n x p matrix gradient of the loss function with respect to the coefficients.
(matrix): n x p matrix gradient of the loss function with respect to the coefficients, evaluated using the labeled predictions.
(matrix): N x p matrix gradient of the loss function with respect to the coefficients, evaluated using the unlabeled predictions.
(matrix): p x p matrix inverse Hessian of the loss function with respect to the coefficients.
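The four returned quantities can be illustrated with an unweighted base R sketch for the logistic loss. Everything here is an assumption made for illustration: the function name `logistic_stats_sketch`, the element names, the omission of sample weights, and the choice to form the Hessian from the labeled rows only (the package may also pool the unlabeled rows when use_u is TRUE).

```r
# Hypothetical, unweighted sketch (not the package source).
logistic_stats_sketch <- function(est, X_l, Y_l, f_l, X_u, f_u) {
  mu_l <- plogis(drop(X_l %*% est))  # fitted probabilities, labeled rows
  mu_u <- plogis(drop(X_u %*% est))  # fitted probabilities, unlabeled rows
  n <- nrow(X_l)
  # Per-observation gradients of the negative log-likelihood:
  # row i is x_i * (mu_i - target_i)
  grads <- X_l * (mu_l - drop(Y_l))
  grads_hat <- X_l * (mu_l - drop(f_l))
  grads_hat_unlabeled <- X_u * (mu_u - drop(f_u))
  # Average Hessian on the labeled rows: (1 / n) * X' diag(mu * (1 - mu)) X
  hessian <- crossprod(X_l * sqrt(mu_l * (1 - mu_l))) / n
  list(grads = grads, grads_hat = grads_hat,
       grads_hat_unlabeled = grads_hat_unlabeled,
       inv_hessian = solve(hessian))
}

# Simulated inputs for illustration only.
set.seed(123)
n <- 50; N <- 80
X_l <- cbind(1, rnorm(n)); X_u <- cbind(1, rnorm(N))
Y_l <- rbinom(n, 1, plogis(0.5 * X_l[, 2]))
f_l <- plogis(0.4 * X_l[, 2]); f_u <- plogis(0.4 * X_u[, 2])
stats <- logistic_stats_sketch(c(0, 0.5), X_l, Y_l, f_l, X_u, f_u)
sapply(stats, dim)
```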
mean_psi function for sample expectation of psi
mean_psi(X, Y, theta, quant = NA, method = c("ols", "quantile", "mean", "logistic", "poisson"))

| Argument | Description |
|---|---|
| X | Array or data.frame containing covariates. |
| Y | Array or data.frame of outcomes. |
| theta | Parameter theta. |
| quant | Quantile for quantile estimation. |
| method | Indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |

Returns the sample expectation of psi.
mean_psi_pop function for sample expectation of PSPA psi
mean_psi_pop(X_l, X_u, Y_l, f_l, f_u, w, theta, quant = NA, method = c("ols", "quantile", "mean", "logistic", "poisson"))

| Argument | Description |
|---|---|
| X_l | Array or data.frame containing observed covariates in labeled data. |
| X_u | Array or data.frame containing observed or predicted covariates in unlabeled data. |
| Y_l | Array or data.frame of observed outcomes in labeled data. |
| f_l | Array or data.frame of predicted outcomes in labeled data. |
| f_u | Array or data.frame of predicted outcomes in unlabeled data. |
| w | Weights vector for PSPA linear regression (d-dimensional, where d equals the number of covariates). |
| theta | Parameter theta. |
| quant | Quantile for quantile estimation. |
| method | Indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |

Returns the sample expectation of PSPA psi.
Computes the ordinary least squares coefficients.
ols(X, Y, return_se = FALSE)

| Argument | Description |
|---|---|
| X | (matrix): n x p matrix of covariates. |
| Y | (vector): n-vector of outcome values. |
| return_se | (boolean, optional): Whether to return the standard errors of the coefficients. |

(list): A list containing the following:
(vector): p-vector of ordinary least squares estimates of the coefficients.
(vector): If return_se == TRUE, return the p-vector of standard errors of the coefficients.
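A minimal base R sketch of ordinary least squares via the normal equations is shown below, checked against `lm()`. The function name `ols_sketch`, the list element names `est` and `se`, and the use of the classical homoskedastic standard errors are assumptions for illustration; the package's own ols helper may compute its standard errors differently.

```r
# Hypothetical sketch (not the package source): OLS via the normal equations.
ols_sketch <- function(X, Y, return_se = FALSE) {
  X <- as.matrix(X)
  Y <- as.numeric(Y)
  XtX_inv <- solve(crossprod(X))            # (X'X)^{-1}
  est <- drop(XtX_inv %*% crossprod(X, Y))  # (X'X)^{-1} X'Y
  out <- list(est = est)
  if (return_se) {
    resid <- Y - drop(X %*% est)
    sigma2 <- sum(resid^2) / (nrow(X) - ncol(X))  # residual variance
    out$se <- sqrt(diag(sigma2 * XtX_inv))
  }
  out
}

# Simulated data for illustration; X already contains an intercept column.
set.seed(1)
X <- cbind(1, rnorm(100))
Y <- 2 + 3 * X[, 2] + rnorm(100)
fit <- ols_sketch(X, Y, return_se = TRUE)
ref <- summary(lm(Y ~ X[, 2]))$coefficients
```

Here `ref` holds the coefficient table from `lm()`, whose first two columns can be compared with `fit$est` and `fit$se`.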
Computes the statistics needed for the OLS-based prediction-powered inference.
ols_get_stats(est, X_l, Y_l, f_l, X_u, f_u, w_l = NULL, w_u = NULL, use_u = TRUE)

| Argument | Description |
|---|---|
| est | (vector): Point estimates of the coefficients. |
| X_l | (matrix): Covariates for the labeled data set. |
| Y_l | (vector): Labels for the labeled data set. |
| f_l | (vector): Predictions for the labeled data set. |
| X_u | (matrix): Covariates for the unlabeled data set. |
| f_u | (vector): Predictions for the unlabeled data set. |
| w_l | (vector, optional): Sample weights for the labeled data set. |
| w_u | (vector, optional): Sample weights for the unlabeled data set. |
| use_u | (boolean, optional): Whether to use the unlabeled data set. |

(list): A list containing the following:
(matrix): n x p matrix gradient of the loss function with respect to the coefficients.
(matrix): n x p matrix gradient of the loss function with respect to the coefficients, evaluated using the labeled predictions.
(matrix): N x p matrix gradient of the loss function with respect to the coefficients, evaluated using the unlabeled predictions.
(matrix): p x p matrix inverse Hessian of the loss function with respect to the coefficients.
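For the squared-error loss, the gradients and Hessian listed above have simple closed forms, sketched here without sample weights. The function name `ols_stats_sketch`, the element names, and the labeled-only Hessian are assumptions for illustration and may not match the package's handling of weights or of use_u.

```r
# Hypothetical, unweighted sketch (not the package source).
ols_stats_sketch <- function(est, X_l, Y_l, f_l, X_u, f_u) {
  n <- nrow(X_l)
  fit_l <- drop(X_l %*% est)
  fit_u <- drop(X_u %*% est)
  # Per-observation gradients of 0.5 * (x' theta - target)^2:
  # row i is x_i * (x_i' theta - target_i)
  grads <- X_l * (fit_l - drop(Y_l))
  grads_hat <- X_l * (fit_l - drop(f_l))
  grads_hat_unlabeled <- X_u * (fit_u - drop(f_u))
  hessian <- crossprod(X_l) / n  # (1 / n) * X'X
  list(grads = grads, grads_hat = grads_hat,
       grads_hat_unlabeled = grads_hat_unlabeled,
       inv_hessian = solve(hessian))
}

# Simulated inputs for illustration only.
set.seed(42)
n <- 60; N <- 90
X_l <- cbind(1, rnorm(n)); X_u <- cbind(1, rnorm(N))
Y_l <- 1 + 2 * X_l[, 2] + rnorm(n)
f_l <- 1 + 1.8 * X_l[, 2]; f_u <- 1 + 1.8 * X_u[, 2]
theta_hat <- lm.fit(X_l, Y_l)$coefficients
stats <- ols_stats_sketch(theta_hat, X_l, Y_l, f_l, X_u, f_u)
grad_means <- colMeans(stats$grads)  # close to zero at the OLS solution
```

At the least-squares solution the labeled gradients average to zero, which is the normal-equation condition.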
optim_est function for the one-step update for obtaining the estimator
optim_est(X_l, X_u, Y_l, f_l, f_u, w, theta, quant = NA, method = c("ols", "quantile", "mean", "logistic", "poisson"))

| Argument | Description |
|---|---|
| X_l | Array or data.frame containing observed covariates in labeled data. |
| X_u | Array or data.frame containing observed or predicted covariates in unlabeled data. |
| Y_l | Array or data.frame of observed outcomes in labeled data. |
| f_l | Array or data.frame of predicted outcomes in labeled data. |
| f_u | Array or data.frame of predicted outcomes in unlabeled data. |
| w | Weights vector for PSPA linear regression (d-dimensional, where d equals the number of covariates). |
| theta | Parameter theta. |
| quant | Quantile for quantile estimation. |
| method | Indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |

Returns the estimator.
optim_weights function for the one-step update for obtaining the weights
optim_weights(X_l, X_u, Y_l, f_l, f_u, w, theta, quant = NA, method = c("ols", "quantile", "mean", "logistic", "poisson"))

| Argument | Description |
|---|---|
| X_l | Array or data.frame containing observed covariates in labeled data. |
| X_u | Array or data.frame containing observed or predicted covariates in unlabeled data. |
| Y_l | Array or data.frame of observed outcomes in labeled data. |
| f_l | Array or data.frame of predicted outcomes in labeled data. |
| f_u | Array or data.frame of predicted outcomes in unlabeled data. |
| w | Weights vector for PSPA linear regression (d-dimensional, where d equals the number of covariates). |
| theta | Parameter theta. |
| quant | Quantile for quantile estimation. |
| method | Indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |

Returns the weights.
Helper function for PDC logistic regression estimation.
pdc_logistic(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| intercept | (Logical): Do the design matrices include intercept columns? Default is TRUE. |

Prediction de-correlated inference: A safe approach for post-prediction inference (Gan et al., 2024) doi:10.1111/anzs.12429
(list): A list containing the following:
(vector): vector of PDC logistic regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pdc_logistic(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)
Helper function for PDC OLS estimation.
pdc_ols(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| intercept | (Logical): Do the design matrices include intercept columns? Default is TRUE. |

Prediction de-correlated inference: A safe approach for post-prediction inference (Gan et al., 2024) doi:10.1111/anzs.12429
(list): A list containing the following:
(vector): vector of PDC OLS regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pdc_ols(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)
Helper function for PDC Poisson regression estimation.
pdc_poisson(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| intercept | (Logical): Do the design matrices include intercept columns? Default is TRUE. |

Prediction de-correlated inference: A safe approach for post-prediction inference (Gan et al., 2024) doi:10.1111/anzs.12429
(list): A list containing the following:
(vector): vector of PDC Poisson regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
dat <- simdat(model = "poisson")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pdc_poisson(X_l, Y_l, f_l, X_u, f_u, intercept = TRUE)
Helper function for PostPI OLS estimation (analytic correction)
postpi_analytic_ols(X_l, Y_l, f_l, X_u, f_u, original = FALSE)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| original | (boolean): Logical argument to use original method from Wang et al. (2020). Defaults to FALSE; TRUE retained for posterity. |

Methods for correcting inference based on outcomes predicted by machine learning (Wang et al., 2020) doi:10.1073/pnas.2001238117
A list of outputs: estimate of the inference model parameters and corresponding standard error estimate.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
postpi_analytic_ols(X_l, Y_l, f_l, X_u, f_u)
Helper function for PostPI logistic regression (bootstrap correction)
postpi_boot_logistic(X_l, Y_l, f_l, X_u, f_u, nboot = 100, se_type = "par")

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| nboot | (integer): Number of bootstrap samples. Defaults to 100. |
| se_type | (string): Which method to calculate the standard errors. Options include "par" (parametric) or "npar" (nonparametric). Defaults to "par". |
Methods for correcting inference based on outcomes predicted by machine learning (Wang et al., 2020) doi:10.1073/pnas.2001238117
A list of outputs: estimate of inference model parameters and corresponding standard error based on both parametric and non-parametric bootstrap methods.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
postpi_boot_logistic(X_l, Y_l, f_l, X_u, f_u, nboot = 200)
Helper function for PostPI OLS estimation (bootstrap correction)
postpi_boot_ols(X_l, Y_l, f_l, X_u, f_u, nboot = 100, se_type = "par", rel_func = "lm", scale_se = TRUE, n_t = Inf)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| nboot | (integer): Number of bootstrap samples. Defaults to 100. |
| se_type | (string): Which method to calculate the standard errors. Options include "par" (parametric) or "npar" (nonparametric). Defaults to "par". |
| rel_func | (string): Method for fitting the relationship model. Options include "lm" (linear model), "rf" (random forest), and "gam" (generalized additive model). Defaults to "lm". |
| scale_se | (boolean): Logical argument to scale relationship model error variance. Defaults to TRUE; FALSE option is retained for posterity. |
| n_t | (integer, optional) Size of the dataset used to train the prediction function (necessary if |
Methods for correcting inference based on outcomes predicted by machine learning (Wang et al., 2020) doi:10.1073/pnas.2001238117
A list of outputs: estimate of inference model parameters and corresponding standard error based on both parametric and non-parametric bootstrap methods.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
postpi_boot_ols(X_l, Y_l, f_l, X_u, f_u, nboot = 200)
Helper function for PPI "All" for OLS estimation
ppi_a_ols(X_l, Y_l, f_l, X_u, f_u, w_l = NULL, w_u = NULL)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| w_l | (ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
| w_u | (ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
Another look at statistical inference with machine learning-imputed data (Gronsbell et al., 2026) doi:10.48550/arXiv.2411.19908
(list): A list containing the following:
(vector): vector of PPI OLS regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
(vector): vector of the rectifier OLS regression coefficient estimates.
dat <- simdat()
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_a_ols(X_l, Y_l, f_l, X_u, f_u)
Helper function for PPI logistic regression
ppi_logistic(X_l, Y_l, f_l, X_u, f_u, opts = NULL)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| opts | (list, optional): Options to pass to the optimizer. See ?optim for details. |
Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.1126/science.adi6000
(list): A list containing the following:
(vector): vector of PPI logistic regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
(vector): vector of the rectifier logistic regression coefficient estimates.
(matrix): covariance matrix for the gradients in the unlabeled data.
(matrix): covariance matrix for the gradients in the labeled data.
(matrix): matrix of gradients for the labeled data.
(matrix): matrix of predicted gradients for the unlabeled data.
(matrix): matrix of predicted gradients for the labeled data.
(matrix): inverse Hessian matrix.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_logistic(X_l, Y_l, f_l, X_u, f_u)
Helper function for PPI mean estimation
ppi_mean(Y_l, f_l, f_u, alpha = 0.05, alternative = "two-sided")

| Argument | Description |
|---|---|
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| alpha | (scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
| alternative | (string): Alternative hypothesis. Must be one of |
Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.1126/science.adi6000
tuple: Lower and upper bounds of the prediction-powered confidence interval for the mean.
dat <- simdat(model = "mean")
form <- Y - f ~ 1
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_mean(Y_l, f_l, f_u)
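To make the interval returned by the mean helper more concrete, the following base-R sketch computes a classical prediction-powered mean estimate and a two-sided normal-approximation interval, following the general construction in Angelopoulos et al. (2023). The function name `ppi_mean_sketch` and the simulated data are assumptions made for illustration; this is not the package's implementation and it ignores the `alternative` argument.

```r
# Hypothetical helper for illustration; not the package's ppi_mean().
ppi_mean_sketch <- function(Y_l, f_l, f_u, alpha = 0.05) {
  Y_l <- as.numeric(Y_l)
  f_l <- as.numeric(f_l)
  f_u <- as.numeric(f_u)
  n <- length(Y_l)
  N <- length(f_u)

  # Rectifier: average prediction error measured on the labeled data.
  rectifier <- mean(f_l - Y_l)

  # Point estimate: mean prediction on the unlabeled data minus the rectifier.
  est <- mean(f_u) - rectifier

  # Normal-approximation standard error from the two independent samples.
  se <- sqrt(var(f_u) / N + var(f_l - Y_l) / n)
  z <- qnorm(1 - alpha / 2)

  c(lower = est - z * se, estimate = est, upper = est + z * se)
}

# Simulated data (assumed for this sketch): true mean 2, predictions biased by +0.5.
set.seed(1)
Y_l <- rnorm(200, mean = 2)
f_l <- Y_l + rnorm(200, mean = 0.5)
f_u <- rnorm(2000, mean = 2) + rnorm(2000, mean = 0.5)
ci <- ppi_mean_sketch(Y_l, f_l, f_u)
print(ci)
```

Although the predictions in this simulation are biased upward, the rectifier removes that bias, so the interval should be centered near the true mean rather than near the mean of the predictions.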
Helper function for prediction-powered inference for OLS estimation
ppi_ols(X_l, Y_l, f_l, X_u, f_u, w_l = NULL, w_u = NULL)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| w_l | (ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
| w_u | (ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.1126/science.adi6000
(list): A list containing the following:
(vector): vector of PPI OLS regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
(vector): vector of the rectifier OLS regression coefficient estimates.
dat <- simdat()
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_ols(X_l, Y_l, f_l, X_u, f_u)
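The relationship between the PPI OLS coefficients and the rectifier coefficients listed in the return value can be illustrated with a short unweighted sketch: regress the predictions on the covariates in the unlabeled data, then subtract a regression of the prediction errors on the covariates in the labeled data. The helper `ppi_ols_est_sketch` and the simulated data are hypothetical; standard errors and sample weights are deliberately omitted, and the package may compute the estimate differently.

```r
# Hypothetical helper for illustration; not the package's ppi_ols().
ppi_ols_est_sketch <- function(X_l, Y_l, f_l, X_u, f_u) {
  # Ordinary least squares via the normal equations.
  ols <- function(X, y) solve(crossprod(X), crossprod(X, y))

  theta_f <- ols(X_u, f_u)          # coefficients fitted to the predictions
  rectifier <- ols(X_l, f_l - Y_l)  # coefficients fitted to the prediction errors

  drop(theta_f - rectifier)
}

# Simulated data (assumed): Y = 1 + 2 * x + noise, predictions shifted by +0.5.
set.seed(2)
n <- 300
N <- 3000
x_l <- rnorm(n)
x_u <- rnorm(N)
X_l <- cbind(1, x_l)
X_u <- cbind(1, x_u)
Y_l <- 1 + 2 * x_l + rnorm(n)
f_l <- Y_l + 0.5 + rnorm(n, sd = 0.5)
f_u <- 1 + 2 * x_u + rnorm(N) + 0.5 + rnorm(N, sd = 0.5)

theta_hat <- ppi_ols_est_sketch(X_l, Y_l, f_l, X_u, f_u)
print(theta_hat)
```

In this simulation the shift in the predictions only affects the intercept, and the rectifier absorbs it, so the corrected coefficients should land near the simulated intercept and slope.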
Helper function for PPI++ logistic regression
ppi_plusplus_logistic(X_l, Y_l, f_l, X_u, f_u, lhat = NULL, coord = NULL, opts = NULL, w_l = NULL, w_u = NULL)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| lhat | (float, optional): Power-tuning parameter (see doi:10.48550/arXiv.2311.01453). The default value, |
| coord | (int, optional): Coordinate for which to optimize |
| opts | (list, optional): Options to pass to the optimizer. See ?optim for details. |
| w_l | (ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
| w_u | (ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.48550/arXiv.2311.01453
(list): A list containing the following:
(vector): vector of PPI++ logistic regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
(float): estimated power-tuning parameter.
(vector): vector of the rectifier logistic regression coefficient estimates.
(matrix): covariance matrix for the gradients in the unlabeled data.
(matrix): covariance matrix for the gradients in the labeled data.
(matrix): matrix of gradients for the labeled data.
(matrix): matrix of predicted gradients for the unlabeled data.
(matrix): matrix of predicted gradients for the labeled data.
(matrix): inverse Hessian matrix.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_logistic(X_l, Y_l, f_l, X_u, f_u)
Helper function for PPI++ logistic regression (point estimate)
ppi_plusplus_logistic_est(X_l, Y_l, f_l, X_u, f_u, lhat = NULL, coord = NULL, opts = NULL, w_l = NULL, w_u = NULL)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| lhat | (float, optional): Power-tuning parameter (see doi:10.48550/arXiv.2311.01453). The default value, |
| coord | (int, optional): Coordinate for which to optimize |
| opts | (list, optional): Options to pass to the optimizer. See ?optim for details. |
| w_l | (ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
| w_u | (ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.48550/arXiv.2311.01453
(vector): vector of prediction-powered point estimates of the logistic regression coefficients.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_logistic_est(X_l, Y_l, f_l, X_u, f_u)
Helper function for PPI++ mean estimation
ppi_plusplus_mean(Y_l, f_l, f_u, alpha = 0.05, alternative = "two-sided", lhat = NULL, coord = NULL, w_l = NULL, w_u = NULL)

| Argument | Description |
|---|---|
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| alpha | (scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
| alternative | (string): Alternative hypothesis. Must be one of |
| lhat | (float, optional): Power-tuning parameter (see doi:10.48550/arXiv.2311.01453). The default value, |
| coord | (int, optional): Coordinate for which to optimize |
| w_l | (ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
| w_u | (ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.48550/arXiv.2311.01453
tuple: Lower and upper bounds of the prediction-powered confidence interval for the mean.
dat <- simdat(model = "mean")
form <- Y - f ~ 1
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_mean(Y_l, f_l, f_u)
Helper function for PPI++ mean estimation (point estimate)
ppi_plusplus_mean_est(Y_l, f_l, f_u, lhat = NULL, coord = NULL, w_l = NULL, w_u = NULL)

| Argument | Description |
|---|---|
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| lhat | (float, optional): Power-tuning parameter (see doi:10.48550/arXiv.2311.01453). The default value, |
| coord | (int, optional): Coordinate for which to optimize |
| w_l | (ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
| w_u | (ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.48550/arXiv.2311.01453
float or ndarray: Prediction-powered point estimate of the mean.
dat <- simdat(model = "mean")
form <- Y - f ~ 1
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_mean_est(Y_l, f_l, f_u)
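As a rough illustration of what a power-tuning parameter does in mean estimation, the sketch below estimates it from the data using the closed form given for the mean in the PPI++ paper (doi:10.48550/arXiv.2311.01453) and plugs it into the tuned point estimate. The helper `ppi_pp_mean_sketch`, the clipping of the parameter to [0, 1], and the simulated data are assumptions of this sketch; the package's own handling of `lhat`, `coord`, and the sample weights may differ.

```r
# Hypothetical helper for illustration; not the package's ppi_plusplus_mean_est().
ppi_pp_mean_sketch <- function(Y_l, f_l, f_u) {
  Y_l <- as.numeric(Y_l)
  f_l <- as.numeric(f_l)
  f_u <- as.numeric(f_u)
  n <- length(Y_l)
  N <- length(f_u)

  # Estimated power-tuning parameter for the mean:
  # Cov(Y, f) / ((1 + n / N) * Var(f)), with Var(f) pooled over both samples.
  lhat <- cov(Y_l, f_l) / ((1 + n / N) * var(c(f_l, f_u)))

  # Assumption of this sketch: keep the parameter inside [0, 1].
  lhat <- min(max(lhat, 0), 1)

  # Tuned estimate: labeled mean plus a weighted prediction correction.
  est <- mean(Y_l) + lhat * (mean(f_u) - mean(f_l))

  list(est = est, lhat = lhat)
}

# Simulated data (assumed): true mean 2, noisy predictions biased by +0.5.
set.seed(3)
Y_l <- rnorm(200, mean = 2)
f_l <- Y_l + rnorm(200, mean = 0.5)
f_u <- rnorm(2000, mean = 2) + rnorm(2000, mean = 0.5)
res <- ppi_pp_mean_sketch(Y_l, f_l, f_u)
print(res)
```

A value of 0 for the tuning parameter reduces the estimate to the labeled-data mean, while a value of 1 recovers the classical prediction-powered estimate; intermediate values trade off the two depending on how informative the predictions are.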
Helper function for PPI++ OLS estimation
ppi_plusplus_ols(X_l, Y_l, f_l, X_u, f_u, lhat = NULL, coord = NULL, w_l = NULL, w_u = NULL)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| lhat | (float, optional): Power-tuning parameter (see doi:10.48550/arXiv.2311.01453). The default value, |
| coord | (int, optional): Coordinate for which to optimize |
| w_l | (ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
| w_u | (ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.48550/arXiv.2311.01453
(list): A list containing the following:
(vector): vector of PPI++ OLS regression coefficient estimates.
(vector): vector of standard errors of the coefficients.
(float): estimated power-tuning parameter.
(vector): vector of the rectifier OLS regression coefficient estimates.
(matrix): covariance matrix for the gradients in the unlabeled data.
(matrix): covariance matrix for the gradients in the labeled data.
(matrix): matrix of gradients for the labeled data.
(matrix): matrix of predicted gradients for the unlabeled data.
(matrix): matrix of predicted gradients for the labeled data.
(matrix): inverse Hessian matrix.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_ols(X_l, Y_l, f_l, X_u, f_u)
Helper function for PPI++ OLS estimation (point estimate)
ppi_plusplus_ols_est(X_l, Y_l, f_l, X_u, f_u, lhat = NULL, coord = NULL, w_l = NULL, w_u = NULL)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| lhat | (float, optional): Power-tuning parameter (see doi:10.48550/arXiv.2311.01453). The default value, |
| coord | (int, optional): Coordinate for which to optimize |
| w_l | (ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
| w_u | (ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.48550/arXiv.2311.01453
(vector): vector of prediction-powered point estimates of the OLS coefficients.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_ols_est(X_l, Y_l, f_l, X_u, f_u)
Helper function for PPI++ quantile estimation
ppi_plusplus_quantile(Y_l, f_l, f_u, q, alpha = 0.05, exact_grid = FALSE, w_l = NULL, w_u = NULL)

| Argument | Description |
|---|---|
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| q | (float): Quantile to estimate. Must be in the range (0, 1). |
| alpha | (scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
| exact_grid | (bool, optional): Whether to compute the exact solution (TRUE) or an approximate solution based on a linearly spaced grid of 5000 values (FALSE). |
| w_l | (ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
| w_u | (ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.48550/arXiv.2311.01453
tuple: Lower and upper bounds of the prediction-powered confidence interval for the quantile.
dat <- simdat(model = "quantile")
form <- Y - f ~ X1
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_quantile(Y_l, f_l, f_u, q = 0.5)
Helper function for PPI++ quantile estimation (point estimate)
ppi_plusplus_quantile_est(Y_l, f_l, f_u, q, exact_grid = FALSE, w_l = NULL, w_u = NULL)

| Argument | Description |
|---|---|
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| q | (float): Quantile to estimate. Must be in the range (0, 1). |
| exact_grid | (bool, optional): Whether to compute the exact solution (TRUE) or an approximate solution based on a linearly spaced grid of 5000 values (FALSE). |
| w_l | (ndarray, optional): Sample weights for the labeled data set. Defaults to a vector of ones. |
| w_u | (ndarray, optional): Sample weights for the unlabeled data set. Defaults to a vector of ones. |
PPI++: Efficient Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.48550/arXiv.2311.01453
(float): Prediction-powered point estimate of the quantile.
dat <- simdat(model = "quantile")
form <- Y - f ~ 1
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_plusplus_quantile_est(Y_l, f_l, f_u, q = 0.5)
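The approximate, grid-based solution described for `exact_grid = FALSE` can be pictured with the following unweighted sketch, which evaluates a rectified empirical CDF on a linearly spaced grid and returns the grid point whose CDF value is closest to `q`. The helper name `ppi_quantile_est_sketch`, the `n_grid` argument, and the simulated data are hypothetical, and the package's own grid construction and weighting may differ.

```r
# Hypothetical helper for illustration; not the package's quantile helpers.
ppi_quantile_est_sketch <- function(Y_l, f_l, f_u, q, n_grid = 5000) {
  Y_l <- as.numeric(Y_l)
  f_l <- as.numeric(f_l)
  f_u <- as.numeric(f_u)

  # Linearly spaced grid spanning all observed outcomes and predictions.
  all_vals <- c(Y_l, f_l, f_u)
  grid <- seq(min(all_vals), max(all_vals), length.out = n_grid)

  # Rectified CDF: CDF of unlabeled predictions, corrected by the gap between
  # the CDFs of outcomes and predictions in the labeled data.
  rectified_cdf <- ecdf(f_u)(grid) + ecdf(Y_l)(grid) - ecdf(f_l)(grid)

  # Grid point whose rectified CDF value is closest to the target quantile.
  grid[which.min(abs(rectified_cdf - q))]
}

# Simulated data (assumed): true median 2, noisy predictions biased by +0.5.
set.seed(4)
Y_l <- rnorm(200, mean = 2)
f_l <- Y_l + rnorm(200, mean = 0.5)
f_u <- rnorm(2000, mean = 2) + rnorm(2000, mean = 0.5)
q_hat <- ppi_quantile_est_sketch(Y_l, f_l, f_u, q = 0.5)
print(q_hat)
```

A finer grid gives a closer approximation at higher cost; an exact variant would instead evaluate the rectified CDF at the observed data values themselves.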
Helper function for PPI quantile estimation
ppi_quantile(Y_l, f_l, f_u, q, alpha = 0.05, exact_grid = FALSE)

| Argument | Description |
|---|---|
| Y_l | (vector): n-vector of labeled outcomes. |
| f_l | (vector): n-vector of predictions in the labeled data. |
| f_u | (vector): N-vector of predictions in the unlabeled data. |
| q | (float): Quantile to estimate. Must be in the range (0, 1). |
| alpha | (scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
| exact_grid | (bool, optional): Whether to compute the exact solution (TRUE) or an approximate solution based on a linearly spaced grid of 5000 values (FALSE). |
Prediction Powered Inference (Angelopoulos et al., 2023) doi:10.1126/science.adi6000
tuple: Lower and upper bounds of the prediction-powered confidence interval for the quantile.
dat <- simdat(model = "quantile")
form <- Y - f ~ X1
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
ppi_quantile(Y_l, f_l, f_u, q = 0.5)
Print ipd fit
## S3 method for class 'ipd'
print(x, ...)

| Argument | Description |
|---|---|
| x | An object of class |
| ... | Ignored. |
Invisibly returns x.
Print summary.ipd
## S3 method for class 'summary.ipd'
print(x, ...)

| Argument | Description |
|---|---|
| x | An object of class |
| ... | Ignored. |
Invisibly returns x.
psi function for estimating equation
psi(X, Y, theta, quant = NA, method = c("ols", "quantile", "mean", "logistic", "poisson"))

| Argument | Description |
|---|---|
| X | Array or data.frame containing covariates |
| Y | Array or data.frame of outcomes |
| theta | parameter theta |
| quant | quantile for quantile estimation |
| method | indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
estimating equation
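For readers unfamiliar with M-estimation, the sketch below shows one common form of the per-observation estimating function for each of the five listed methods. It is a hypothetical stand-in named `psi_sketch`, written under the assumption of textbook score-type estimating functions; the package's `psi()` may use a different sign, scaling, or return shape.

```r
# Hypothetical helper for illustration; not the package's psi().
psi_sketch <- function(X, Y, theta, quant = NA,
                       method = c("ols", "quantile", "mean", "logistic", "poisson")) {
  method <- match.arg(method)
  X <- as.matrix(X)
  Y <- as.numeric(Y)

  switch(method,
    # Mean: residual around the scalar parameter.
    mean = matrix(Y - as.numeric(theta), ncol = 1),
    # Quantile: target level minus the indicator that Y falls at or below theta.
    quantile = matrix(quant - (Y <= as.numeric(theta)), ncol = 1),
    # Regression models: each row of X scaled by that observation's residual.
    ols = X * as.numeric(Y - X %*% theta),
    logistic = X * as.numeric(Y - plogis(X %*% theta)),
    poisson = X * as.numeric(Y - exp(X %*% theta))
  )
}

# Simulated data (assumed) to check that the estimating equation is solved
# at the usual estimates: its column means should be numerically zero.
set.seed(5)
X <- cbind(1, rnorm(100))
Y <- 1 + 2 * X[, 2] + rnorm(100)
theta_ols <- solve(crossprod(X), crossprod(X, Y))
psi_ols <- psi_sketch(X, Y, theta_ols, method = "ols")
psi_mean <- psi_sketch(X, Y, mean(Y), method = "mean")
print(colMeans(psi_ols))
```

The parameter estimate for each method is the value of theta at which the average of these per-observation contributions equals zero, which is why the column means vanish at the OLS solution.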
Helper function for PSPA logistic regression
pspa_logistic(X_l, Y_l, f_l, X_u, f_u, weights = NA, alpha = 0.05)

| Argument | Description |
|---|---|
| X_l | (matrix): n x p matrix of covariates in the labeled data. |
| Y_l | (vector): n-vector of binary labeled outcomes. |
| f_l | (vector): n-vector of binary predictions in the labeled data. |
| X_u | (matrix): N x p matrix of covariates in the unlabeled data. |
| f_u | (vector): N-vector of binary predictions in the unlabeled data. |
| weights | (array): p-dimensional array of weights vector for variance reduction. PSPA will estimate the weights if not specified. |
| alpha | (scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
Post-prediction adaptive inference (Miao et al., 2023) doi:10.48550/arXiv.2311.14220
A list of outputs: estimate of inference model parameters and corresponding standard error.
dat <- simdat(model = "logistic")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pspa_logistic(X_l, Y_l, f_l, X_u, f_u)
Helper function for PSPA mean estimation
pspa_mean(Y_l, f_l, f_u, weights = NA, alpha = 0.05)
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
weights |
(array): 1-dimensional array of weights for variance reduction. PSPA will estimate the weights if not specified. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
Post-prediction adaptive inference (Miao et al., 2023) doi:10.48550/arXiv.2311.14220
A list of outputs: estimate of inference model parameters and corresponding standard error.
dat <- simdat(model = "mean")
form <- Y - f ~ 1
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pspa_mean(Y_l = Y_l, f_l = f_l, f_u = f_u)
Helper function for PSPA OLS for linear regression
pspa_ols(X_l, Y_l, f_l, X_u, f_u, weights = NA, alpha = 0.05)
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
weights |
(array): p-dimensional array of weights for variance reduction. PSPA will estimate the weights if not specified. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
Post-prediction adaptive inference (Miao et al., 2023) doi:10.48550/arXiv.2311.14220
A list of outputs: estimate of inference model parameters and corresponding standard error.
dat <- simdat(model = "ols")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pspa_ols(X_l, Y_l, f_l, X_u, f_u)
Helper function for PSPA Poisson regression
pspa_poisson(X_l, Y_l, f_l, X_u, f_u, weights = NA, alpha = 0.05)
X_l |
(matrix): n x p matrix of covariates in the labeled data. |
Y_l |
(vector): n-vector of labeled count outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
X_u |
(matrix): N x p matrix of covariates in the unlabeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
weights |
(array): p-dimensional array of weights for variance reduction. PSPA will estimate the weights if not specified. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
Post-prediction adaptive inference (Miao et al., 2023) doi:10.48550/arXiv.2311.14220
A list of outputs: estimate of inference model parameters and corresponding standard error.
dat <- simdat(model = "poisson")
form <- Y - f ~ X1
X_l <- model.matrix(form, data = dat[dat$set_label == "labeled", ])
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
X_u <- model.matrix(form, data = dat[dat$set_label == "unlabeled", ])
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pspa_poisson(X_l, Y_l, f_l, X_u, f_u)
Helper function for PSPA quantile estimation
pspa_quantile(Y_l, f_l, f_u, q, weights = NA, alpha = 0.05)
Y_l |
(vector): n-vector of labeled outcomes. |
f_l |
(vector): n-vector of predictions in the labeled data. |
f_u |
(vector): N-vector of predictions in the unlabeled data. |
q |
(float): Quantile to estimate. Must be in the range (0, 1). |
weights |
(array): 1-dimensional array of weights for variance reduction. PSPA will estimate the weights if not specified. |
alpha |
(scalar): type I error rate for hypothesis testing - values in (0, 1); defaults to 0.05. |
Post-prediction adaptive inference (Miao et al., 2023) doi:10.48550/arXiv.2311.14220
A list of outputs: estimate of inference model parameters and corresponding standard error.
dat <- simdat(model = "quantile")
form <- Y - f ~ 1
Y_l <- dat[dat$set_label == "labeled", all.vars(form)[1]] |> matrix(ncol = 1)
f_l <- dat[dat$set_label == "labeled", all.vars(form)[2]] |> matrix(ncol = 1)
f_u <- dat[dat$set_label == "unlabeled", all.vars(form)[2]] |> matrix(ncol = 1)
pspa_quantile(Y_l = Y_l, f_l = f_l, f_u = f_u, q = 0.5)
pspa_y function conducts post-prediction M-Estimation.
pspa_y( X_l = NA, X_u = NA, Y_l, f_l, f_u, alpha = 0.05, weights = NA, quant = NA, intercept = FALSE, method )
X_l |
Array or data.frame containing observed covariates in labeled data. |
X_u |
Array or data.frame containing observed or predicted covariates in unlabeled data. |
Y_l |
Array or data.frame of observed outcomes in labeled data. |
f_l |
Array or data.frame of predicted outcomes in labeled data. |
f_u |
Array or data.frame of predicted outcomes in unlabeled data. |
alpha |
Specifies the confidence level as 1 - alpha for confidence intervals. |
weights |
Weights vector for PSPA linear regression (d-dimensional, where d equals the number of covariates). |
quant |
quantile for quantile estimation |
intercept |
Boolean indicating whether the input covariate data already contain an intercept column (TRUE if they do). |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
A summary table presenting point estimates, standard error, confidence intervals (1 - alpha), P-values, and weights.
Computes the rectified CDF of the data.
rectified_cdf(Y_l, f_l, f_u, grid, w_l = NULL, w_u = NULL)
Y_l |
(vector): Gold-standard labels. |
f_l |
(vector): Predictions corresponding to the gold-standard labels. |
f_u |
(vector): Predictions corresponding to the unlabeled data. |
grid |
(vector): Grid of values to compute the CDF at. |
w_l |
(vector, optional): Sample weights for the labeled data set. |
w_u |
(vector, optional): Sample weights for the unlabeled data set. |
(vector): Rectified CDF of the data at the specified grid points.
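As a rough illustration of the idea, the sketch below assumes the rectified CDF takes the usual prediction-powered form: the empirical CDF of the unlabeled predictions plus a correction estimated on the labeled data. The function name `rectified_cdf_sketch` is hypothetical, sample weights are ignored, and the package's own implementation may differ.

```r
# Hypothetical, unweighted sketch; not the package's implementation.
rectified_cdf_sketch <- function(Y_l, f_l, f_u, grid) {
  # Empirical CDF of a sample evaluated at each grid point
  ecdf_at <- function(x, grid) {
    vapply(grid, function(g) mean(x <= g), numeric(1))
  }
  # CDF of the unlabeled predictions plus the labeled-data correction
  ecdf_at(f_u, grid) + (ecdf_at(Y_l, grid) - ecdf_at(f_l, grid))
}

set.seed(1)
Y_l <- rnorm(200)                            # gold-standard labels
f_l <- Y_l + rnorm(200, sd = 0.5)            # predictions for labeled data
f_u <- rnorm(1000) + rnorm(1000, sd = 0.5)   # predictions for unlabeled data
grid <- seq(-2, 2, by = 0.5)

cdf_hat <- rectified_cdf_sketch(Y_l, f_l, f_u, grid)
```

When the predictions on the labeled data equal the labels, the correction term vanishes and the result reduces to the empirical CDF of the unlabeled predictions.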
Computes a rectified p-value.
rectified_p_value( rectifier, rectifier_std, imputed_mean, imputed_std, null = 0, alternative = "two-sided" )
rectifier |
(float or vector): Rectifier value. |
rectifier_std |
(float or vector): Rectifier standard deviation. |
imputed_mean |
(float or vector): Imputed mean. |
imputed_std |
(float or vector): Imputed standard deviation. |
null |
(float, optional): Value of the null hypothesis to be tested. Defaults to 0. |
alternative |
(str, optional): Alternative hypothesis, either 'two-sided', 'larger' or 'smaller'. |
(float or vector): The rectified p-value.
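A minimal sketch of how such a p-value could be formed is given below. It assumes that the rectified estimate is the sum of the imputed mean and the rectifier, that the two components are independent so their variances add, and that the `_std` arguments are standard errors. These are assumptions for illustration, and `rectified_p_value_sketch` is a hypothetical name rather than the package function.

```r
# Hypothetical sketch under the stated assumptions.
rectified_p_value_sketch <- function(rectifier, rectifier_std,
                                     imputed_mean, imputed_std,
                                     null = 0, alternative = "two-sided") {
  estimate <- imputed_mean + rectifier             # rectified point estimate
  se <- sqrt(imputed_std^2 + rectifier_std^2)      # combined standard error
  z <- (estimate - null) / se
  switch(alternative,
    "two-sided" = 2 * pnorm(-abs(z)),
    "larger"    = pnorm(z, lower.tail = FALSE),
    "smaller"   = pnorm(z),
    stop("alternative must be 'two-sided', 'larger' or 'smaller'")
  )
}

p_two <- rectified_p_value_sketch(
  rectifier = 0.1, rectifier_std = 0.05,
  imputed_mean = 0.4, imputed_std = 0.02
)
```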
Display a concise summary of an ipd S4 object, including
method, model, formula, and a glm-style coefficient table.
## S4 method for signature 'ipd' show(object)
object |
An object of S4 class ipd. |
Invisibly returns object after printing.
Sigma_cal function for the variance-covariance matrix of the
estimating equation
Sigma_cal( X_l, X_u, Y_l, f_l, f_u, w, theta, quant = NA, method = c("ols", "quantile", "mean", "logistic", "poisson") )
X_l |
Array or data.frame containing observed covariates in labeled data. |
X_u |
Array or data.frame containing observed or predicted covariates in unlabeled data. |
Y_l |
Array or data.frame of observed outcomes in labeled data. |
f_l |
Array or data.frame of predicted outcomes in labeled data. |
f_u |
Array or data.frame of predicted outcomes in unlabeled data. |
w |
Weights vector for PSPA linear regression (d-dimensional, where d equals the number of covariates). |
theta |
parameter theta |
quant |
quantile for quantile estimation |
method |
indicates the method to be used for M-estimation. Options include "mean", "quantile", "ols", "logistic", and "poisson". |
variance-covariance matrix of the estimating equation
sim_data_y for simulation with ML-predicted Y
sim_data_y(r = 0.9, binary = FALSE)
r |
imputation correlation |
binary |
simulate binary outcome or not |
simulated data
Data generation function for various underlying models
simdat( n = c(300, 300, 300), effect = 1, sigma_Y = 1, model = "ols", shift = 0, scale = 1 )
n |
Integer vector of size 3 indicating the sample sizes in the training, labeled, and unlabeled data sets, respectively |
effect |
Regression coefficient for the first variable of interest for inference. Defaults to 1. |
sigma_Y |
Residual variance for the generated outcome. Defaults to 1. |
model |
The type of model to be generated. Must be one of "mean", "quantile", "ols", "logistic", or "poisson". Defaults to "ols". |
shift |
Scalar shift of the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 0. |
scale |
Scaling factor for the predictions for continuous outcomes (i.e., "mean", "quantile", and "ols"). Defaults to 1. |
The simdat function generates three datasets consisting of independent
realizations of $Y$ (for model = "mean" or
"quantile"), or $\{Y, \boldsymbol{X}\}$ (for model =
"ols", "logistic", or "poisson"): a training
dataset of size $n_t$, a labeled dataset of size $n_l$, and
an unlabeled dataset of size $n_u$. These sizes are specified, in that
order, by the three elements of the argument n.
NOTE: In the unlabeled data subset, outcome data are still generated
to facilitate a benchmark for comparison with an "oracle" model that uses
the true outcome values for estimation and inference.
Generating Data
For "mean" and "quantile", we simulate a continuous outcome,
$Y \in \mathbb{R}$, with mean given by the effect argument and
error variance given by the sigma_Y argument.
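For "ols", "logistic", or "poisson" models, predictor
data, $\boldsymbol{X} \in \mathbb{R}^4$, are simulated such that the
$i$th observation follows a standard multivariate normal distribution
with a zero mean vector and identity covariance matrix:
$$\boldsymbol{X}_i = (X_{i1}, X_{i2}, X_{i3}, X_{i4}) \sim \mathcal{N}_4(\boldsymbol{0}, \boldsymbol{I}).$$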
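Because the covariance matrix is the identity, drawing from this distribution amounts to drawing four independent standard normal columns. The snippet below is a small base-R sketch of that step only; the sample size `n_obs` and the object names are illustrative assumptions, not taken from the package source.

```r
set.seed(123)
n_obs <- 300  # hypothetical sample size for one data subset

# Four independent N(0, 1) columns are equivalent to a draw from N_4(0, I)
X <- matrix(rnorm(n_obs * 4), nrow = n_obs, ncol = 4,
            dimnames = list(NULL, paste0("X", 1:4)))

colMeans(X)       # sample means, each near 0
round(cov(X), 2)  # sample covariance, near the 4 x 4 identity matrix
```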
For "ols", a continuous outcome $Y \in \mathbb{R}$ is simulated
to depend on $X_1$ through a linear term with the effect size specified
by the effect argument, while the other predictors,
$X_2$, $X_3$, and $X_4$, have nonlinear effects. The outcome also
includes an additive error term, $\varepsilon_y$, whose variance is
specified by the sigma_Y argument.
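For "logistic", we simulate the success probability,
$\Pr(Y_i = 1 \mid \boldsymbol{X}_i)$, from the predictors and generate
$Y_i$ as a Bernoulli draw with that probability.
For "poisson", we simulate the rate, $\lambda_i$, from the predictors
and generate $Y_i$ as a Poisson draw with that rate.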
Generating Predictions
To generate predicted outcomes for "mean" and "quantile", we
simulate a continuous variable with mean given by the empirical mean of the
training data and error variance given by the sigma_Y argument.
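For "ols", we fit a generalized additive model (GAM) on the
simulated training dataset and calculate predictions for the
labeled and unlabeled datasets as deterministic functions of
$\boldsymbol{X}$. Specifically, we fit the following GAM:
$$Y^{\mathcal{T}} = s_0 + s_1(X_1^{\mathcal{T}}) + s_2(X_2^{\mathcal{T}}) + s_3(X_3^{\mathcal{T}}) + s_4(X_4^{\mathcal{T}}) + \varepsilon_p,$$
where $\mathcal{T}$ denotes the training dataset, $s_0$ is an
intercept term, and $s_1(\cdot)$, $s_2(\cdot)$, $s_3(\cdot)$,
and $s_4(\cdot)$ are smoothing spline functions for $X_1$, $X_2$,
$X_3$, and $X_4$, respectively, with three target equivalent degrees
of freedom. Residual error is modeled as $\varepsilon_p$.
Predictions for labeled and unlabeled datasets are calculated as:
$$f(\boldsymbol{X}^{\mathcal{L} \cup \mathcal{U}}) = \hat{s}_0 + \hat{s}_1(X_1^{\mathcal{L} \cup \mathcal{U}}) + \hat{s}_2(X_2^{\mathcal{L} \cup \mathcal{U}}) + \hat{s}_3(X_3^{\mathcal{L} \cup \mathcal{U}}) + \hat{s}_4(X_4^{\mathcal{L} \cup \mathcal{U}}),$$
where $\hat{s}_0$, $\hat{s}_1$, $\hat{s}_2$, $\hat{s}_3$, and $\hat{s}_4$
are estimates of $s_0$, $s_1$, $s_2$, $s_3$, and $s_4$, respectively.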
NOTE: For continuous outcomes, we provide optional arguments shift and
scale to further apply a location shift and scaling factor,
respectively, to the predicted outcomes. These default to shift = 0
and scale = 1, i.e., no location shift or scaling.
For "logistic", we train k-nearest neighbors (k-NN) classifiers on
the simulated training dataset for values of $k$ ranging from 1
to 10. The optimal $k$ is chosen via cross-validation, minimizing the
misclassification error on the validation folds. Predictions for the
labeled and unlabeled datasets are obtained by applying the
k-NN classifier with the optimal $k$ to $\boldsymbol{X}$.
Specifically, for each observation in the labeled and unlabeled datasets:
$$\hat{Y} = \operatorname{argmax}_{c} \sum_{i \in \mathcal{N}_k} I(Y_i = c),$$
where $\mathcal{N}_k$ represents the set of $k$ nearest neighbors in
the training dataset, $c$ indexes the possible classes (0 or 1), and
$I(\cdot)$ is an indicator function.
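The majority-vote rule described above can be sketched in base R as follows. This is an illustrative sketch for a fixed $k$ with Euclidean distance; the cross-validated choice of $k$ is omitted, the function name `knn_predict_sketch` is hypothetical, and the package may rely on a different k-NN implementation.

```r
# Hypothetical sketch of k-NN majority voting for a binary outcome.
knn_predict_sketch <- function(X_train, Y_train, X_new, k) {
  X_train <- as.matrix(X_train)
  X_new <- as.matrix(X_new)
  apply(X_new, 1, function(x) {
    # Euclidean distance from the new point to every training point
    d <- sqrt(rowSums(sweep(X_train, 2, x)^2))
    nn <- order(d)[seq_len(k)]           # indices of the k nearest neighbors
    as.integer(mean(Y_train[nn]) > 0.5)  # majority vote; ties go to class 0
  })
}

X_train <- matrix(c(-2, -1.5, -1, 1, 1.5, 2), ncol = 1)
Y_train <- c(0, 0, 0, 1, 1, 1)
X_new <- matrix(c(-1.2, 1.8), ncol = 1)

pred <- knn_predict_sketch(X_train, Y_train, X_new, k = 3)
```

In this toy example the first new point sits among the class-0 training points and the second among the class-1 points, so the votes are unanimous in each case.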
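For "poisson", we fit a generalized linear model (GLM) with a log link
function to the simulated training dataset. The model is of the form:
$$\log(\mu^{\mathcal{T}}) = \gamma_0 + \gamma_1 X_1^{\mathcal{T}} + \gamma_2 X_2^{\mathcal{T}} + \gamma_3 X_3^{\mathcal{T}} + \gamma_4 X_4^{\mathcal{T}},$$
where $\mu^{\mathcal{T}}$ is the expected count for the response variable
in the training dataset, $\gamma_0$ is the intercept, and
$\gamma_1$, $\gamma_2$, $\gamma_3$, and $\gamma_4$ are the
regression coefficients for the predictors $X_1$, $X_2$, $X_3$,
and $X_4$, respectively.
Predictions for the labeled and unlabeled datasets are calculated as:
$$\hat{\mu}^{\mathcal{L} \cup \mathcal{U}} = \exp\left(\hat{\gamma}_0 + \hat{\gamma}_1 X_1^{\mathcal{L} \cup \mathcal{U}} + \hat{\gamma}_2 X_2^{\mathcal{L} \cup \mathcal{U}} + \hat{\gamma}_3 X_3^{\mathcal{L} \cup \mathcal{U}} + \hat{\gamma}_4 X_4^{\mathcal{L} \cup \mathcal{U}}\right),$$
where $\hat{\gamma}_0$, $\hat{\gamma}_1$, $\hat{\gamma}_2$,
$\hat{\gamma}_3$, and $\hat{\gamma}_4$ are the estimated
coefficients.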
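The fit-then-predict step for the Poisson case can be sketched with base R's `glm()` and `predict()`. The data-generating rate used here is a placeholder chosen only so the example runs; it is an assumption and not the rate used by simdat.

```r
set.seed(42)
n_train <- 300

# Hypothetical training data with four standard normal predictors
train <- as.data.frame(matrix(rnorm(n_train * 4), ncol = 4,
                              dimnames = list(NULL, paste0("X", 1:4))))
# Placeholder rate, used only to make the example runnable
train$Y <- rpois(n_train, lambda = exp(0.5 * train$X1))

# Poisson GLM with a log link, fitted on the training data
fit <- glm(Y ~ X1 + X2 + X3 + X4, family = poisson(link = "log"),
           data = train)

# New observations standing in for the labeled and unlabeled data
new_dat <- as.data.frame(matrix(rnorm(5 * 4), ncol = 4,
                                dimnames = list(NULL, paste0("X", 1:4))))

# Predicted expected counts: exp() of the estimated linear predictor
f_new <- predict(fit, newdata = new_dat, type = "response")
```

Using `type = "response"` returns predictions on the count scale, which matches exponentiating the linear predictor returned by `type = "link"`.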
A data.frame containing sum(n) rows and columns corresponding to the labeled outcome (Y), the predicted outcome (f), a character variable (set_label) indicating which data set the observation belongs to (training, labeled, or unlabeled), and four independent, normally distributed predictors (X1, X2, X3, and X4), where applicable.
#-- Mean
dat_mean <- simdat(c(100, 100, 100), effect = 1, sigma_Y = 1, model = "mean")
head(dat_mean)
#-- Linear Regression
dat_ols <- simdat(c(100, 100, 100), effect = 1, sigma_Y = 1, model = "ols")
head(dat_ols)
Summarize ipd fit
## S3 method for class 'ipd' summary(object, ...)
object |
An object of class ipd. |
... |
Ignored. |
An object of class summary.ipd containing:
The model formula.
A glm-style table of estimates, SE, z, p.
Which IPD method was used.
Which downstream model was fitted.
Logical; whether an intercept was included.
Tidy an ipd fit
## S3 method for class 'ipd' tidy(x, ...)
x |
An object of class ipd. |
... |
Ignored. |
A tibble with columns
term, estimate, std.error, conf.low, conf.high.
dat <- simdat()
fit <- ipd(Y - f ~ X1, method = "pspa", model = "ols", data = dat, label = "set_label")
tidy(fit)
Computes the weighted least squares estimate of the coefficients.
wls(X, Y, w = NULL, return_se = FALSE)
X |
(matrix): n x p matrix of covariates. |
Y |
(vector): n-vector of outcome values. |
w |
(vector, optional): n-vector of sample weights. |
return_se |
(bool, optional): Whether to return the standard errors of the coefficients. |
(list): A list containing the following:
(vector): p-vector of weighted least squares estimates of the coefficients.
(vector): If return_se == TRUE, return the p-vector of standard errors of the coefficients.
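For orientation, the weighted least squares estimate solves the weighted normal equations $(X^\top W X)\hat{\beta} = X^\top W Y$ with $W = \mathrm{diag}(w)$. The sketch below computes only the point estimate in base R; `wls_sketch` is a hypothetical name, and the standard-error computation used by the package is not reproduced here.

```r
# Hypothetical sketch: point estimates only, no standard errors.
wls_sketch <- function(X, Y, w = NULL) {
  X <- as.matrix(X)
  if (is.null(w)) w <- rep(1, nrow(X))  # unweighted case
  XtWX <- crossprod(X, w * X)           # X' W X (w scales the rows of X)
  XtWY <- crossprod(X, w * Y)           # X' W Y
  drop(solve(XtWX, XtWY))               # solve the weighted normal equations
}

set.seed(7)
n <- 50
X <- cbind(1, rnorm(n))                 # intercept column plus one covariate
Y <- drop(X %*% c(2, -1)) + rnorm(n)
w <- runif(n, 0.5, 2)

beta_hat <- wls_sketch(X, Y, w)
```

The result should agree with `coef(lm(Y ~ X - 1, weights = w))` up to numerical tolerance.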
Calculates normal confidence intervals for a given alternative at a given significance level.
zconfint_generic(mean, std_mean, alpha, alternative)
mean |
(float): Estimated normal mean. |
std_mean |
(float): Estimated standard error of the mean. |
alpha |
(float): Significance level in [0,1] |
alternative |
(string): Alternative hypothesis, either 'two-sided', 'larger' or 'smaller'. |
(vector): Lower and upper (1 - alpha) * 100% confidence limits.
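A normal-theory interval of this kind can be sketched with `qnorm()`. The two-sided case is standard; the one-sided conventions shown here (a lower bound for 'larger', an upper bound for 'smaller') are an assumption about how the alternatives map to bounds, and `zconfint_sketch` is a hypothetical name.

```r
# Hypothetical sketch of normal confidence limits.
zconfint_sketch <- function(mean, std_mean, alpha = 0.05,
                            alternative = "two-sided") {
  switch(alternative,
    "two-sided" = mean + c(-1, 1) * qnorm(1 - alpha / 2) * std_mean,
    "larger"    = c(mean - qnorm(1 - alpha) * std_mean, Inf),
    "smaller"   = c(-Inf, mean + qnorm(1 - alpha) * std_mean),
    stop("alternative must be 'two-sided', 'larger' or 'smaller'")
  )
}

ci <- zconfint_sketch(mean = 0, std_mean = 1, alpha = 0.05)
```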
Computes the z-statistic and the corresponding p-value for a given test.
zstat_generic(value1, value2, std_diff, alternative, diff = 0)
value1 |
(numeric): The first value or sample mean. |
value2 |
(numeric): The second value or sample mean. |
std_diff |
(numeric): The standard error of the difference between the two values. |
alternative |
(character): The alternative hypothesis. Can be one of "two-sided" (or "2-sided", "2s"), "larger" (or "l"), or "smaller" (or "s"). |
diff |
(numeric, optional): The hypothesized difference between the two values. Default is 0. |
(list): A list containing the following:
(numeric): The computed z-statistic.
(numeric): The corresponding p-value for the test.
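The computation can be illustrated with a short base-R sketch: the z-statistic is the difference between the two values, minus the hypothesized difference, divided by its standard error, and the p-value comes from the standard normal distribution. The function name and the list element names `zstat` and `pvalue` are hypothetical, and only the three long-form alternatives are handled.

```r
# Hypothetical sketch of a generic z-test.
zstat_sketch <- function(value1, value2, std_diff,
                         alternative = "two-sided", diff = 0) {
  z <- (value1 - value2 - diff) / std_diff
  p <- switch(alternative,
    "two-sided" = 2 * pnorm(-abs(z)),
    "larger"    = pnorm(z, lower.tail = FALSE),
    "smaller"   = pnorm(z),
    stop("alternative must be 'two-sided', 'larger' or 'smaller'")
  )
  list(zstat = z, pvalue = p)
}

res <- zstat_sketch(value1 = 1.96, value2 = 0, std_diff = 1)
```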