Title: | Causal Inference in Case-Control and Case-Population Studies |
---|---|
Description: | Estimation and inference methods for causal relative and attributable risk in case-control and case-population studies under the monotone treatment response and monotone treatment selection assumptions. For more details, see the paper by Jun and Lee (2023), "Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions," <arXiv:2004.08318 [econ.EM]>, accepted for publication in Journal of Business & Economic Statistics. |
Authors: | Sung Jae Jun [aut], Sokbae Lee [aut, cre] |
Maintainer: | Sokbae Lee <[email protected]> |
License: | GPL-3 |
Version: | 0.3.0 |
Built: | 2025-02-12 05:26:53 UTC |
Source: | https://github.com/sokbae/ciccr |
Averages the log odds ratio using prospective or retrospective high-dimensional logistic regression
AAA_DML(y, t, x, type = "pro", k = 10)
AAA_DML(y, t, x, type = "pro", k = 10)
y |
n-dimensional vector of binary outcomes |
t |
n-dimensional vector of binary treatments |
x |
n by d matrix of covariates |
type |
'pro' if the average is based on prospective regression; 'retro' if it is based on prospective regression (default = 'pro') |
k |
number of folds in k-fold partition (default = 10) |
An S3 object of type "ciccr". The object has the following elements.
est |
a scalar estimate |
se |
standard error |
Jun, S.J. and Lee, S. (2020). Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions. https://arxiv.org/abs/2004.08318.
# use the ACS dataset included in the package y = ciccr::ACS$topincome t = ciccr::ACS$baplus age = ciccr::ACS$age x = splines::bs(age, df=6) # b-splines for age results = AAA_DML(y, t, x, 'pro', k=2)
# use the ACS dataset included in the package y = ciccr::ACS$topincome t = ciccr::ACS$baplus age = ciccr::ACS$age x = splines::bs(age, df=6) # b-splines for age results = AAA_DML(y, t, x, 'pro', k=2)
Sample extracted from American Community Survey (ACS) 2018, restricted to white males residing in California with at least a bachelor's degree. The sample is composed of 17,816 individuals whose age is restricted to be between 25 and 70.
ACS
ACS
A data frame with 17,816 rows and 4 variables:
age, in years
industry code, in four digits
1 if a respondent has a master’s degree, a professional degree, or a doctoral degree; 0 otherwise
1 if income is top-coded; 0 otherwise
A case-control sample extracted from American Community Survey (ACS) 2018, restricted to white males residing in California with at least a bachelor's degree. The original ACS dataset is not from case-control sampling, but this case-control sample is obtained by the following procedure. The case sample is composed of 921 individuals whose income is top-coded. The control sample of equal size is randomly drawn without replacement from the pool of individuals whose income is not top-coded. Age is restricted to be between 25 and 70.
ACS_CC
ACS_CC
A data frame with 1842 rows and 4 variables:
age, in years
industry code, in four digits
1 if a respondent has a master’s degree, a professional degree, or a doctoral degree; 0 otherwise
1 if income is top-coded; 0 otherwise
A case-population sample extracted from American Community Survey (ACS) 2018, restricted to white males residing in California with at least a bachelor's degree. The original ACS dataset is not from case-population sampling, but this case-population sample is obtained by the following procedure. The case sample is composed of 921 individuals whose income is top-coded. The control sample of equal size is randomly drawn without replacement from all observations and its top-coded status is coded missing. Age is restricted to be between 25 and 70.
ACS_CP
ACS_CP
A data frame with 1842 rows and 4 variables:
age, in years
industry code, in four digits
1 if a respondent has a master’s degree, a professional degree, or a doctoral degree; 0 otherwise
1 if an observation belongs to the case sample; NA otherwise
Averages the upper bound on causal attributable risk using prospective and retrospective logistic regression models under the monotone treatment response (MTR) and monotone treatment selection (MTS) assumptions.
avg_AR_logit( y, t, x, sampling = "cc", p_upper = 1L, length = 21L, interaction = TRUE, eps = 1e-08 )
avg_AR_logit( y, t, x, sampling = "cc", p_upper = 1L, length = 21L, interaction = TRUE, eps = 1e-08 )
y |
n-dimensional vector of binary outcomes |
t |
n-dimensional vector of binary treatments |
x |
n by d matrix of covariates |
sampling |
'cc' for case-control sampling; 'cp' for case-population sampling; 'rs' for random sampling (default = 'cc') |
p_upper |
specified upper bound for the unknown true case probability (default = 1) |
length |
specified length of a sequence from 0 to p_upper (default = 21) |
interaction |
TRUE if there are interaction terms in the retrospective logistic model; FALSE if not (default = TRUE) |
eps |
a small constant that determines the trimming of the estimated probabilities. Specifically, the estimate probability is trimmed to be between eps and 1-eps (default = 1e-8). |
An S3 object of type "ciccr". The object has the following elements.
est |
(length)-dimensional vector of the average of the upper bound of causal attributable risk |
pseq |
(length)-dimensional vector of a grid from 0 to p_upper |
Jun, S.J. and Lee, S. (2023). Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions. https://arxiv.org/abs/2004.08318.
Manski, C.F. (1997). Monotone Treatment Response. Econometrica, 65(6), 1311-1334.
Manski, C.F. and Pepper, J.V. (2000). Monotone Instrumental Variables: With an Application to the Returns to Schooling. Econometrica, 68(4), 997-1010.
# use the ACS_CC dataset included in the package. y = ciccr::ACS_CC$topincome t = ciccr::ACS_CC$baplus x = ciccr::ACS_CC$age results = avg_AR_logit(y, t, x, sampling = 'cc')
# use the ACS_CC dataset included in the package. y = ciccr::ACS_CC$topincome t = ciccr::ACS_CC$baplus x = ciccr::ACS_CC$age results = avg_AR_logit(y, t, x, sampling = 'cc')
Averages the log odds ratio using retrospective logistic regression.
avg_RR_logit(y, t, x, w = "control")
avg_RR_logit(y, t, x, w = "control")
y |
n-dimensional vector of binary outcomes |
t |
n-dimensional vector of binary treatments |
x |
n by d matrix of covariates |
w |
'case' if the average is conditional on the case sample; 'control' if it is conditional on the control sample; 'all' if it is based on the whole sample; default w = 'control' |
An S3 object of type "ciccr". The object has the following elements.
est |
a scalar estimate of the weighted average of the log odds ratio using retrospective logistic regression |
se |
standard error |
Jun, S.J. and Lee, S. (2023). Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions. https://arxiv.org/abs/2004.08318.
# use the ACS_CC dataset included in the package y = ciccr::ACS_CC$topincome t = ciccr::ACS_CC$baplus x = ciccr::ACS_CC$age # use 'case' to condition on the distribution of covariates given y = 1 results = avg_RR_logit(y, t, x, 'case')
# use the ACS_CC dataset included in the package y = ciccr::ACS_CC$topincome t = ciccr::ACS_CC$baplus x = ciccr::ACS_CC$age # use 'case' to condition on the distribution of covariates given y = 1 results = avg_RR_logit(y, t, x, 'case')
Provides an upper bound on the average of attributable risk under the monotone treatment response (MTR) and monotone treatment selection (MTS) assumptions.
cicc_AR( y, t, x, sampling = "cc", p_upper = 1L, cov_prob = 0.95, length = 21L, interaction = TRUE, no_boot = 0L, eps = 1e-08 )
cicc_AR( y, t, x, sampling = "cc", p_upper = 1L, cov_prob = 0.95, length = 21L, interaction = TRUE, no_boot = 0L, eps = 1e-08 )
y |
n-dimensional vector of binary outcomes |
t |
n-dimensional vector of binary treatments |
x |
n by d matrix of covariates |
sampling |
'cc' for case-control sampling; 'cp' for case-population sampling; 'rs' for random sampling (default = 'cc') |
p_upper |
a specified upper bound for the unknown true case probability (default = 1) |
cov_prob |
coverage probability of a confidence interval (default = 0.95) |
length |
specified length of a sequence from 0 to p_upper (default = 21) |
interaction |
TRUE if there are interaction terms in the retrospective logistic model; FALSE if not (default = TRUE) |
no_boot |
number of bootstrap repetitions to compute the confidence intervals (default = 0) |
eps |
a small constant that determines the trimming of the estimated probabilities. Specifically, the estimate probability is trimmed to be between eps and 1-eps (default = 1e-8). |
An S3 object of type "ciccr". The object has the following elements:
est |
(length)-dimensional vector of the upper bounds on the average of attributable risk |
ci |
(length)-dimensional vector of the upper ends of pointwise one-sided confidence intervals |
pseq |
(length)-dimensional vector of a grid from 0 to p_upper |
cov_prob |
the nominal coverage probability |
return_code |
status of existence of missing values in bootstrap replications |
Jun, S.J. and Lee, S. (2020). Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions. https://arxiv.org/abs/2004.08318.
Manski, C.F. (1997). Monotone Treatment Response. Econometrica, 65(6), 1311-1334.
Manski, C.F. and Pepper, J.V. (2000). Monotone Instrumental Variables: With an Application to the Returns to Schooling. Econometrica, 68(4), 997-1010.
# use the ACS_CC dataset included in the package. y = ciccr::ACS_CC$topincome t = ciccr::ACS_CC$baplus x = ciccr::ACS_CC$age results_AR = cicc_AR(y, t, x, sampling = 'cc', no_boot = 100)
# use the ACS_CC dataset included in the package. y = ciccr::ACS_CC$topincome t = ciccr::ACS_CC$baplus x = ciccr::ACS_CC$age results_AR = cicc_AR(y, t, x, sampling = 'cc', no_boot = 100)
Plots upper bounds on relative and attributable risk
cicc_plot( results, parameter = "RR", sampling = "cc", save_plots = FALSE, file_name = Sys.Date(), plots_ctl = 0.3 )
cicc_plot( results, parameter = "RR", sampling = "cc", save_plots = FALSE, file_name = Sys.Date(), plots_ctl = 0.3 )
results |
estimation results from either cicc_RR or cicc_AR |
parameter |
'RR' for relative risk; 'AR' for attributable risk (default = 'RR') |
sampling |
'cc' for case-control sampling; 'cp' for case-population sampling (default = 'cc') |
save_plots |
TRUE if the plots are saved as pdf files; FALSE if not (default = FALSE) |
file_name |
the pdf file name to save the plots (default = Sys.Date()) |
plots_ctl |
value to determine the topleft position of the legend in the figure a large value makes the legend far away from the confidence intervals (default = 0.3) |
A X-Y plot where the X axis shows the range of p from 0 to p_upper and the Y axis depicts both point estimates and the upper end point of the one-sided confidence intervals.
Jun, S.J. and Lee, S. (2020). Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions. https://arxiv.org/abs/2004.08318.
# use the ACS_CC dataset included in the package. y = ciccr::ACS_CC$topincome t = ciccr::ACS_CC$baplus x = ciccr::ACS_CC$age results = cicc_RR(y, t, x) cicc_plot(results)
# use the ACS_CC dataset included in the package. y = ciccr::ACS_CC$topincome t = ciccr::ACS_CC$baplus x = ciccr::ACS_CC$age results = cicc_RR(y, t, x) cicc_plot(results)
Provides upper bounds on the average of log relative risk under the monotone treatment response (MTR) and monotone treatment selection (MTS) assumptions.
cicc_RR(y, t, x, sampling = "cc", cov_prob = 0.95)
cicc_RR(y, t, x, sampling = "cc", cov_prob = 0.95)
y |
n-dimensional vector of binary outcomes |
t |
n-dimensional vector of binary treatments |
x |
n by d matrix of covariates |
sampling |
'cc' for case-control sampling; 'cp' for case-population sampling; 'rs' for random sampling (default = 'cc') |
cov_prob |
coverage probability of a uniform confidence band (default = 0.95) |
An S3 object of type "ciccr". The object has the following elements:
est |
estimates of the upper bounds on the average of log relative risk at p=0 and p=1 |
se |
pointwise standard errors at p=0 and p=1 |
ci |
the upper end points of the uniform confidence band at p=0 and p=1 |
pseq |
two end points: p=0 and p=1 |
Jun, S.J. and Lee, S. (2023). Causal Inference under Outcome-Based Sampling with Monotonicity Assumptions. https://arxiv.org/abs/2004.08318.
Manski, C.F. (1997). Monotone Treatment Response. Econometrica, 65(6), 1311-1334.
Manski, C.F. and Pepper, J.V. (2000). Monotone Instrumental Variables: With an Application to the Returns to Schooling. Econometrica, 68(4), 997-1010.
# use the ACS_CC dataset included in the package. y = ciccr::ACS_CC$topincome t = ciccr::ACS_CC$baplus x = ciccr::ACS_CC$age results_RR = cicc_RR(y, t, x, sampling = 'cc', cov_prob = 0.95)
# use the ACS_CC dataset included in the package. y = ciccr::ACS_CC$topincome t = ciccr::ACS_CC$baplus x = ciccr::ACS_CC$age results_RR = cicc_RR(y, t, x, sampling = 'cc', cov_prob = 0.95)
Case-control sample extracted from Delavande and Zafar (2019). The sample is composed of 689 students who attended either "Very Selective University" (VSU) or "Selective University" (SU).
DZ_CC
DZ_CC
A data frame with 689 rows and 5 variables:
indicator: at least one college-educated parent
indicator: attended private school before unversity
parents' monthly income (in 1000s Rs)
indicator: attended "Very Selective University" (VSU)
indicator: attended the other simply "Selective University" (SU)
https://www.journals.uchicago.edu/doi/suppl/10.1086/701808/suppl_file/2014399data.zip
Delavande, A. and Zafar, B. (2019). University Choice: The Role of Expected Earnings, Non-Pecuniary Outcomes, and Financial Constraints. Journal of Political Economy 127(5), 2343-2393.
Dataset from Fang and Gong (2017,2020). The original dataset in Fang and Gong (2017) is updated in Fang and Gong (2020) after Matsumoto (2020) pointed out data and coding errors in the original work. We use the updated version of the dataset. The sample is composed of 78,165 physicians who billed at least 20 hours per week.
FG
FG
A data frame with 78,165 rows and 5 variables:
indicator: physician is male
indicator: physician has a MD degree
experience in years
indicator: physician billed for more than 100 hours per week
indicator: number of group practice members less than 6
Fang, H. and Gong, Q. (2020) Data and Code for: Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked: Reply. Nashville, TN: American Economic Association [publisher]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-11-23. doi:10.3886/E119192V1
Fang, H. and Gong, Q. (2017). Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked. American Economic Review, 107(2), 562-91.
Matsumoto, B. (2020). Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked: Comment. American Economic Review, 110(12), 3991-4003.
Fang, H. and Gong, Q. (2020). Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked: Reply. American Economic Review, 110(12): 4004-10.
Case-control sample extracted from Fang and Gong (2020). The case-control sample is extracted from Fang and Gong (2020) by the following procedure. The case sample is composed of 2,261 flagged physicians (that is, those who billed for more than 100 hours per week). The control sample of equal size is randomly drawn without replacement from the pool of physicians who were never flagged. The sample is composed of 4,522 physicians who billed at least 20 hours per week.
FG_CC
FG_CC
A data frame with 4,522 rows and 5 variables:
indicator: physician is male
indicator: physician has a MD degree
experience in years
indicator: physician billed for more than 100 hours per week
indicator: number of group practice members less than 6
Fang, H. and Gong, Q. (2020) Data and Code for: Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked: Reply. Nashville, TN: American Economic Association [publisher]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-11-23. doi:10.3886/E119192V1
Fang, H. and Gong, Q. (2017). Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked. American Economic Review, 107(2), 562-91.
Matsumoto, B. (2020). Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked: Comment. American Economic Review, 110(12), 3991-4003.
Fang, H. and Gong, Q. (2020). Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked: Reply. American Economic Review, 110(12), 4004-10.
Case-population sample extracted from Fang and Gong (2020). The case-population sample is extracted from Fang and Gong (2020) by the following procedure. The case sample is composed of 2,261 flagged physicians (that is, those who billed for more than 100 hours per week). The control sample of equal size is randomly drawn without replacement from all observations and its flagged status is coded missing. The sample is composed of 4,522 physicians who billed at least 20 hours per week.
FG_CP
FG_CP
A data frame with 4,522 rows and 5 variables:
indicator: physician is male
indicator: physician has a MD degree
experience in years
1 if an observation belongs to the case sample; NA otherwise
indicator: number of group practice members less than 6
Fang, H. and Gong, Q. (2020) Data and Code for: Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked: Reply. Nashville, TN: American Economic Association [publisher]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2020-11-23. doi:10.3886/E119192V1
Fang, H. and Gong, Q. (2017). Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked. American Economic Review, 107(2), 562-91.
Matsumoto, B. (2020). Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked: Comment. American Economic Review, 110(12), 3991-4003.
Fang, H. and Gong, Q. (2020). Detecting Potential Overbilling in Medicare Reimbursement via Hours Worked: Reply. American Economic Review, 110(12): 4004-10.
Trimming the estimates to be strictly between 0 and 1
trim_pr(ps, eps = 1e-08)
trim_pr(ps, eps = 1e-08)
ps |
n-dimensional vector of estimated probabilities |
eps |
a small constant that determines the trimming of the estimated probabilities. Specifically, the estimate probability is trimmed to be between eps and 1-eps (default = 1e-8). |
ps_tr |
n-dimensional trimmed estimates |