Title: | Sketching of Data via Random Subspace Embeddings |
---|---|
Description: | Construct sketches of data via random subspace embeddings. For more details, see the following papers. Lee, S. and Ng, S. (2022). "Least Squares Estimation Using Sketched Data with Heteroskedastic Errors," Proceedings of the 39th International Conference on Machine Learning (ICML22), 162:12498-12520. Lee, S. and Ng, S. (2020). "An Econometric Perspective on Algorithmic Subsampling," Annual Review of Economics, 12(1): 45–80. |
Authors: | Sokbae Lee [aut, cre], Serena Ng [aut] |
Maintainer: | Sokbae Lee <[email protected]> |
License: | GPL-3 |
Version: | 0.1.2 |
Built: | 2025-02-16 04:47:38 UTC |
Source: | https://github.com/sokbae/sketching |
The Angrist-Krueger (AK) dataset is an extract from the US Censuses that was analyzed in Angrist and Krueger (1991). This version is drawn from the 1970 Census and consists of men born 1920-1929; the 1929 birth cohort is the omitted group for the year-of-birth indicators.
AK
A data frame with 247,199 rows and 42 variables:
Outcome: log weekly wages
Covariate of interest: years of education
Indicator variable for the year of birth: equals 1 if yob = 1920
Indicator variable for the year of birth: equals 1 if yob = 1921
Indicator variable for the year of birth: equals 1 if yob = 1922
Indicator variable for the year of birth: equals 1 if yob = 1923
Indicator variable for the year of birth: equals 1 if yob = 1924
Indicator variable for the year of birth: equals 1 if yob = 1925
Indicator variable for the year of birth: equals 1 if yob = 1926
Indicator variable for the year of birth: equals 1 if yob = 1927
Indicator variable for the year of birth: equals 1 if yob = 1928
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Quarter-of-birth indicator interacted with year-of-birth indicator
Constant
The dataset is publicly available on Joshua Angrist's website at https://economics.mit.edu/faculty/angrist/data1/data/angkru1991/.
Angrist, J.D. and Krueger, A.B., 1991. Does compulsory school attendance affect schooling and earnings? Quarterly Journal of Economics, 106(4), pp.979–1014. doi:10.2307/2937954
Simulates observations from the data-generating process considered in Lee and Ng (2022).
simulation_dgp(n, d, hetero = FALSE)
n | sample size |
d | dimension of regressors from a multivariate normal distribution |
hetero | TRUE if the conditional variance of the error term is heteroskedastic and FALSE if it is homoskedastic (default: FALSE) |
An S3 object with the following elements:
Y | n observations of outcomes |
X | n times d matrix of regressors |
beta | d dimensional vector of coefficients |
Lee, S. and Ng, S. (2022). "Least Squares Estimation Using Sketched Data with Heteroskedastic Errors," arXiv:2007.07781.
    # simulate 100 observations with 5 regressors and heteroskedastic errors
    data <- simulation_dgp(100, 5, hetero = TRUE)
    y <- data$Y
    x <- data$X
    # least squares on the simulated data
    model <- lm(y ~ x)
Provides a subsample of data using sketches
sketch(data, m, method = "unif")
data | (n times d)-dimensional matrix of data. |
m | (expected) subsample size that is less than n |
method | method for sketching: "unif" uniform sampling with replacement (default); "unif_without_replacement" uniform sampling without replacement; "bernoulli" Bernoulli sampling; "gaussian" Gaussian projection; "countsketch" CountSketch; "srht" subsampled randomized Hadamard transform; "fft" subsampled randomized trigonometric transforms using the real part of the fast discrete Fourier transform (stats::fft). |
(m times d)-dimensional matrix of data. For Bernoulli sampling, the number of rows is random and not necessarily equal to m.
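The difference in row counts between the sampling-based methods can be illustrated in a few lines of base R. This is a minimal sketch of how such sampling schemes are commonly written, not the package's internal code; the helper names `sample_unif` and `sample_bernoulli` are hypothetical, and any rescaling of rows that the package may apply is left out.

```r
set.seed(1)
n <- 1000 # full sample size
d <- 3    # number of columns
m <- 100  # (expected) subsample size
data <- matrix(stats::rnorm(n * d), nrow = n, ncol = d)

# hypothetical helper: uniform sampling with replacement draws exactly m rows
sample_unif <- function(data, m) {
  idx <- sample.int(nrow(data), size = m, replace = TRUE)
  data[idx, , drop = FALSE]
}

# hypothetical helper: Bernoulli sampling keeps each row independently
# with probability m / n, so the number of kept rows is random
sample_bernoulli <- function(data, m) {
  keep <- stats::runif(nrow(data)) < m / nrow(data)
  data[keep, , drop = FALSE]
}

s_unif <- sample_unif(data, m)
s_bern <- sample_bernoulli(data, m)
nrow(s_unif) # always m
nrow(s_bern) # varies from draw to draw, with expectation m
```

Under these assumptions, only the Bernoulli subsample has a row count that changes across draws, which is why m is described as the expected subsample size.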
    ## Least squares: sketch and solve
    # setup
    n <- 1e+6 # full sample size
    d <- 5    # dimension of covariates
    m <- 1e+3 # sketch size
    # generate pseudo-data
    X <- matrix(stats::rnorm(n * d), nrow = n, ncol = d)
    beta <- matrix(rep(1, d), nrow = d, ncol = 1)
    eps <- matrix(stats::rnorm(n), nrow = n, ncol = 1)
    Y <- X %*% beta + eps
    intercept <- matrix(rep(1, n), nrow = n, ncol = 1)
    # full sample including the intercept term
    fullsample <- cbind(Y, intercept, X)
    # generate a sketch using CountSketch
    s_cs <- sketch(fullsample, m, "countsketch")
    # regress the sketched outcome on the sketched intercept and covariates;
    # "- 1" drops lm's own intercept because the sketched one is already a regressor
    ls_cs <- lm(s_cs[, 1] ~ s_cs[, -1] - 1)
    # generate a sketch using SRHT
    s_srht <- sketch(fullsample, m, "srht")
    # same regression on the SRHT sketch
    ls_srht <- lm(s_srht[, 1] ~ s_srht[, -1] - 1)
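To make the sketch-and-solve idea concrete without relying on the package, the following base-R sketch implements a textbook CountSketch: every row is assigned to one of m buckets at random, multiplied by a random sign, and summed within its bucket. The function name `countsketch_sketch` is hypothetical, and the code is an illustration under these assumptions rather than the package's implementation.

```r
set.seed(123)
n <- 10000 # full sample size
d <- 3     # dimension of covariates
m <- 500   # sketch size
X <- matrix(stats::rnorm(n * d), nrow = n, ncol = d)
Y <- X %*% rep(1, d) + stats::rnorm(n)
fullsample <- cbind(Y, 1, X) # outcome, intercept, covariates

# hypothetical helper: textbook CountSketch of the rows of `data`
countsketch_sketch <- function(data, m) {
  n <- nrow(data)
  bucket <- sample.int(m, size = n, replace = TRUE)  # random bucket per row
  sgn <- sample(c(-1, 1), size = n, replace = TRUE)  # random sign per row
  sums <- rowsum(data * sgn, group = bucket)         # signed sums per bucket
  out <- matrix(0, nrow = m, ncol = ncol(data))      # empty buckets stay zero
  out[as.integer(rownames(sums)), ] <- sums
  out
}

s <- countsketch_sketch(fullsample, m)
# least squares on the sketch and on the full sample; "- 1" drops lm's own
# intercept because the (sketched) intercept column is already a regressor
fit_sketch <- lm(s[, 1] ~ s[, -1] - 1)
fit_full <- lm(fullsample[, 1] ~ fullsample[, -1] - 1)
cbind(sketch = unname(coef(fit_sketch)), full = unname(coef(fit_full)))
```

The sketched coefficients are noisier than the full-sample ones because they are computed from m rather than n rows, but both regressions estimate the same coefficient vector.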
Provides a subsample of data using sketches
sketch_leverage(data, m, method = "leverage")
data | (n times d)-dimensional matrix of data. The first column needs to be a vector of the dependent variable (Y). |
m | subsample size that is less than n |
method | method for sketching: "leverage" leverage score sampling using X (default); "root_leverage" square-root leverage score sampling using X. |
An S3 object with the following elements:
subsample | (m times d)-dimensional matrix of data |
prob | m-dimensional vector of probabilities |
Ma, P., Zhang, X., Xing, X., Ma, J. and Mahoney, M. (2020). Asymptotic Analysis of Sampling Estimators for Randomized Numerical Linear Algebra Algorithms. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR 108:1026-1035.
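Leverage score sampling can be sketched in base R as follows. The leverage scores are the diagonal entries of the hat matrix of X, computed here from a thin QR decomposition; rows are then drawn with probability proportional to the scores (or to their square roots). The function `leverage_sample` and its `root` argument are hypothetical, and the returned list only mirrors the documented `subsample` and `prob` elements as an assumption about their meaning, not as the package's implementation.

```r
set.seed(42)
n <- 2000 # full sample size
d <- 4    # dimension of covariates
m <- 200  # subsample size
X <- matrix(stats::rnorm(n * d), nrow = n, ncol = d)
Y <- X %*% rep(1, d) + stats::rnorm(n)
data <- cbind(Y, X) # first column is the dependent variable

# hypothetical helper: sample rows proportionally to (root) leverage scores of X
leverage_sample <- function(data, m, root = FALSE) {
  X <- data[, -1, drop = FALSE]
  Q <- qr.Q(qr(X))               # orthonormal basis of the column space of X
  lev <- rowSums(Q^2)            # leverage scores: diagonal of the hat matrix
  score <- if (root) sqrt(lev) else lev
  prob <- score / sum(score)     # sampling probabilities over all n rows
  idx <- sample.int(nrow(data), size = m, replace = TRUE, prob = prob)
  list(subsample = data[idx, , drop = FALSE], # m sampled rows
       prob = prob[idx],                      # probabilities of the sampled rows
       leverage = lev)                        # all n leverage scores
}

s_lev <- leverage_sample(data, m)
s_root <- leverage_sample(data, m, root = TRUE)
sum(s_lev$leverage) # equals d when X has full column rank, up to rounding
```

The leverage scores can be cross-checked against `stats::hatvalues()` applied to a regression of Y on X without an added intercept.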
    ## Least squares: sketch and solve
    # setup
    n <- 1e+6 # full sample size
    d <- 5    # dimension of covariates
    m <- 1e+3 # sketch size
    # generate pseudo-data
    X <- matrix(stats::rnorm(n * d), nrow = n, ncol = d)
    beta <- matrix(rep(1, d), nrow = d, ncol = 1)
    eps <- matrix(stats::rnorm(n), nrow = n, ncol = 1)
    Y <- X %*% beta + eps
    intercept <- matrix(rep(1, n), nrow = n, ncol = 1)
    # full sample including the intercept term
    fullsample <- cbind(Y, intercept, X)
    # generate a sketch using leverage score sampling
    s_lev <- sketch_leverage(fullsample, m, "leverage")
    # regress the subsampled outcome on the subsampled intercept and covariates,
    # without lm's own intercept and with the returned s_lev$prob as weights
    ls_lev <- lm(s_lev$subsample[, 1] ~ s_lev$subsample[, -1] - 1, weights = s_lev$prob)