Title: | Methods for Conducting Nonresponse Bias Analysis (NRBA) |
---|---|
Description: | Facilitates nonresponse bias analysis (NRBA) for survey data. Such data may arise from a complex sampling design with features such as stratification, clustering, or unequal probabilities of selection. Multiple types of analyses may be conducted: comparisons of response rates across subgroups; comparisons of estimates before and after weighting adjustments; comparisons of sample-based estimates to external population totals; tests of systematic differences in covariate means between respondents and full samples; tests of independence between response status and covariates; and modeling of outcomes and response status as a function of covariates. Extensive documentation and references are provided for each type of analysis. Krenzke, Van de Kerckhove, and Mohadjer (2005) <http://www.asasrms.org/Proceedings/y2005/files/JSM2005-000572.pdf> and Lohr and Riddles (2016) <https://www150.statcan.gc.ca/n1/en/pub/12-001-x/2016002/article/14677-eng.pdf?st=q7PyNsGR> provide an overview of the methods implemented in this package. |
Authors: | Ben Schneider [aut, cre] , Jim Green [aut], Shelley Brock [aut] (Author of original SAS macro, WesNRBA), Tom Krenzke [aut] (Author of original SAS macro, WesNRBA), Michael Jones [aut] (Author of original SAS macro, WesNRBA), Wendy Van de Kerckhove [aut] (Author of original SAS macro, WesNRBA), David Ferraro [aut] (Author of original SAS macro, WesNRBA), Laura Alvarez-Rojas [aut] (Author of original SAS macro, WesNRBA), Katie Hubbell [aut] (Author of original SAS macro, WesNRBA), Westat [cph] |
Maintainer: | Ben Schneider <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.3.1 |
Built: | 2024-10-29 05:26:19 UTC |
Source: | https://github.com/cran/nrba |
This range-of-bias analysis assesses the range of possible nonresponse bias under varying assumptions about how nonrespondents differ from respondents. The range of potential bias is calculated for both unadjusted estimates (i.e., from using base weights) and nonresponse-adjusted estimates (i.e., based on nonresponse-adjusted weights).
assess_range_of_bias( survey_design, y_var, comparison_cell, status, status_codes, assumed_multiple = c(0.5, 0.75, 0.9, 1.1, 1.25, 1.5), assumed_percentile = NULL )
assess_range_of_bias( survey_design, y_var, comparison_cell, status, status_codes, assumed_multiple = c(0.5, 0.75, 0.9, 1.1, 1.25, 1.5), assumed_percentile = NULL )
survey_design |
A survey design object created with the 'survey' package |
y_var |
Name of a variable whose mean or proportion is to be estimated |
comparison_cell |
(Optional) The name of a variable in the data dividing the sample into cells. If supplied, then the analysis is based on assumptions about differences between respondents and nonrespondents within the same cell. Typically, the variable used is a nonresponse adjustment cell or post-stratification variable. |
status |
A character string giving the name of the variable representing response/eligibility status. The status variable should have at most four categories, representing eligible respondents (ER), eligible nonrespondents (EN), known ineligible cases (IE), and cases whose eligibility is unknown (UE). |
status_codes |
A named vector,
with four entries named 'ER', 'EN', 'IE', and 'UE'.
|
assumed_multiple |
One or more numeric values.
Within each nonresponse adjustment cell,
the mean for nonrespondents is assumed to be a specified multiple
of the mean for respondents. If |
assumed_percentile |
One or more numeric values, ranging from 0 to 1.
Within each nonresponse adjustment cell,
the mean of a continuous variable among nonrespondents is
assumed to equal a specified percentile of the variable among respondents.
The |
A data frame summarizing the range of bias under each assumption.
For a numeric outcome variable, there is one row per value of
assumed_multiple
or assumed_percentile
. For a categorical
outcome variable, there is one row per combination of category
and assumed_multiple
or assumed_percentile
.
The column bias_of_unadj_estimate
is the nonresponse bias
of the estimate from respondents produced using the unadjusted weights.
The column bias_of_adj_estimate
is the nonresponse bias
of the estimate from respondents produced
using nonresponse-adjusted weights, based on a weighting-class
adjustment with comparison_cell
as the weighting class variable.
If no comparison_cell
is specified, the two bias estimates
will be the same.
See Petraglia et al. (2016) for an example of a range-of-bias analysis using these methods.
Petraglia, E., Van de Kerckhove, W., and Krenzke, T. (2016). Review of the Potential for Nonresponse Bias in FoodAPS 2012. Prepared for the Economic Research Service, U.S. Department of Agriculture. Washington, D.C.
# Load example data suppressPackageStartupMessages(library(survey)) data(api) base_weights_design <- svydesign( data = apiclus1, id = ~dnum, weights = ~pw, fpc = ~fpc ) |> as.svrepdesign(type = "JK1") base_weights_design$variables$response_status <- sample( x = c("Respondent", "Nonrespondent"), prob = c(0.75, 0.25), size = nrow(base_weights_design), replace = TRUE ) # Assess range of bias for mean of `api00` # based on assuming nonrespondent means # are equal to the 25th percentile or 75th percentile # among respondents, within nonresponse adjustment cells assess_range_of_bias( survey_design = base_weights_design, y_var = "api00", comparison_cell = "stype", status = "response_status", status_codes = c("ER" = "Respondent", "EN" = "Nonrespondent", "IE" = "Ineligible", "UE" = "Unknown"), assumed_percentile = c(0.25, 0.75) ) # Assess range of bias for proportions of `sch.wide` # based on assuming nonrespondent proportions # are equal to some multiple of respondent proportions, # within nonresponse adjustment cells assess_range_of_bias( survey_design = base_weights_design, y_var = "sch.wide", comparison_cell = "stype", status = "response_status", status_codes = c("ER" = "Respondent", "EN" = "Nonrespondent", "IE" = "Ineligible", "UE" = "Unknown"), assumed_multiple = c(0.25, 0.75) )
# Load example data suppressPackageStartupMessages(library(survey)) data(api) base_weights_design <- svydesign( data = apiclus1, id = ~dnum, weights = ~pw, fpc = ~fpc ) |> as.svrepdesign(type = "JK1") base_weights_design$variables$response_status <- sample( x = c("Respondent", "Nonrespondent"), prob = c(0.75, 0.25), size = nrow(base_weights_design), replace = TRUE ) # Assess range of bias for mean of `api00` # based on assuming nonrespondent means # are equal to the 25th percentile or 75th percentile # among respondents, within nonresponse adjustment cells assess_range_of_bias( survey_design = base_weights_design, y_var = "api00", comparison_cell = "stype", status = "response_status", status_codes = c("ER" = "Respondent", "EN" = "Nonrespondent", "IE" = "Ineligible", "UE" = "Unknown"), assumed_percentile = c(0.25, 0.75) ) # Assess range of bias for proportions of `sch.wide` # based on assuming nonrespondent proportions # are equal to some multiple of respondent proportions, # within nonresponse adjustment cells assess_range_of_bias( survey_design = base_weights_design, y_var = "sch.wide", comparison_cell = "stype", status = "response_status", status_codes = c("ER" = "Respondent", "EN" = "Nonrespondent", "IE" = "Ineligible", "UE" = "Unknown"), assumed_multiple = c(0.25, 0.75) )
Calculates response rates using one of the response rate formulas defined by AAPOR (American Association of Public Opinion Research).
calculate_response_rates( data, status, status_codes = c("ER", "EN", "IE", "UE"), weights, rr_formula = "RR3", elig_method = "CASRO-subgroup", e = NULL )
calculate_response_rates( data, status, status_codes = c("ER", "EN", "IE", "UE"), weights, rr_formula = "RR3", elig_method = "CASRO-subgroup", e = NULL )
data |
A data frame containing the selected sample, one row per case. |
status |
A character string giving the name of the variable representing response/eligibility status.
The |
status_codes |
A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'.
|
weights |
(Optional) A character string giving the name of a variable representing weights in the data to use for calculating weighted response rates |
rr_formula |
A character vector including any of the following: 'RR1', 'RR3', and 'RR5'. |
elig_method |
If |
e |
(Required if |
Output consists of a data frame giving weighted and unweighted response rates. The following columns may be included, depending on the arguments supplied:
RR1_Unweighted
RR1_Weighted
RR3_Unweighted
RR3_Weighted
RR5_Unweighted
RR5_Weighted
n
: Total sample size
Nhat
: Sum of weights for the total sample
n_ER
: Number of eligible respondents
Nhat_ER
: Sum of weights for eligible respondents
n_EN
: Number of eligible nonrespondents
Nhat_EN
: Sum of weights for eligible nonrespondents
n_IE
: Number of ineligible cases
Nhat_IE
: Sum of weights for ineligible cases
n_UE
: Number of cases whose eligibility is unknown
Nhat_UE
: Sum of weights for cases whose eligibility is unknown
e_unwtd
: If RR3 is calculated, the eligibility rate estimate e used for the unweighted response rate.
e_wtd
: If RR3 is calculated, the eligibility rate estimate e used for the weighted response rate.
If the data frame is grouped (i.e. by using df %>% group_by(Region)
),
then the output contains one row per subgroup.
Denote the sample totals as follows:
ER: Total number of eligible respondents
EN: Total number of eligible non-respondents
IE: Total number of ineligible cases
UE: Total number of cases whose eligibility is unknown
For weighted response rates, these totals are calculated using weights.
The response rate formulas are then as follows:
RR1 essentially assumes that all cases with unknown eligibility are in fact eligible.
RR3 uses an estimate, e, of the eligibility rate among cases with unknown eligibility.
RR5 essentially assumes that all cases with unknown eligibility are in fact ineligible.
For RR3, an estimate, e
, of the eligibility rate among cases with unknown eligibility must be used.
AAPOR strongly recommends that the basis for the estimate should be explicitly stated and detailed.
The CASRO methods, which might be appropriate for the design, use the formula .
For elig_method='CASRO-overall'
, an estimate is calculated for the overall sample
and this single estimate is used when calculating response rates for subgroups.
For elig_method='CASRO-subgroup'
, estimates are calculated separately for each subgroup.
Please consult AAPOR's current Standard Definitions for in-depth explanations.
The American Association for Public Opinion Research. 2016. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys. 9th edition. AAPOR.
# Load example data data(involvement_survey_srs, package = "nrba") involvement_survey_srs[["RESPONSE_STATUS"]] <- sample(1:4, size = 5000, replace = TRUE) # Calculate overall response rates involvement_survey_srs %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), weights = "BASE_WEIGHT", rr_formula = "RR3", elig_method = "CASRO-overall" ) # Calculate response rates by subgroup library(dplyr) involvement_survey_srs %>% group_by(STUDENT_RACE, STUDENT_SEX) %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), weights = "BASE_WEIGHT", rr_formula = "RR3", elig_method = "CASRO-overall" ) # Compare alternative approaches for handling of cases with unknown eligiblity involvement_survey_srs %>% group_by(STUDENT_RACE) %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), rr_formula = "RR3", elig_method = "CASRO-overall" ) involvement_survey_srs %>% group_by(STUDENT_RACE) %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), rr_formula = "RR3", elig_method = "CASRO-subgroup" ) involvement_survey_srs %>% group_by(STUDENT_RACE) %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), rr_formula = "RR3", elig_method = "specified", e = 0.5 ) involvement_survey_srs %>% transform(e_by_email = ifelse(PARENT_HAS_EMAIL == "Has Email", 0.75, 0.25)) %>% group_by(PARENT_HAS_EMAIL) %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), rr_formula = "RR3", elig_method = "specified", e = "e_by_email" )
# Load example data data(involvement_survey_srs, package = "nrba") involvement_survey_srs[["RESPONSE_STATUS"]] <- sample(1:4, size = 5000, replace = TRUE) # Calculate overall response rates involvement_survey_srs %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), weights = "BASE_WEIGHT", rr_formula = "RR3", elig_method = "CASRO-overall" ) # Calculate response rates by subgroup library(dplyr) involvement_survey_srs %>% group_by(STUDENT_RACE, STUDENT_SEX) %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), weights = "BASE_WEIGHT", rr_formula = "RR3", elig_method = "CASRO-overall" ) # Compare alternative approaches for handling of cases with unknown eligiblity involvement_survey_srs %>% group_by(STUDENT_RACE) %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), rr_formula = "RR3", elig_method = "CASRO-overall" ) involvement_survey_srs %>% group_by(STUDENT_RACE) %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), rr_formula = "RR3", elig_method = "CASRO-subgroup" ) involvement_survey_srs %>% group_by(STUDENT_RACE) %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), rr_formula = "RR3", elig_method = "specified", e = 0.5 ) involvement_survey_srs %>% transform(e_by_email = ifelse(PARENT_HAS_EMAIL == "Has Email", 0.75, 0.25)) %>% group_by(PARENT_HAS_EMAIL) %>% calculate_response_rates( status = "RESPONSE_STATUS", status_codes = c("ER" = 1, "EN" = 2, "IE" = 3, "UE" = 4), rr_formula = "RR3", elig_method = "specified", e = "e_by_email" )
Tests whether response status among eligible sample cases is independent of categorical auxiliary variables, using a Chi-Square test with Rao-Scott's second-order adjustment. If the data include cases known to be ineligible or who have unknown eligibility status, the data are subsetted to only include respondents and nonrespondents known to be eligible.
chisq_test_ind_response( survey_design, status, status_codes = c("ER", "EN", "UE", "IE"), aux_vars )
chisq_test_ind_response( survey_design, status, status_codes = c("ER", "EN", "UE", "IE"), aux_vars )
survey_design |
A survey design object created with the |
status |
A character string giving the name of the variable representing response/eligibility status.
The |
status_codes |
A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
aux_vars |
A list of names of auxiliary variables. |
Please see svychisq for details of how the Rao-Scott second-order adjusted test is conducted.
A data frame containing the results of the Chi-Square test(s) of independence between response status and each auxiliary variable.
If multiple auxiliary variables are specified, the output data contains one row per auxiliary variable.
The columns of the output dataset include:
auxiliary_variable
: The name of the auxiliary variable tested
statistic
: The value of the test statistic
ndf
: Numerator degrees of freedom for the reference distribution
ddf
: Denominator degrees of freedom for the reference distribution
p_value
: The p-value of the test of independence
test_method
: Text giving the name of the statistical test
variance_method
: Text describing the method of variance estimation
Rao, JNK, Scott, AJ (1984) "On Chi-squared Tests For Multiway Contigency Tables with Proportions Estimated From Survey Data" Annals of Statistics 12:46-60.
# Create a survey design object ---- library(survey) data(involvement_survey_srs, package = "nrba") involvement_survey <- svydesign( weights = ~BASE_WEIGHT, id = ~UNIQUE_ID, data = involvement_survey_srs ) # Test whether response status varies by race or by sex ---- test_results <- chisq_test_ind_response( survey_design = involvement_survey, status = "RESPONSE_STATUS", status_codes = c( "ER" = "Respondent", "EN" = "Nonrespondent", "UE" = "Unknown", "IE" = "Ineligible" ), aux_vars = c("STUDENT_RACE", "STUDENT_SEX") ) print(test_results)
# Create a survey design object ---- library(survey) data(involvement_survey_srs, package = "nrba") involvement_survey <- svydesign( weights = ~BASE_WEIGHT, id = ~UNIQUE_ID, data = involvement_survey_srs ) # Test whether response status varies by race or by sex ---- test_results <- chisq_test_ind_response( survey_design = involvement_survey, status = "RESPONSE_STATUS", status_codes = c( "ER" = "Respondent", "EN" = "Nonrespondent", "UE" = "Unknown", "IE" = "Ineligible" ), aux_vars = c("STUDENT_RACE", "STUDENT_SEX") ) print(test_results)
Compare estimated percentages from the present survey to external estimates from a benchmark source. A Chi-Square test with Rao-Scott's second-order adjustment is used to evaluate whether the survey's estimates differ from the external estimates.
chisq_test_vs_external_estimate(survey_design, y_var, ext_ests, na.rm = TRUE)
chisq_test_vs_external_estimate(survey_design, y_var, ext_ests, na.rm = TRUE)
survey_design |
A survey design object created with the |
y_var |
Name of dependent categorical variable. |
ext_ests |
A numeric vector containing the external estimate of the percentages for each category. The vector must have names, each name corresponding to a given category. |
na.rm |
Whether to drop cases with missing values |
Please see svygofchisq for details of how the Rao-Scott second-order adjusted test is conducted.
The test statistic, statistic
is obtained by calculating the Pearson Chi-squared statistic for the estimated table of population totals. The reference distribution is a Satterthwaite approximation. The p-value is obtained by comparing statistic
/scale
to a Chi-squared distribution with df
degrees of freedom.
A data frame containing the results of the Chi-Square test(s) of whether survey-based estimates systematically differ from external estimates.
The columns of the output dataset include:
statistic
: The value of the test statistic
df
: Degrees of freedom for the reference Chi-Squared distribution
scale
: Estimated scale parameter.
p_value
: The p-value of the test of independence
test_method
: Text giving the name of the statistical test
variance_method
: Text describing the method of variance estimation
Rao, JNK, Scott, AJ (1984) "On Chi-squared Tests For Multiway Contigency Tables with Proportions Estimated From Survey Data" Annals of Statistics 12:46-60.
library(survey) # Create a survey design ---- data("involvement_survey_pop", package = "nrba") data("involvement_survey_str2s", package = "nrba") involvement_survey_sample <- svydesign( data = involvement_survey_str2s, weights = ~BASE_WEIGHT, strata = ~SCHOOL_DISTRICT, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL ) # Subset to only include survey respondents ---- involvement_survey_respondents <- subset( involvement_survey_sample, RESPONSE_STATUS == "Respondent" ) # Test whether percentages of categorical variable differ from benchmark ---- parent_email_benchmark <- c( "Has Email" = 0.85, "No Email" = 0.15 ) chisq_test_vs_external_estimate( survey_design = involvement_survey_respondents, y_var = "PARENT_HAS_EMAIL", ext_ests = parent_email_benchmark )
library(survey) # Create a survey design ---- data("involvement_survey_pop", package = "nrba") data("involvement_survey_str2s", package = "nrba") involvement_survey_sample <- svydesign( data = involvement_survey_str2s, weights = ~BASE_WEIGHT, strata = ~SCHOOL_DISTRICT, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL ) # Subset to only include survey respondents ---- involvement_survey_respondents <- subset( involvement_survey_sample, RESPONSE_STATUS == "Respondent" ) # Test whether percentages of categorical variable differ from benchmark ---- parent_email_benchmark <- c( "Has Email" = 0.85, "No Email" = 0.15 ) chisq_test_vs_external_estimate( survey_design = involvement_survey_respondents, y_var = "PARENT_HAS_EMAIL", ext_ests = parent_email_benchmark )
Calculates estimates of a mean/proportion which are cumulative with respect to a predictor variable, such as week of data collection or number of contact attempts. This can be useful for examining whether estimates are affected by decisions such as whether to extend the data collection period or make additional contact attempts.
get_cumulative_estimates( survey_design, y_var, y_var_type = NULL, predictor_variable )
get_cumulative_estimates( survey_design, y_var, y_var_type = NULL, predictor_variable )
survey_design |
A survey design object created with the |
y_var |
Name of a variable whose mean or proportion is to be estimated. |
y_var_type |
Either |
predictor_variable |
Name of a variable for which cumulative estimates of |
A dataframe of cumulative estimates. The first column–whose name matches predictor_variable
–gives
describes the values of predictor_variable
for which a given estimate was computed.
The other columns of the result include the following:
outcome |
The name of the variable for which estimates are computed |
outcome_category |
For a categorical variable, the category of that variable |
estimate |
The estimated mean or proportion. |
std_error |
The estimated standard error |
respondent_sample_size |
The number of cases used to produce the estimate (excluding missing values) |
See Maitland et al. (2017) for an example of a level-of-effort analysis based on this method.
Maitland, A. et al. (2017). A Nonresponse Bias Analysis of the Health Information National Trends Survey (HINTS). Journal of Health Communication 22, 545-553. doi:10.1080/10810730.2017.1324539
# Create an example survey design # with a variable representing number of contact attempts library(survey) data(involvement_survey_srs, package = "nrba") survey_design <- svydesign( weights = ~BASE_WEIGHT, id = ~UNIQUE_ID, fpc = ~N_STUDENTS, data = involvement_survey_srs ) # Cumulative estimates from respondents for average student age ---- get_cumulative_estimates( survey_design = survey_design |> subset(RESPONSE_STATUS == "Respondent"), y_var = "STUDENT_AGE", y_var_type = "numeric", predictor_variable = "CONTACT_ATTEMPTS" ) # Cumulative estimates from respondents for proportions of categorical variable ---- get_cumulative_estimates( survey_design = survey_design |> subset(RESPONSE_STATUS == "Respondent"), y_var = "WHETHER_PARENT_AGREES", y_var_type = "categorical", predictor_variable = "CONTACT_ATTEMPTS" )
# Create an example survey design # with a variable representing number of contact attempts library(survey) data(involvement_survey_srs, package = "nrba") survey_design <- svydesign( weights = ~BASE_WEIGHT, id = ~UNIQUE_ID, fpc = ~N_STUDENTS, data = involvement_survey_srs ) # Cumulative estimates from respondents for average student age ---- get_cumulative_estimates( survey_design = survey_design |> subset(RESPONSE_STATUS == "Respondent"), y_var = "STUDENT_AGE", y_var_type = "numeric", predictor_variable = "CONTACT_ATTEMPTS" ) # Cumulative estimates from respondents for proportions of categorical variable ---- get_cumulative_estimates( survey_design = survey_design |> subset(RESPONSE_STATUS == "Respondent"), y_var = "WHETHER_PARENT_AGREES", y_var_type = "categorical", predictor_variable = "CONTACT_ATTEMPTS" )
An example dataset describing a population of 20,000 students with disabilities in 20 school districts. This population is the basis for selecting a sample of students for a parent involvement survey.
involvement_survey_pop
involvement_survey_pop
A data frame with 20,000 rows and 9 variables
A unique identifier for students
A unique identifier for school districts
A unique identifier for schools, nested within districts
Student's grade level: 'PK', 'K', 1-12
Student's age, measured in years
Code for student's disability category (e.g. 'VI' for 'Visual Impairments')
Student's disability category (e.g. 'Visual Impairments')
'Female' or 'Male'
Seven-level code with descriptive label (e.g. 'AS7 (Asian)')
involvement_survey_pop
involvement_survey_pop
An example dataset describing a simple random sample of 5,000 parents
of students with disabilities, from a population of 20,000.
The parent involvement survey measures a single key outcome:
whether "parents perceive that schools facilitate parent involvement
as a means of improving services and results for children with disabilities."
The variable BASE_WEIGHT
provides the base sampling weight.
The variable N_STUDENTS_IN_SCHOOL
can be used to provide a finite population correction
for variance estimation.
involvement_survey_srs
involvement_survey_srs
A data frame with 5,000 rows and 17 variables
A unique identifier for students
Survey response/eligibility status: 'Respondent', 'Nonrespondent', 'Ineligble', 'Unknown'
Parent agreement ('AGREE' or 'DISAGREE') for whether they perceive that schools facilitate parent involvement
A unique identifier for school districts
A unique identifier for schools, nested within districts
Student's grade level: 'PK', 'K', 1-12
Student's age, measured in years
Code for student's disability category (e.g. 'VI' for 'Visual Impairments')
Student's disability category (e.g. 'Visual Impairments')
'Female' or 'Male'
Seven-level code with descriptive label (e.g. 'AS7 (Asian)')
Whether parent has an e-mail address ('Has Email' vs 'No Email')
Population benchmark for category of PARENT_HAS_EMAIL
Population benchmark for category of STUDENT_RACE
Sampling weight to use for weighted estimates
Total number of students in the population
The number of contact attempts made for each case (ranges between 1 and 6)
involvement_survey_srs
involvement_survey_srs
An example dataset describing a stratified, multistage sample of 1,000 parents
of students with disabilities, from a population of 20,000.
The parent involvement survey measures a single key outcome:
whether "parents perceive that schools facilitate parent involvement
as a means of improving services and results for children with disabilities."
The sample was selected by sampling 5 schools from each of 20 districts,
and then sampling parents of 10 children in each sampled school.
The variable BASE_WEIGHT
provides the base sampling weight.
The variable SCHOOL_DISTRICT
was used for stratification,
and the variables SCHOOL_ID
and UNIQUE_ID
uniquely identify
the first and second stage sampling units (schools and parents).
The variables N_SCHOOLS_IN_DISTRICT
and N_STUDENTS_IN_SCHOOL
can be used to provide finite population corrections.
involvement_survey_str2s
involvement_survey_str2s
A data frame with 5,000 rows and 18 variables
A unique identifier for students
Survey response/eligibility status: 'Respondent', 'Nonrespondent', 'Ineligble', 'Unknown'
Parent agreement ('AGREE' or 'DISAGREE') for whether they perceive that schools facilitate parent involvement
A unique identifier for school districts
A unique identifier for schools, nested within districts
Student's grade level: 'PK', 'K', 1-12
Student's age, measured in years
Code for student's disability category (e.g. 'VI' for 'Visual Impairments')
Student's disability category (e.g. 'Visual Impairments')
'Female' or 'Male'
Seven-level code with descriptive label (e.g. 'AS7 (Asian)')
Whether parent has an e-mail address ('Has Email' vs 'No Email')
Population benchmark for category of PARENT_HAS_EMAIL
Population benchmark for category of STUDENT_RACE
Total number of schools in each district
Total number of students in each school
Sampling weight to use for weighted estimates
The number of contact attempts made for each case (ranges between 1 and 6)
# Load the data involvement_survey_str2s # Prepare the data for analysis with the 'survey' package library(survey) involvement_survey <- svydesign( data = involvement_survey_str2s, weights = ~ BASE_WEIGHT, strata = ~ SCHOOL_DISTRICT, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL )
# Load the data involvement_survey_str2s # Prepare the data for analysis with the 'survey' package library(survey) involvement_survey <- svydesign( data = involvement_survey_str2s, weights = ~ BASE_WEIGHT, strata = ~ SCHOOL_DISTRICT, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL )
A regression model is fit to the sample data to
predict outcomes measured by a survey.
This model can be used to identify auxiliary variables that are
predictive of survey outcomes and hence are potentially useful
for nonresponse bias analysis or weighting adjustments.
Only data from survey respondents will be used to fit the model,
since survey outcomes are only measured among respondents.
The function returns a summary of the model, including overall tests
for each variable of whether that variable improves the model's
ability to predict response status in the population of interest (not just in the random sample at hand).
predict_outcome_via_glm( survey_design, outcome_variable, outcome_type = "continuous", outcome_to_predict = NULL, numeric_predictors = NULL, categorical_predictors = NULL, model_selection = "main-effects", selection_controls = list(alpha_enter = 0.5, alpha_remain = 0.5, max_iterations = 100L) )
predict_outcome_via_glm( survey_design, outcome_variable, outcome_type = "continuous", outcome_to_predict = NULL, numeric_predictors = NULL, categorical_predictors = NULL, model_selection = "main-effects", selection_controls = list(alpha_enter = 0.5, alpha_remain = 0.5, max_iterations = 100L) )
survey_design |
A survey design object created with the |
outcome_variable |
Name of an outcome variable to use as the dependent variable in the model
The value of this variable is expected to be |
outcome_type |
Either |
outcome_to_predict |
Only required if |
numeric_predictors |
A list of names of numeric auxiliary variables to use for predicting response status. |
categorical_predictors |
A list of names of categorical auxiliary variables to use for predicting response status. |
model_selection |
A character string specifying how to select a model.
The default and recommended method is 'main-effects', which simply includes main effects
for each of the predictor variables. |
selection_controls |
Only required if |
See Lumley and Scott (2017) for details of how regression models are fit to survey data.
For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted
(see section 4 of Lumley and Scott (2017) for statistical details)
using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint")
from the 'survey' package.
If the user specifies model_selection = "stepwise"
, a regression model
is selected by adding and removing variables based on the p-value from a
likelihood ratio rest. At each stage, a single variable is added to the model if
the p-value of the likelihood ratio test from adding the variable is below alpha_enter
and its p-value is less than that of all other variables not already in the model.
Next, of the variables already in the model, the variable with the largest p-value
is dropped if its p-value is greater than alpha_remain
. This iterative process
continues until a maximum number of iterations is reached or until
either all variables have been added to the model or there are no unadded variables
for which the likelihood ratio test has a p-value below alpha_enter
.
A data frame summarizing the fitted regression model.
Each row in the data frame represents a coefficient in the model.
The column variable
describes the underlying variable
for the coefficient. For categorical variables, the column variable_category
indicates the particular category of that variable for which a coefficient is estimated.
The columns estimated_coefficient
, se_coefficient
,
conf_intrvl_lower
, conf_intrvl_upper
, and p_value_coefficient
are summary statistics for the estimated coefficient. Note that p_value_coefficient
is based on the Wald t-test for the coefficient.
The column variable_level_p_value
gives the p-value of the
Rao-Scott Likelihood Ratio Test for including the variable in the model.
This likelihood ratio test has its test statistic given by the column
LRT_chisq_statistic
, and the reference distribution
for this test is a linear combination of p
F-distributions
with numerator degrees of freedom given by LRT_df_numerator
and denominator degrees of freedom given by LRT_df_denominator
,
where p
is the number of coefficients in the model corresponding to
the variable being tested.
Lumley, T., & Scott A. (2017). Fitting Regression Models to Survey Data. Statistical Science 32 (2) 265 - 278. https://doi.org/10.1214/16-STS605
library(survey) # Create a survey design ---- data(involvement_survey_str2s, package = "nrba") survey_design <- svydesign( weights = ~BASE_WEIGHT, strata = ~SCHOOL_DISTRICT, id = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL, data = involvement_survey_str2s ) predict_outcome_via_glm( survey_design = survey_design, outcome_variable = "WHETHER_PARENT_AGREES", outcome_type = "binary", outcome_to_predict = "AGREE", model_selection = "main-effects", numeric_predictors = c("STUDENT_AGE"), categorical_predictors = c("STUDENT_DISABILITY_CATEGORY", "PARENT_HAS_EMAIL") )
library(survey) # Create a survey design ---- data(involvement_survey_str2s, package = "nrba") survey_design <- svydesign( weights = ~BASE_WEIGHT, strata = ~SCHOOL_DISTRICT, id = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL, data = involvement_survey_str2s ) predict_outcome_via_glm( survey_design = survey_design, outcome_variable = "WHETHER_PARENT_AGREES", outcome_type = "binary", outcome_to_predict = "AGREE", model_selection = "main-effects", numeric_predictors = c("STUDENT_AGE"), categorical_predictors = c("STUDENT_DISABILITY_CATEGORY", "PARENT_HAS_EMAIL") )
A logistic regression model is fit to the sample data to
predict whether an individual responds to the survey (i.e. is an eligible respondent)
rather than a nonrespondent. Ineligible cases and cases with unknown eligibility status
are not included in this model.
The function returns a summary of the model, including overall tests
for each variable of whether that variable improves the model's
ability to predict response status in the population of interest (not just in the random sample at hand).
This model can be used to identify auxiliary variables associated with response status and compare multiple auxiliary variables in terms of their ability to predict response status.
predict_response_status_via_glm( survey_design, status, status_codes = c("ER", "EN", "IE", "UE"), numeric_predictors = NULL, categorical_predictors = NULL, model_selection = "main-effects", selection_controls = list(alpha_enter = 0.5, alpha_remain = 0.5, max_iterations = 100L) )
predict_response_status_via_glm( survey_design, status, status_codes = c("ER", "EN", "IE", "UE"), numeric_predictors = NULL, categorical_predictors = NULL, model_selection = "main-effects", selection_controls = list(alpha_enter = 0.5, alpha_remain = 0.5, max_iterations = 100L) )
survey_design |
A survey design object created with the |
status |
A character string giving the name of the variable representing response/eligibility status.
The |
status_codes |
A named vector, with two entries named 'ER' and 'EN'
indicating which values of the |
numeric_predictors |
A list of names of numeric auxiliary variables to use for predicting response status. |
categorical_predictors |
A list of names of categorical auxiliary variables to use for predicting response status. |
model_selection |
A character string specifying how to select a model.
The default and recommended method is 'main-effects', which simply includes main effects
for each of the predictor variables. |
selection_controls |
Only required if |
See Lumley and Scott (2017) for details of how regression models are fit to survey data.
For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted
(see section 4 of Lumley and Scott (2017) for statistical details)
using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint")
from the 'survey' package.
If the user specifies model_selection = "stepwise"
, a regression model
is selected by adding and removing variables based on the p-value from a
likelihood ratio rest. At each stage, a single variable is added to the model if
the p-value of the likelihood ratio test from adding the variable is below alpha_enter
and its p-value is less than that of all other variables not already in the model.
Next, of the variables already in the model, the variable with the largest p-value
is dropped if its p-value is greater than alpha_remain
. This iterative process
continues until a maximum number of iterations is reached or until
either all variables have been added to the model or there are no unadded variables
for which the likelihood ratio test has a p-value below alpha_enter
.
A data frame summarizing the fitted logistic regression model.
Each row in the data frame represents a coefficient in the model.
The column variable
describes the underlying variable
for the coefficient. For categorical variables, the column variable_category
indicates
the particular category of that variable for which a coefficient is estimated.
The columns estimated_coefficient
, se_coefficient
, conf_intrvl_lower
, conf_intrvl_upper
,
and p_value_coefficient
are summary statistics for
the estimated coefficient. Note that p_value_coefficient
is based on the Wald t-test for the coefficient.
The column variable_level_p_value
gives the p-value of the
Rao-Scott Likelihood Ratio Test for including the variable in the model.
This likelihood ratio test has its test statistic given by the column
LRT_chisq_statistic
, and the reference distribution
for this test is a linear combination of p
F-distributions
with numerator degrees of freedom given by LRT_df_numerator
and
denominator degrees of freedom given by LRT_df_denominator
,
where p
is the number of coefficients in the model corresponding to
the variable being tested.
Lumley, T., & Scott A. (2017). Fitting Regression Models to Survey Data. Statistical Science 32 (2) 265 - 278. https://doi.org/10.1214/16-STS605
library(survey) # Create a survey design ---- data(involvement_survey_str2s, package = "nrba") survey_design <- survey_design <- svydesign( data = involvement_survey_str2s, weights = ~BASE_WEIGHT, strata = ~SCHOOL_DISTRICT, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL ) predict_response_status_via_glm( survey_design = survey_design, status = "RESPONSE_STATUS", status_codes = c( "ER" = "Respondent", "EN" = "Nonrespondent", "IE" = "Ineligible", "UE" = "Unknown" ), model_selection = "main-effects", numeric_predictors = c("STUDENT_AGE"), categorical_predictors = c("PARENT_HAS_EMAIL", "STUDENT_GRADE") )
library(survey) # Create a survey design ---- data(involvement_survey_str2s, package = "nrba") survey_design <- survey_design <- svydesign( data = involvement_survey_str2s, weights = ~BASE_WEIGHT, strata = ~SCHOOL_DISTRICT, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL ) predict_response_status_via_glm( survey_design = survey_design, status = "RESPONSE_STATUS", status_codes = c( "ER" = "Respondent", "EN" = "Nonrespondent", "IE" = "Ineligible", "UE" = "Unknown" ), model_selection = "main-effects", numeric_predictors = c("STUDENT_AGE"), categorical_predictors = c("PARENT_HAS_EMAIL", "STUDENT_GRADE") )
Adjusts weights in the data to ensure that estimated population totals for grouping variables match known population benchmarks. If there is only one grouping variable, simple post-stratification is used. If there are multiple grouping variables, raking (also known as iterative post-stratification) is used.
rake_to_benchmarks( survey_design, group_vars, group_benchmark_vars, max_iterations = 100, epsilon = 5e-06 )
rake_to_benchmarks( survey_design, group_vars, group_benchmark_vars, max_iterations = 100, epsilon = 5e-06 )
survey_design |
A survey design object created with the |
group_vars |
Names of grouping variables in the data dividing the sample into groups for which benchmark data are available. These variables cannot have any missing values |
group_benchmark_vars |
Names of group benchmark variables in the data corresponding to |
max_iterations |
If there are multiple grouping variables,
then raking is used rather than post-stratification.
The parameter |
epsilon |
If raking is used, convergence for a given margin is declared
if the maximum change in a re-weighted total is less than |
Raking adjusts the weight assigned to each sample member
so that, after reweighting, the weighted sample percentages for population subgroups
match their known population percentages. In a sense, raking causes
the sample to more closely resemble the population in terms of variables
for which population sizes are known.
Raking can be useful to reduce nonresponse bias caused by
having groups which are overrepresented in the responding sample
relative to their population size.
If the population subgroups systematically differ in terms of outcome variables of interest,
then raking can also be helpful in terms of reduce sampling variances. However,
when population subgroups do not differ in terms of outcome variables of interest,
then raking may increase sampling variances.
There are two basic requirements for raking.
Basic Requirement 1 - Values of the grouping variable(s) must be known for all respondents.
Basic Requirement 2 - The population size of each group must be known (or precisely estimated).
When there is effectively only one grouping variable (though this variable can be defined as a combination of other variables), raking amounts to simple post-stratification. For example, simple post-stratification would be used if the grouping variable is "Age x Sex x Race", and the population size of each combination of age, sex, and race is known. The method of "iterative poststratification" (also known as "iterative proportional fitting") is used when there are multiple grouping variables, and population sizes are known for each grouping variable but not for combinations of grouping variables. For example, iterative proportional fitting would be necessary if population sizes are known for age groups and for gender categories but not for combinations of age groups and gender categories.
A survey design object with raked or post-stratified weights
# Load the survey data data(involvement_survey_srs, package = "nrba") # Calculate population benchmarks population_benchmarks <- list( "PARENT_HAS_EMAIL" = data.frame( PARENT_HAS_EMAIL = c("Has Email", "No Email"), PARENT_HAS_EMAIL_POP_BENCHMARK = c(17036, 2964) ), "STUDENT_RACE" = data.frame( STUDENT_RACE = c( "AM7 (American Indian or Alaska Native)", "AS7 (Asian)", "BL7 (Black or African American)", "HI7 (Hispanic or Latino Ethnicity)", "MU7 (Two or More Races)", "PI7 (Native Hawaiian or Other Pacific Islander)", "WH7 (White)" ), STUDENT_RACE_POP_BENCHMARK = c(206, 258, 3227, 1097, 595, 153, 14464) ) ) # Add the population benchmarks as variables in the data involvement_survey_srs <- merge( x = involvement_survey_srs, y = population_benchmarks$PARENT_HAS_EMAIL, by = "PARENT_HAS_EMAIL" ) involvement_survey_srs <- merge( x = involvement_survey_srs, y = population_benchmarks$STUDENT_RACE, by = "STUDENT_RACE" ) # Create a survey design object library(survey) survey_design <- svydesign( weights = ~BASE_WEIGHT, id = ~UNIQUE_ID, fpc = ~N_STUDENTS, data = involvement_survey_srs ) # Subset data to only include respondents survey_respondents <- subset( survey_design, RESPONSE_STATUS == "Respondent" ) # Rake to the benchmarks raked_survey_design <- rake_to_benchmarks( survey_design = survey_respondents, group_vars = c("PARENT_HAS_EMAIL", "STUDENT_RACE"), group_benchmark_vars = c( "PARENT_HAS_EMAIL_POP_BENCHMARK", "STUDENT_RACE_POP_BENCHMARK" ), ) # Inspect estimates from respondents, before and after raking svymean( x = ~PARENT_HAS_EMAIL, design = survey_respondents ) svymean( x = ~PARENT_HAS_EMAIL, design = raked_survey_design ) svymean( x = ~WHETHER_PARENT_AGREES, design = survey_respondents ) svymean( x = ~WHETHER_PARENT_AGREES, design = raked_survey_design )
# Load the survey data data(involvement_survey_srs, package = "nrba") # Calculate population benchmarks population_benchmarks <- list( "PARENT_HAS_EMAIL" = data.frame( PARENT_HAS_EMAIL = c("Has Email", "No Email"), PARENT_HAS_EMAIL_POP_BENCHMARK = c(17036, 2964) ), "STUDENT_RACE" = data.frame( STUDENT_RACE = c( "AM7 (American Indian or Alaska Native)", "AS7 (Asian)", "BL7 (Black or African American)", "HI7 (Hispanic or Latino Ethnicity)", "MU7 (Two or More Races)", "PI7 (Native Hawaiian or Other Pacific Islander)", "WH7 (White)" ), STUDENT_RACE_POP_BENCHMARK = c(206, 258, 3227, 1097, 595, 153, 14464) ) ) # Add the population benchmarks as variables in the data involvement_survey_srs <- merge( x = involvement_survey_srs, y = population_benchmarks$PARENT_HAS_EMAIL, by = "PARENT_HAS_EMAIL" ) involvement_survey_srs <- merge( x = involvement_survey_srs, y = population_benchmarks$STUDENT_RACE, by = "STUDENT_RACE" ) # Create a survey design object library(survey) survey_design <- svydesign( weights = ~BASE_WEIGHT, id = ~UNIQUE_ID, fpc = ~N_STUDENTS, data = involvement_survey_srs ) # Subset data to only include respondents survey_respondents <- subset( survey_design, RESPONSE_STATUS == "Respondent" ) # Rake to the benchmarks raked_survey_design <- rake_to_benchmarks( survey_design = survey_respondents, group_vars = c("PARENT_HAS_EMAIL", "STUDENT_RACE"), group_benchmark_vars = c( "PARENT_HAS_EMAIL_POP_BENCHMARK", "STUDENT_RACE_POP_BENCHMARK" ), ) # Inspect estimates from respondents, before and after raking svymean( x = ~PARENT_HAS_EMAIL, design = survey_respondents ) svymean( x = ~PARENT_HAS_EMAIL, design = raked_survey_design ) svymean( x = ~WHETHER_PARENT_AGREES, design = survey_respondents ) svymean( x = ~WHETHER_PARENT_AGREES, design = raked_survey_design )
A regression model is selected by iteratively adding and removing variables based on the p-value from a
likelihood ratio rest. At each stage, a single variable is added to the model if
the p-value of the likelihood ratio test from adding the variable is below alpha_enter
and its p-value is less than that of all other variables not already in the model.
Next, of the variables already in the model, the variable with the largest p-value
is dropped if its p-value is greater than alpha_remain
. This iterative process
continues until a maximum number of iterations is reached or until
either all variables have been added to the model or there are no variables
not yet in the model whose likelihood ratio test has a p-value below alpha_enter
.
Stepwise model selection generally invalidates inferential statistics
such as p-values, standard errors, or confidence intervals and leads to
overestimation of the size of coefficients for variables included in the selected model.
This bias increases as the value of alpha_enter
or alpha_remain
decreases.
The use of stepwise model selection should be limited only to
reducing a large list of candidate variables for nonresponse adjustment.
stepwise_model_selection( survey_design, outcome_variable, predictor_variables, model_type = "binary-logistic", max_iterations = 100L, alpha_enter = 0.5, alpha_remain = 0.5 )
stepwise_model_selection( survey_design, outcome_variable, predictor_variables, model_type = "binary-logistic", max_iterations = 100L, alpha_enter = 0.5, alpha_remain = 0.5 )
survey_design |
A survey design object created with the |
outcome_variable |
The name of an outcome variable to use as the dependent variable. |
predictor_variables |
A list of names of variables to consider as predictors for the model. |
model_type |
A character string describing the type of model to fit.
|
max_iterations |
Maximum number of iterations to try adding new variables to the model. |
alpha_enter |
The maximum p-value allowed for a variable to be added to the model. Large values such as 0.5 or greater are recommended to reduce the bias of estimates from the selected model. |
alpha_remain |
The maximum p-value allowed for a variable to remain in the model. Large values such as 0.5 or greater are recommended to reduce the bias of estimates from the selected model. |
See Lumley and Scott (2017) for details of how regression models are fit to survey data.
For overall tests of variables, a Rao-Scott Likelihood Ratio Test is conducted
(see section 4 of Lumley and Scott (2017) for statistical details)
using the function regTermTest(method = "LRT", lrt.approximation = "saddlepoint")
from the 'survey' package.
See Sauerbrei et al. (2020) for a discussion of statistical issues with using stepwise model selection.
An object of class svyglm
representing
a regression model fit using the 'survey' package.
Lumley, T., & Scott A. (2017). Fitting Regression Models to Survey Data. Statistical Science 32 (2) 265 - 278. https://doi.org/10.1214/16-STS605
Sauerbrei, W., Perperoglou, A., Schmid, M. et al. (2020). State of the art in selection of variables and functional forms in multivariable analysis - outstanding issues. Diagnostic and Prognostic Research 4, 3. https://doi.org/10.1186/s41512-020-00074-3
library(survey) # Load example data and prepare it for analysis data(involvement_survey_str2s, package = 'nrba') involvement_survey <- svydesign( data = involvement_survey_str2s, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL, strata = ~ SCHOOL_DISTRICT, weights = ~ BASE_WEIGHT ) involvement_survey <- involvement_survey |> transform(WHETHER_PARENT_AGREES = factor(WHETHER_PARENT_AGREES)) # Fit a regression model using stepwise selection selected_model <- stepwise_model_selection( survey_design = involvement_survey, outcome_variable = "WHETHER_PARENT_AGREES", predictor_variables = c("STUDENT_RACE", "STUDENT_DISABILITY_CATEGORY"), model_type = "binary-logistic", max_iterations = 100, alpha_enter = 0.5, alpha_remain = 0.5 )
library(survey) # Load example data and prepare it for analysis data(involvement_survey_str2s, package = 'nrba') involvement_survey <- svydesign( data = involvement_survey_str2s, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL, strata = ~ SCHOOL_DISTRICT, weights = ~ BASE_WEIGHT ) involvement_survey <- involvement_survey |> transform(WHETHER_PARENT_AGREES = factor(WHETHER_PARENT_AGREES)) # Fit a regression model using stepwise selection selected_model <- stepwise_model_selection( survey_design = involvement_survey, outcome_variable = "WHETHER_PARENT_AGREES", predictor_variables = c("STUDENT_RACE", "STUDENT_DISABILITY_CATEGORY"), model_type = "binary-logistic", max_iterations = 100, alpha_enter = 0.5, alpha_remain = 0.5 )
The function t_test_resp_vs_full
tests whether means of auxiliary variables differ between respondents and the full selected sample,
where the full sample consists of all cases regardless of response status or eligibility status.
The function t_test_resp_vs_elig
tests whether means differ between the responding sample and the eligible sample,
where the eligible sample consists of all cases known to be eligible, regardless of response status.
See Lohr and Riddles (2016) for the statistical theory of this test.
t_test_resp_vs_full( survey_design, y_vars, na.rm = TRUE, status, status_codes = c("ER", "EN", "IE", "UE"), null_difference = 0, alternative = "unequal", degrees_of_freedom = survey::degf(survey_design) - 1 ) t_test_resp_vs_elig( survey_design, y_vars, na.rm = TRUE, status, status_codes = c("ER", "EN", "IE", "UE"), null_difference = 0, alternative = "unequal", degrees_of_freedom = survey::degf(survey_design) - 1 )
t_test_resp_vs_full( survey_design, y_vars, na.rm = TRUE, status, status_codes = c("ER", "EN", "IE", "UE"), null_difference = 0, alternative = "unequal", degrees_of_freedom = survey::degf(survey_design) - 1 ) t_test_resp_vs_elig( survey_design, y_vars, na.rm = TRUE, status, status_codes = c("ER", "EN", "IE", "UE"), null_difference = 0, alternative = "unequal", degrees_of_freedom = survey::degf(survey_design) - 1 )
survey_design |
A survey design object created with the |
y_vars |
Names of dependent variables for tests. For categorical variables, percentages of each category are tested. |
na.rm |
Whether to drop cases with missing values for a given dependent variable. |
status |
The name of the variable representing response/eligibility status. |
status_codes |
A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
null_difference |
The difference between the two means under the null hypothesis. Default is |
alternative |
Can be one of the following:
|
degrees_of_freedom |
The degrees of freedom to use for the test's reference distribution.
Unless specified otherwise, the default is the design degrees of freedom minus one,
where the design degrees of freedom are estimated using the |
A data frame describing the results of the t-tests, one row per dependent variable.
The t-statistic used for the test has as its numerator the difference in means between the two samples, minus the null_difference
.
The denominator for the t-statistic is the estimated standard error of the difference in means.
Because the two means are based on overlapping groups and thus have correlated sampling errors, special care is taken to estimate the covariance of the two estimates.
For designs which use sets of replicate weights for variance estimation, the two means and their difference are estimated using each set of replicate weights;
the estimated differences from the sets of replicate weights are then used to estimate sampling error with a formula appropriate to the replication method (JKn, BRR, etc.).
For designs which use linearization methods for variance estimation, the covariance between the two means is estimated using the method of linearization based on influence functions implemented in the survey
package.
See Osier (2009) for an overview of the method of linearization based on influence functions.
Eckman et al. (2023) showed in a simulation study that linearization and replication
performed similarly in estimating the variance of a difference in means for overlapping samples.
Unless specified otherwise using the degrees_of_freedom
parameter, the degrees of freedom for the test are set to the design degrees of freedom minus one.
Design degrees of freedom are estimated using the survey
package's degf
method.
See Lohr and Riddles (2016) for the statistical details of this test.
See Van de Kerckhove et al. (2009) and Amaya and Presser (2017)
for examples of a nonresponse bias analysis which uses t-tests to compare responding samples to eligible samples.
Amaya, A., Presser, S. (2017). Nonresponse Bias for Univariate and Multivariate Estimates of Social Activities and Roles. Public Opinion Quarterly, Volume 81, Issue 1, 1 March 2017, Pages 1–36, https://doi.org/10.1093/poq/nfw037
Eckman, S., Unangst, J., Dever, J., Antoun, A. (2023). The Precision of Estimates of Nonresponse Bias in Means. Journal of Survey Statistics and Methodology, 11(4), 758-783. https://doi.org/10.1093/jssam/smac019
Lohr, S., Riddles, M. (2016). Tests for Evaluating Nonresponse Bias in Surveys. Survey Methodology 42(2): 195-218. https://www150.statcan.gc.ca/n1/pub/12-001-x/2016002/article/14677-eng.pdf
Osier, G. (2009). Variance estimation for complex indicators of poverty and inequality using linearization techniques. Survey Research Methods, 3(3), 167-195. https://doi.org/10.18148/srm/2009.v3i3.369
Van de Kerckhove, W., Krenzke, T., and Mohadjer, L. (2009). Adult Literacy and Lifeskills Survey (ALL) 2003: U.S. Nonresponse Bias Analysis (NCES 2009-063). National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Washington, DC.
library(survey) # Create a survey design ---- data(involvement_survey_srs, package = 'nrba') survey_design <- svydesign(weights = ~ BASE_WEIGHT, id = ~ UNIQUE_ID, fpc = ~ N_STUDENTS, data = involvement_survey_srs) # Compare respondents' mean to the full sample mean ---- t_test_resp_vs_full(survey_design = survey_design, y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"), status = 'RESPONSE_STATUS', status_codes = c('ER' = "Respondent", 'EN' = "Nonrespondent", 'IE' = "Ineligible", 'UE' = "Unknown")) # Compare respondents' mean to the mean of all eligible cases ---- t_test_resp_vs_full(survey_design = survey_design, y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"), status = 'RESPONSE_STATUS', status_codes = c('ER' = "Respondent", 'EN' = "Nonrespondent", 'IE' = "Ineligible", 'UE' = "Unknown")) # One-sided tests ---- ## Null Hypothesis: Y_bar_resp - Y_bar_full <= 0.1 ## Alt. Hypothesis: Y_bar_resp - Y_bar_full > 0.1 t_test_resp_vs_full(survey_design = survey_design, y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"), status = 'RESPONSE_STATUS', status_codes = c('ER' = "Respondent", 'EN' = "Nonrespondent", 'IE' = "Ineligible", 'UE' = "Unknown"), null_difference = 0.1, alternative = 'greater') ## Null Hypothesis: Y_bar_resp - Y_bar_full >= 0.1 ## Alt. Hypothesis: Y_bar_resp - Y_bar_full < 0.1 t_test_resp_vs_full(survey_design = survey_design, y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"), status = 'RESPONSE_STATUS', status_codes = c('ER' = "Respondent", 'EN' = "Nonrespondent", 'IE' = "Ineligible", 'UE' = "Unknown"), null_difference = 0.1, alternative = 'less')
library(survey) # Create a survey design ---- data(involvement_survey_srs, package = 'nrba') survey_design <- svydesign(weights = ~ BASE_WEIGHT, id = ~ UNIQUE_ID, fpc = ~ N_STUDENTS, data = involvement_survey_srs) # Compare respondents' mean to the full sample mean ---- t_test_resp_vs_full(survey_design = survey_design, y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"), status = 'RESPONSE_STATUS', status_codes = c('ER' = "Respondent", 'EN' = "Nonrespondent", 'IE' = "Ineligible", 'UE' = "Unknown")) # Compare respondents' mean to the mean of all eligible cases ---- t_test_resp_vs_full(survey_design = survey_design, y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"), status = 'RESPONSE_STATUS', status_codes = c('ER' = "Respondent", 'EN' = "Nonrespondent", 'IE' = "Ineligible", 'UE' = "Unknown")) # One-sided tests ---- ## Null Hypothesis: Y_bar_resp - Y_bar_full <= 0.1 ## Alt. Hypothesis: Y_bar_resp - Y_bar_full > 0.1 t_test_resp_vs_full(survey_design = survey_design, y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"), status = 'RESPONSE_STATUS', status_codes = c('ER' = "Respondent", 'EN' = "Nonrespondent", 'IE' = "Ineligible", 'UE' = "Unknown"), null_difference = 0.1, alternative = 'greater') ## Null Hypothesis: Y_bar_resp - Y_bar_full >= 0.1 ## Alt. Hypothesis: Y_bar_resp - Y_bar_full < 0.1 t_test_resp_vs_full(survey_design = survey_design, y_vars = c("STUDENT_AGE", "WHETHER_PARENT_AGREES"), status = 'RESPONSE_STATUS', status_codes = c('ER' = "Respondent", 'EN' = "Nonrespondent", 'IE' = "Ineligible", 'UE' = "Unknown"), null_difference = 0.1, alternative = 'less')
Tests whether estimates of means/percentages differ systematically between two sets of replicate weights: an original set of weights, and the weights after adjustment (e.g. post-stratification or nonresponse adjustments) and possibly subsetting (e.g. subsetting to only include respondents).
t_test_of_weight_adjustment( orig_design, updated_design, y_vars, na.rm = TRUE, null_difference = 0, alternative = "unequal", degrees_of_freedom = NULL )
t_test_of_weight_adjustment( orig_design, updated_design, y_vars, na.rm = TRUE, null_difference = 0, alternative = "unequal", degrees_of_freedom = NULL )
orig_design |
A replicate design object created with the |
updated_design |
A potentially updated version of |
y_vars |
Names of dependent variables for tests. For categorical variables, percentages of each category are tested. |
na.rm |
Whether to drop cases with missing values for a given dependent variable. |
null_difference |
The difference between the two means/percentages under the null hypothesis. Default is |
alternative |
Can be one of the following:
|
degrees_of_freedom |
The degrees of freedom to use for the test's reference distribution.
Unless specified otherwise, the default is the design degrees of freedom minus one,
where the design degrees of freedom are estimated using the |
A data frame describing the results of the t-tests, one row per dependent variable.
The t-statistic used for the test has as its numerator the difference in means/percentages between the two samples, minus the null_difference
.
The denominator for the t-statistic is the estimated standard error of the difference in means.
Because the two means are based on overlapping groups and thus have correlated sampling errors, special care is taken to estimate the covariance of the two estimates.
For designs which use sets of replicate weights for variance estimation, the two means and their difference are estimated using each set of replicate weights;
the estimated differences from the sets of replicate weights are then used to estimate sampling error with a formula appropriate to the replication method (JKn, BRR, etc.).
This analysis is not implemented for designs which use linearization methods for variance estimation.
Unless specified otherwise using the degrees_of_freedom
parameter, the degrees of freedom for the test are set to the design degrees of freedom minus one.
Design degrees of freedom are estimated using the survey
package's degf
method.
See Van de Kerckhove et al. (2009) for an example of this type of nonresponse bias analysis (among others).
See Lohr and Riddles (2016) for the statistical details of this test.
Lohr, S., Riddles, M. (2016). Tests for Evaluating Nonresponse Bias in Surveys. Survey Methodology 42(2): 195-218. https://www150.statcan.gc.ca/n1/pub/12-001-x/2016002/article/14677-eng.pdf
Van de Kerckhove, W., Krenzke, T., and Mohadjer, L. (2009). Adult Literacy and Lifeskills Survey (ALL) 2003: U.S. Nonresponse Bias Analysis (NCES 2009-063). National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education. Washington, DC.
library(survey) # Create a survey design ---- data(involvement_survey_srs, package = 'nrba') survey_design <- svydesign(weights = ~ BASE_WEIGHT, id = ~ UNIQUE_ID, fpc = ~ N_STUDENTS, data = involvement_survey_srs) # Create replicate weights for the design ---- rep_svy_design <- as.svrepdesign(survey_design, type = "subbootstrap", replicates = 500) # Subset to only respondents (always subset *after* creating replicate weights) rep_svy_respondents <- subset(rep_svy_design, RESPONSE_STATUS == "Respondent") # Apply raking adjustment ---- raked_rep_svy_respondents <- rake_to_benchmarks( survey_design = rep_svy_respondents, group_vars = c("PARENT_HAS_EMAIL", "STUDENT_RACE"), group_benchmark_vars = c("PARENT_HAS_EMAIL_BENCHMARK", "STUDENT_RACE_BENCHMARK"), ) # Compare estimates from respondents in original vs. adjusted design ---- t_test_of_weight_adjustment(orig_design = rep_svy_respondents, updated_design = raked_rep_svy_respondents, y_vars = c('STUDENT_AGE', 'STUDENT_SEX')) t_test_of_weight_adjustment(orig_design = rep_svy_respondents, updated_design = raked_rep_svy_respondents, y_vars = c('WHETHER_PARENT_AGREES')) # Compare estimates to true population values ---- data('involvement_survey_pop', package = 'nrba') mean(involvement_survey_pop$STUDENT_AGE) prop.table(table(involvement_survey_pop$STUDENT_SEX))
library(survey) # Create a survey design ---- data(involvement_survey_srs, package = 'nrba') survey_design <- svydesign(weights = ~ BASE_WEIGHT, id = ~ UNIQUE_ID, fpc = ~ N_STUDENTS, data = involvement_survey_srs) # Create replicate weights for the design ---- rep_svy_design <- as.svrepdesign(survey_design, type = "subbootstrap", replicates = 500) # Subset to only respondents (always subset *after* creating replicate weights) rep_svy_respondents <- subset(rep_svy_design, RESPONSE_STATUS == "Respondent") # Apply raking adjustment ---- raked_rep_svy_respondents <- rake_to_benchmarks( survey_design = rep_svy_respondents, group_vars = c("PARENT_HAS_EMAIL", "STUDENT_RACE"), group_benchmark_vars = c("PARENT_HAS_EMAIL_BENCHMARK", "STUDENT_RACE_BENCHMARK"), ) # Compare estimates from respondents in original vs. adjusted design ---- t_test_of_weight_adjustment(orig_design = rep_svy_respondents, updated_design = raked_rep_svy_respondents, y_vars = c('STUDENT_AGE', 'STUDENT_SEX')) t_test_of_weight_adjustment(orig_design = rep_svy_respondents, updated_design = raked_rep_svy_respondents, y_vars = c('WHETHER_PARENT_AGREES')) # Compare estimates to true population values ---- data('involvement_survey_pop', package = 'nrba') mean(involvement_survey_pop$STUDENT_AGE) prop.table(table(involvement_survey_pop$STUDENT_SEX))
Compare estimated means/percentages from the present survey to external estimates from a benchmark source. A t-test is used to evaluate whether the survey's estimates differ from the external estimates.
t_test_vs_external_estimate( survey_design, y_var, ext_ests, ext_std_errors = NULL, na.rm = TRUE, null_difference = 0, alternative = "unequal", degrees_of_freedom = survey::degf(survey_design) - 1 )
t_test_vs_external_estimate( survey_design, y_var, ext_ests, ext_std_errors = NULL, na.rm = TRUE, null_difference = 0, alternative = "unequal", degrees_of_freedom = survey::degf(survey_design) - 1 )
survey_design |
A survey design object created with the |
y_var |
Name of dependent variable. For categorical variables, percentages of each category are tested. |
ext_ests |
A numeric vector containing the external estimate of the mean for the dependent variable.
If |
ext_std_errors |
(Optional) The standard errors of the external estimates. This is useful if the external data are estimated with an appreciable level of uncertainty, for instance if the external data come from a survey with a small-to-moderate sample size. If supplied, the variance of the difference between the survey and external estimates is estimated by adding the variance of the external estimates to the estimated variance of the survey's estimates. |
na.rm |
Whether to drop cases with missing values for |
null_difference |
The hypothesized difference between the estimate and the external mean. Default is |
alternative |
Can be one of the following:
|
degrees_of_freedom |
The degrees of freedom to use for the test's reference distribution.
Unless specified otherwise, the default is the design degrees of freedom minus one,
where the design degrees of freedom are estimated using the survey package's |
A data frame describing the results of the t-tests, one row per mean being compared.
See Brick and Bose (2001) for an example of this analysis method and a discussion of its limitations.
Brick, M., and Bose, J. (2001). Analysis of Potential Nonresponse Bias. in Proceedings of the Section on Survey Research Methods. Alexandria, VA: American Statistical Association. http://www.asasrms.org/Proceedings/y2001/Proceed/00021.pdf
library(survey) # Create a survey design ---- data("involvement_survey_str2s", package = 'nrba') involvement_survey_sample <- svydesign( data = involvement_survey_str2s, weights = ~ BASE_WEIGHT, strata = ~ SCHOOL_DISTRICT, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL ) # Subset to only include survey respondents ---- involvement_survey_respondents <- subset(involvement_survey_sample, RESPONSE_STATUS == "Respondent") # Test whether percentages of categorical variable differ from benchmark ---- parent_email_benchmark <- c( 'Has Email' = 0.85, 'No Email' = 0.15 ) t_test_vs_external_estimate( survey_design = involvement_survey_respondents, y_var = "PARENT_HAS_EMAIL", ext_ests = parent_email_benchmark ) # Test whether the sample mean differs from the population benchmark ---- average_age_benchmark <- 11 t_test_vs_external_estimate( survey_design = involvement_survey_respondents, y_var = "STUDENT_AGE", ext_ests = average_age_benchmark, null_difference = 0 )
library(survey) # Create a survey design ---- data("involvement_survey_str2s", package = 'nrba') involvement_survey_sample <- svydesign( data = involvement_survey_str2s, weights = ~ BASE_WEIGHT, strata = ~ SCHOOL_DISTRICT, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL ) # Subset to only include survey respondents ---- involvement_survey_respondents <- subset(involvement_survey_sample, RESPONSE_STATUS == "Respondent") # Test whether percentages of categorical variable differ from benchmark ---- parent_email_benchmark <- c( 'Has Email' = 0.85, 'No Email' = 0.15 ) t_test_vs_external_estimate( survey_design = involvement_survey_respondents, y_var = "PARENT_HAS_EMAIL", ext_ests = parent_email_benchmark ) # Test whether the sample mean differs from the population benchmark ---- average_age_benchmark <- 11 t_test_vs_external_estimate( survey_design = involvement_survey_respondents, y_var = "STUDENT_AGE", ext_ests = average_age_benchmark, null_difference = 0 )
Updates weights in a survey design object to adjust for nonresponse and/or unknown eligibility
using the method of weighting class adjustment. For unknown eligibility adjustments, the weight in each class
is set to zero for cases with unknown eligibility, and the weight of all other cases in the class is
increased so that the total weight is unchanged. For nonresponse adjustments, the weight in each class
is set to zero for cases classified as eligible nonrespondents, and the weight of eligible respondent cases
in the class is increased so that the total weight is unchanged.
This function currently only works for survey designs with replicate weights,
since the linearization-based estimators included in the survey
package (or Stata or SAS for that matter)
are unable to fully reflect the impact of nonresponse adjustment.
Adjustments are made to both the full-sample weights and all of the sets of replicate weights.
wt_class_adjust( survey_design, status, status_codes, wt_class = NULL, type = c("UE", "NR") )
wt_class_adjust( survey_design, status, status_codes, wt_class = NULL, type = c("UE", "NR") )
survey_design |
A replicate survey design object created with the |
status |
A character string giving the name of the variable representing response/eligibility status. |
status_codes |
A named vector, with four entries named 'ER', 'EN', 'IE', and 'UE'. |
wt_class |
(Optional) A character string giving the name of the variable which divides sample cases into weighting classes. |
type |
A character vector including one or more of the following options:
|
See the vignette "Nonresponse Adjustments" from the svrep package for a step-by-step walkthrough of
nonresponse weighting adjustments in R: vignette(topic = "nonresponse-adjustments", package = "svrep")
A replicate survey design object, with adjusted full-sample and replicate weights
See Chapter 2 of Heeringa, West, and Berglund (2017) or Chapter 13 of Valliant, Dever, and Kreuter (2018) for an overview of nonresponse adjustment methods based on redistributing weights.
Heeringa, S., West, B., Berglund, P. (2017). Applied Survey Data Analysis, 2nd edition. Boca Raton, FL: CRC Press. "Applied Survey Data Analysis, 2nd edition." Boca Raton, FL: CRC Press.
Valliant, R., Dever, J., Kreuter, F. (2018). "Practical Tools for Designing and Weighting Survey Samples, 2nd edition." New York: Springer.
svrep::redistribute_weights()
, vignette(topic = "nonresponse-adjustments", package = "svrep")
library(survey) # Load an example dataset data("involvement_survey_str2s", package = "nrba") # Create a survey design object involvement_survey_sample <- svydesign( data = involvement_survey_str2s, weights = ~BASE_WEIGHT, strata = ~SCHOOL_DISTRICT, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL ) rep_design <- as.svrepdesign(involvement_survey_sample, type = "mrbbootstrap") # Adjust weights for nonresponse within weighting classes nr_adjusted_design <- wt_class_adjust( survey_design = rep_design, status = "RESPONSE_STATUS", status_codes = c( "ER" = "Respondent", "EN" = "Nonrespondent", "IE" = "Ineligible", "UE" = "Unknown" ), wt_class = "PARENT_HAS_EMAIL", type = "NR" )
library(survey) # Load an example dataset data("involvement_survey_str2s", package = "nrba") # Create a survey design object involvement_survey_sample <- svydesign( data = involvement_survey_str2s, weights = ~BASE_WEIGHT, strata = ~SCHOOL_DISTRICT, ids = ~ SCHOOL_ID + UNIQUE_ID, fpc = ~ N_SCHOOLS_IN_DISTRICT + N_STUDENTS_IN_SCHOOL ) rep_design <- as.svrepdesign(involvement_survey_sample, type = "mrbbootstrap") # Adjust weights for nonresponse within weighting classes nr_adjusted_design <- wt_class_adjust( survey_design = rep_design, status = "RESPONSE_STATUS", status_codes = c( "ER" = "Respondent", "EN" = "Nonrespondent", "IE" = "Ineligible", "UE" = "Unknown" ), wt_class = "PARENT_HAS_EMAIL", type = "NR" )