Raking weights with R

When I was working with public opinion surveys, I usually had to adjust the data according to population parameters such as sex, age, socioeconomic status, or region. There are several ways to do this. One of them is raking. Using the package anesrake by Josh Pasek is easy to compute raking weights in R. I show a brief and straightforward raking example in this post.

What is raking?

It is common to find differences between the population and sample distribution in some key demographic variables when analyzing surveys. These differences might be due to the sampling design, non-coverage issues or non-response. To avoid biases of point estimates, we generally adjust those differences using weights so that the marginal totals agree (at least in part) with the population. These adjustments can be done using different methods: post-stratification or cell-weighting, raking, general regression estimator GREG (Battaglia et al. 2004, Valliant et al. 2013). Here I will show an example for raking adjustment. Raking has some advantages over cell-weighting:

We have to know only the population totals of the specific variables, not all cells of the cross-classification.
It is more flexible feasible when we use several variables at once.

It is important to note that raking adjustments are useful when we want to estimate a proportion or mean from a population, but they are not necessarily the best when modeling (see Gelman 2007). In that case, model specification (i.e., inclusion of all the key variables and interactions) may be a better solution. It is important to note that the raking procedure may affect the dependence structure of variables, what eventually may bias model estimates.

Some rules of thumb

DeBell and Krosnick (2009) provide useful recommendations regarding variable selection when implementing raking:

Identify a set of variable likely to be measured with little error and a low item nonresponse rate (less than 5%), and compare them with reliable data sources (e.g., Census). Some typical variables are age groups, gender, race, SES, census region.
Implement weighting when substantive demography discrepancies are observed. For instance, differences higher than five percentage points. When differences are less than two percentage points this adjustment would not be necessary.
When selected variables have categories with less than 5% in the sample, it would be recommended to collapse them.

Remember, however, that the ultimate goal of raking is to calibrate and improve the efficiency of our estimates (reduce standard errors). In general, there is trade-off between these two objectives. Anyway, weight adjustments should be variable-dependent: they should explore how different key variables of our survey behave after adjusting. In this post I only show how to implement raking. A more complete analysis will be surely necessary.

Once the raking procedure is implemented, it is a good idea to follow these steps in an iterative fashion:

Identify extreme weights and truncate them. For example, those greater than five times the mean of weights.
Truncate large values, but not small ones (i.e., those near zero). While large values increase the effect of outliers and are more likely to inflate variance, low values do not.
Examine the design effect using the new weights. If the new design effect is greater than the old design effect by more than 0.5, it is a good idea to review the raking adjustment. In other words, calibration may cost too much in terms of efficiency (standard errors).

Three things can be done to improve the weighting procedure:

Collapse categories of the variables
Eliminate or replace weighting variables
Increase the weight truncation criterion ( > 5 )

Raking using R

In this example I use data from Chile to estimate the presidential approval (Opinion Public Survey CEP, July-August 2012). Below, you can see the characteristics of the dataset:

dat  <- read.csv("cep.csv")
str(dat)

## 'data.frame':	1512 obs. of  12 variables:
##  $ fnu     : int  1580 1579 890 871 1031 104 1103 263 924 105 ...
##  $ area    : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ region  : int  13 13 9 9 10 2 13 5 9 2 ...
##  $ comuna  : int  13123 13123 9110 9102 10202 2201 13101 5106 9201 2201 ...
##  $ sex     : int  1 1 2 1 1 1 2 1 2 1 ...
##  $ age     : int  30 60 28 72 55 35 33 25 73 43 ...
##  $ ses     : int  2 2 3 3 4 3 2 3 4 2 ...
##  $ phone   : int  1 1 9 2 2 2 2 1 2 1 ...
##  $ edu     : int  4 4 4 1 2 3 4 4 2 4 ...
##  $ pond    : num  1.977 1.243 0.514 0.421 0.526 ...
##  $ agecat  : int  2 5 2 5 5 3 2 2 5 3 ...
##  $ approval: int  2 1 2 1 1 2 2 2 2 2 ...

I use five variables to define raking weights: sex, agecat, ses, region, and area. I get relative frequencies of these variables using the package weights . Most of the variables have more than five percent in each category, with the exception of region and ses. For illustrative purposes, I will use those variables without collapsing their categories.

library(weights)
# sex: 1 male, 2 female
wpct(dat$sex)

##     1     2
## 0.407 0.593

# agecat: 1 18-24, 2 25-34, 3 35-44, 4 45-54, 5 55+
wpct(dat$agecat)

##     1     2     3     4     5
## 0.124 0.159 0.177 0.194 0.346

# ses: 1 abc1, 2 c2, 3 c3, 4 d, 5 e
wpct(dat$ses)

##      1      2      3      4      5
## 0.0390 0.1078 0.3651 0.4484 0.0397

# region: 1 to 15
wpct(dat$region)

##       1       2       3       4       5       6       7       8       9      10      11      12      13      14      15
## 0.01323 0.04233 0.01389 0.04167 0.09921 0.05489 0.06151 0.13095 0.06349 0.04894 0.00397 0.01058 0.37632 0.02579 0.01323

# area: 1 urban, 2 rural
wpct(dat$area)

##     1     2
## 0.837 0.163

The next step is to specify the population distribution of the selected variables in a target list. I use two sources to obtain population values: the Chilean Census 2002, and the Bicentenario Survey 2009. It is hard to find reliable SES population parameters. The Bicentenario data provide, at least, an approximation.

# chilean census 2002
sex <- c(.49,.51)
agecat  <- c(.163, .203, .195, .187, .253)
region <- c(0.015, 0.031, 0.016, 0.039, 0.102, 0.051, 0.059, 0.123,
            0.056, 0.046, 0.006, 0.010, 0.408, 0.023 , 0.013)
area  <-  c(.869, .131)

# bicentenario survey 2009
ses  <- c(.109, .184, .261, .364, .083)

# definitions of target list
targets <- list(sex, agecat,ses, region, area)
# important: to use the same variable names of the dataset
names(targets) <- c("sex", "agecat", "ses", "region", "area")
# id variable
dat$caseid <- 1:length(dat$sex)

The output shows that some distributions do not sum one. This is because of rounding. We can force any proportion distribution to sum one using force1 = TRUE. The differences between the population and the sample distribution are all greater than five percentage points. This meets the variable selection criterion discussed above.

sum(region)

## [1] 0.998

library(anesrake)
anesrakefinder(targets, dat, choosemethod = "total")

##    sex agecat    ses region   area
## 0.1652 0.2017 0.3780 0.0828 0.0634

I apply the anesrake function as follows:

The maximum weight value is five, weights greater than five will be truncated (cap = 5).
The total differences between population and sample have to be greater than 0.05 so that to include a variable (pctlim = .05).
The maximum number of variables included in the raking procedure is five (nlim = 5).

outsave <- anesrake(targets, dat, caseid = dat$caseid,
  verbose= FALSE, cap = 5, choosemethod = "total",
  type = "pctlim", pctlim = .05 , nlim = 5,
  iterate = TRUE , force1 = TRUE)

summary(outsave)

## $convergence
## [1] "Complete convergence was achieved after 48 iterations"
##
## $raking.variables
## [1] "sex"    "agecat" "ses"    "region" "area"
##
## $weight.summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.33    0.62    0.80    1.00    1.10    4.30
##
## $selection.method
## [1] "variable selection conducted using _pctlim_ - discrepancies selected using _total_."
##
## $general.design.effect
## [1] 1.38
##
## $sex
##   Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1   0.49          616        0.407   741  0.49      0.0826            0      0.0826
## 2   0.51          896        0.593   771  0.51     -0.0826            0     -0.0826
##
## $agecat
##   Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1  0.163          187        0.124   246 0.163     0.03916     2.78e-17     0.03916
## 2  0.203          240        0.159   307 0.203     0.04407    -2.78e-17     0.04407
## 3  0.195          268        0.177   295 0.195     0.01756     0.00e+00     0.01756
## 4  0.187          294        0.194   282 0.187    -0.00763    -2.78e-17    -0.00763
## 5  0.253          523        0.346   382 0.253    -0.09315     0.00e+00    -0.09315
##
## $ses
##   Target Unweighted N Unweighted % Wtd N  Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.1089           59       0.0390   165 0.1089      0.0699     1.39e-17      0.0699
## 2 0.1838          163       0.1078   278 0.1838      0.0760     2.78e-17      0.0760
## 3 0.2607          552       0.3651   394 0.2607     -0.1043     5.55e-17     -0.1043
## 4 0.3636          678       0.4484   550 0.3636     -0.0848     5.55e-17     -0.0848
## 5 0.0829           60       0.0397   125 0.0829      0.0432     1.39e-17      0.0432
##
## $region
##     Target Unweighted N Unweighted %  Wtd N   Wtd % Change in % Resid. Disc. Orig. Disc.
## 1  0.01503           20      0.01323  22.73 0.01503    0.001803     0.00e+00    0.001803
## 2  0.03106           64      0.04233  46.97 0.03106   -0.011266     0.00e+00   -0.011266
## 3  0.01603           21      0.01389  24.24 0.01603    0.002143     0.00e+00    0.002143
## 4  0.03908           63      0.04167  59.09 0.03908   -0.002589     0.00e+00   -0.002589
## 5  0.10220          150      0.09921 154.53 0.10220    0.002998    -2.78e-17    0.002998
## 6  0.05110           83      0.05489  77.27 0.05110   -0.003792    -6.94e-18   -0.003792
## 7  0.05912           93      0.06151  89.39 0.05912   -0.002390     0.00e+00   -0.002390
## 8  0.12325          198      0.13095 186.35 0.12325   -0.007706     0.00e+00   -0.007706
## 9  0.05611           96      0.06349  84.84 0.05611   -0.007380     0.00e+00   -0.007380
## 10 0.04609           74      0.04894  69.69 0.04609   -0.002850     0.00e+00   -0.002850
## 11 0.00601            6      0.00397   9.09 0.00601    0.002044     0.00e+00    0.002044
## 12 0.01002           16      0.01058  15.15 0.01002   -0.000562     0.00e+00   -0.000562
## 13 0.40882          569      0.37632 618.13 0.40882    0.032495    -5.55e-17    0.032495
## 14 0.02305           39      0.02579  34.85 0.02305   -0.002748     0.00e+00   -0.002748
## 15 0.01303           20      0.01323  19.70 0.01303   -0.000201    -3.47e-18   -0.000201
##
## $area
##   Target Unweighted N Unweighted % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1  0.869         1266        0.837  1314 0.869      0.0317     1.11e-16      0.0317
## 2  0.131          246        0.163   198 0.131     -0.0317     0.00e+00     -0.0317

The procedure reaches good convergence, weights lower than 5, and very small differences between the sample and population distribution. In addition, the design effect is 1.38. This gives us an idea of the weighting loss due to the raking procedure. The weighting loss ($L_w$) is the inflation in the variance of sample estimates that can be attributed to weighting (Heeringa et. al. 2010). A simple approximation is:

\[L_w(\bar{y}) \approx \frac{\sigma^{2}(w)}{\bar{w}^2} = \Bigg( \frac{\sum\limits_{i=1}^n w_i^2}{\big(\sum\limits_{i=1}^n w_i\big)^2} \cdot n \Bigg) - 1\]

This simple formula was introduced by Kish (1965) in the context of proportionated stratified sampling. It represents the proportional increase in the variances of means and proportions, ignoring any clustering that may be included in the sample plan. It assumes that:

Proportionate allocation is the optimal stratified design, i.e., variances of $y$ are approximately equal in all strata.
Weights are uncorrelated with the values of the random variable $y$.

Thus, values from this formula only represents an approximation to weighting loss provided that model assumptions are met. As can be seen in the anesrake output, the design effect is 1.38. That value is equal to $1 + L_w$.

# add weights to the dataset
dat$weightvec  <- unlist(outsave[1])
n  <- length(dat$region)

# weighting loss
((sum(dat$weightvec ^ 2) / (sum(dat$weightvec)) ^ 2) * n) - 1

## [1] 0.38

For a more precise estimation of the design effect it would be necessary to use a package such as survey, specifying the characteristics of the sample design. Unfortunately, I do not have enough information to specify the sample design and to estimate the design effect of the presidential approval. There are also no cluster variables or replicate weights in the CEP dataset.

Anyway, the weighting loss approximation denotes an increase in the design effect lower than .5, what meets the recommendations mentioned above. However, it is important to keep in mind that this weighting procedure does not take into account any clustering effect associated with the sample design.

Finally, we can estimate the presidential approval with and without weights:

unweighted <-  wpct(dat$approval)
weighted  <-  wpct(dat$approval, dat$weightvec)
tab  <- data.frame(unweighted, weighted)
rownames(tab)  <- c("approve", "disapprove", "unsure", "dk")
tab

##            unweighted weighted
## approve        0.2864   0.2983
## disapprove     0.5119   0.5206
## unsure         0.1779   0.1609
## dk             0.0238   0.0201

The difference between the weighted and unweighted presidential approval is not big, only 0.041 percentage points.

Raking using given survey weights

The CEP dataset includes its own weights (variable named pond). Unfortunately, there is no detailed information about the weighting construction in the methodological documentation of this survey (“Survey weighting is a mess”, Gelman 2007). For illustrative purposes, however, I will show how to estimate raking weights using previous weights.

Exploring the pond variable, we can see that there are some big weights (e.g., 17.6). This makes me think that they are just scaled design weights, probably with some non-response adjustment, but I am not really sure.

summary(dat$pond)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.02    0.46    0.79    1.00    1.24   17.60

The weighting loss associated with the pond weights is relatively high, denoting a design effect of 2.09.

# weighting loss
((sum(dat$pond ^ 2) / (sum(dat$pond)) ^ 2) * n) - 1

## [1] 1.09

We can explore the total differences between the sample and population distributions using pond weights. Only ses and region meet the criterion of total differences greater than 5 percentage points. However, I will keep all the variables used in the previous section and specify that total differences between the population and sample should be greater than .05 (pctlim = .05).

anesrakefinder(targets, dat, weightvec = dat$pond, choosemethod = "total")

##      sex   agecat      ses   region     area
## 0.000405 0.014836 0.317979 0.169910 0.001149

In order to get raking weights using previous weights, I only have to add weightvec = dat$pond to the anesrake command.

outsave <- anesrake(targets, dat, caseid = dat$caseid,
  weightvec = dat$pond,  verbose = FALSE, cap = 5,
  choosemethod = "total", type = "pctlim",
  pctlim = .05 , nlim = 5, iterate = TRUE, force1 = TRUE)

summary(outsave)

## $convergence
## [1] "Complete convergence was achieved after 50 iterations"
##
## $raking.variables
## [1] "ses"    "region"
##
## $weight.summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.01    0.41    0.71    1.00    1.23    5.00
##
## $selection.method
## [1] "variable selection conducted using _pctlim_ - discrepancies selected using _total_."
##
## $general.design.effect
## [1] 1.91
##
## $sex
##   Target Old Weights N Old Weights % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1   0.49           741          0.49   763 0.504      0.0147      -0.0145    0.000202
## 2   0.51           771          0.51   749 0.496     -0.0147       0.0145   -0.000202
##
## $agecat
##   Target Old Weights N Old Weights % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1  0.163           242         0.160   230 0.152    -0.00768      0.01049    0.002815
## 2  0.203           309         0.204   314 0.208     0.00388     -0.00521   -0.001321
## 3  0.195           289         0.191   297 0.197     0.00565     -0.00193    0.003715
## 4  0.187           281         0.186   267 0.176    -0.00956      0.01041    0.000843
## 5  0.253           391         0.259   403 0.267     0.00771     -0.01376   -0.006053
##
## $ses
##   Target Old Weights N Old Weights % Wtd N  Wtd % Change in % Resid. Disc. Orig. Disc.
## 1 0.1089          77.1        0.0510   165 0.1089      0.0579     1.39e-17      0.0579
## 2 0.1838         209.7        0.1387   278 0.1838      0.0451     2.78e-17      0.0451
## 3 0.2607         588.5        0.3892   394 0.2607     -0.1284     5.55e-17     -0.1284
## 4 0.3636         596.3        0.3943   550 0.3636     -0.0307     5.55e-17     -0.0307
## 5 0.0829          40.5        0.0268   125 0.0829      0.0561     1.39e-17      0.0561
##
## $region
##     Target Old Weights N Old Weights %  Wtd N   Wtd % Change in % Resid. Disc. Orig. Disc.
## 1  0.01503         31.70       0.02097  22.73 0.01503    -0.00594     0.00e+00    -0.00594
## 2  0.03106         99.85       0.06604  46.97 0.03106    -0.03497     3.47e-18    -0.03497
## 3  0.01603         19.56       0.01293  24.24 0.01603     0.00310     0.00e+00     0.00310
## 4  0.03908         72.79       0.04814  59.09 0.03908    -0.00906     6.94e-18    -0.00906
## 5  0.10220        206.06       0.13627 154.53 0.10220    -0.03407     0.00e+00    -0.03407
## 6  0.05110         69.77       0.04614  77.27 0.05110     0.00496     6.94e-18     0.00496
## 7  0.05912         74.20       0.04907  89.39 0.05912     0.01005     0.00e+00     0.01005
## 8  0.12325        163.69       0.10826 186.35 0.12325     0.01499     0.00e+00     0.01499
## 9  0.05611         79.51       0.05258  84.84 0.05611     0.00353    -6.94e-18     0.00353
## 10 0.04609         67.87       0.04489  69.69 0.04609     0.00121     0.00e+00     0.00121
## 11 0.00601          5.33       0.00352   9.09 0.00601     0.00249     8.67e-19     0.00249
## 12 0.01002         11.90       0.00787  15.15 0.01002     0.00215     0.00e+00     0.00215
## 13 0.40882        559.46       0.36999 618.13 0.40882     0.03883     0.00e+00     0.03883
## 14 0.02305         28.42       0.01879  34.85 0.02305     0.00425     0.00e+00     0.00425
## 15 0.01303         21.99       0.01454  19.70 0.01303    -0.00151     0.00e+00    -0.00151
##
## $area
##   Target Old Weights N Old Weights % Wtd N Wtd % Change in % Resid. Disc. Orig. Disc.
## 1  0.869          1315          0.87  1296 0.857     -0.0124       0.0119   -0.000575
## 2  0.131           197          0.13   216 0.143      0.0124      -0.0119    0.000575

Only ses and region are used in the raking procedure. Some weights were truncated to 5, and the design effect decreased from 2.09 to 1.91. The differences between the sample and population distribution are not big.

# add weights to the dataset
dat$weightvec2  <- unlist(outsave[1])
pond <- wpct(dat$approval, dat$pond)
weighted_pond <-  wpct(dat$approval, dat$weightvec2)
tab  <- data.frame(unweighted, pond, weighted, weighted_pond)
rownames(tab)  <- c("approve", "disapprove", "unsure", "dk")
tab

##            unweighted   pond weighted weighted_pond
## approve        0.2864 0.2739   0.2983        0.2945
## disapprove     0.5119 0.5231   0.5206        0.5310
## unsure         0.1779 0.1758   0.1609        0.1520
## dk             0.0238 0.0272   0.0201        0.0224

There are no great differences between the estimates weighted and weighted_pond (0.025). There are greater discrepancies between pond and weighted estimates (0.049). This reveals the importance of the socioeconomic status in the presidential approval.

Last Update: 7/30/15

References

Battaglia, M., D. Izrael, D. C. Hoaglin, and M. Frankel. 2004. “Tips and Tricks for Raking Survey Data (Aka Sample Balancing).” American Association of Public Opinion Research.

DeBell, Matthew and Jon A. Krosnick. 2009. “Computing Weights for American National Election Study Survey Data.” Stanford University.

Gelman, Andrew. 2007. “Struggles with Survey Weighting and Regression Modeling.” Statistical Science 22(2):153-64.

Heeringa, Steven, Brady T. West, and Patricia A. Berglund. 2010. Applied Survey Data Analysis. Boca Raton, FL: Chapman & Hall/CRC.

Valliant, Richard, Jill A. Dever, and Frauke Kreuter. 2013. Practical Tools for Designing and Weighting Survey Samples. New York, NY: Springer New York.