R package to compute statistics from the American Community Survey (ACS) and Decennial US Census

The acsr package helps extracting variables and computing statistics using the America Community Survey and Decennial US Census. It was created for the Applied Population Laboratory (APL) at the University of Wisconsin-Madison.

Installation

The functions depend on the acs and data.table packages, so it is necessary to install then before using acsr. The acsr package is hosted on a github repository and can be installed using devtools:

devtools::install_github("sdaza/acsr")
library(acsr)

Remember to set the ACS API key, to check the help documentation and the default values of the acsr functions.

api.key.install(key="*")
?sumacs
?acsdata

The default dataset is acs, the level is state (Wisconsin, state = "WI"), the endyear is 2014, and the confidence level to compute margins of error (MOEs) is 90%.

Levels

The acsr functions can extract all the levels available in the acs package. The table below shows the summary and required levels when using the acsdata and sumacs functions:

summary number	levels
010	us
020	region
030	division
040	state
050	state, county
060	state, county, county.subdivision
140	state, county, tract
150	state, county, tract, block.group
160	state, place
250	american.indian.area
320	state, msa
340	state, csa
350	necta
400	urban.area
500	state, congressional.district
610	state, state.legislative.district.upper
620	state, state.legislative.district.lower
795	state, puma
860	zip.code
950	state, school.district.elementary
960	state, school.district.secondary
970	state, school.district.unified

Getting variables and statistics

We can use the sumacs function to extract variable and statistics. We have to specify the corresponding method (e.g., proportion or just variable), and the name of the statistic or variable to be included in the output.

sumacs(formula = c("(b16004_004 + b16004_026 + b16004_048 / b16004_001)", "b16004_026"),
        varname = c("mynewvar", "myvar"),
        method = c("prop", "variable"),
        level = c("division"))

## [1] "Extracting data from: acs 2014"
## [1] ". . . . . .  ACS/Census variables : 4"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 2"
## [1] ". . . . . .  Getting division data"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  50%"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"

##    sumlevel geoid division mynewvar_est mynewvar_moe myvar_est myvar_moe
## 1:      030    NA        1       0.0762     0.000347    770306      3490
## 2:      030    NA        2       0.1182     0.000278   3332150      9171
## 3:      030    NA        3       0.0599     0.000196   1819417      7209
## 4:      030    NA        4       0.0411     0.000277    547577      4461
## 5:      030    NA        5       0.1108     0.000246   4526480     11869
## 6:      030    NA        6       0.0320     0.000265    402475      3781
## 7:      030    NA        7       0.2203     0.000469   5318126     13044
## 8:      030    NA        8       0.1582     0.000602   2279303     10746
## 9:      030    NA        9       0.2335     0.000501   7765838     20289

To download the data can be slow, especially when many levels are being used (e.g., blockgroup). A better approach in those cases is, first, download the data using the function acsdata , and then use them as input.

mydata <- acsdata(formula = c("(b16004_004 + b16004_026 + b16004_048 /  b16004_001)",
        "b16004_026"),
        level = c("division"))

## [1] ". . . . . .  Getting division data"

sumacs(formula = c("(b16004_004 + b16004_026 + b16004_048 / b16004_001)", "b16004_026"),
        varname = c("mynewvar", "myvar"),
        method = c("prop", "variable"),
        level = c("division"),
        data = mydata)

## [1] "Extracting data from: acs 2014"
## [1] ". . . . . .  ACS/Census variables : 4"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 2"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  50%"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"

##    sumlevel geoid division mynewvar_est mynewvar_moe myvar_est myvar_moe
## 1:      030    NA        1       0.0762     0.000347    770306      3490
## 2:      030    NA        2       0.1182     0.000278   3332150      9171
## 3:      030    NA        3       0.0599     0.000196   1819417      7209
## 4:      030    NA        4       0.0411     0.000277    547577      4461
## 5:      030    NA        5       0.1108     0.000246   4526480     11869
## 6:      030    NA        6       0.0320     0.000265    402475      3781
## 7:      030    NA        7       0.2203     0.000469   5318126     13044
## 8:      030    NA        8       0.1582     0.000602   2279303     10746
## 9:      030    NA        9       0.2335     0.000501   7765838     20289

Standard errors

When computing statistics there are two ways to define the standard errors:

Including all standard errors of the variables used to compute a statistic (one.zero = FALSE)
Include all standard errors except those of variables that are equal to zero. Only the maximum standard error of the variables equal to zero is included (one.zero = TRUE)
The default value is one.zero = TRUE

For more details about how standard errors are computed for proportions, ratios and aggregations look at A Compass for Understanding and Using American Community Survey Data.

Below an example when estimating proportions and using one.zero = FALSE:

sumacs(formula = "(b16004_004 + b16004_026 + b16004_048) / b16004_001",
            varname =  "mynewvar",
            method = "prop",
            level = "tract",
            county = 1,
            tract = 950501,
            endyear = 2013,
            one.zero = FALSE)

## [1] "Extracting data from: acs 2013"
## [1] ". . . . . .  ACS/Census variables : 4"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 1"
## [1] ". . . . . .  Getting tract data"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"

##    sumlevel       geoid st_fips cnty_fips tract_fips mynewvar_est mynewvar_moe
## 1:      140 55001950501      55         1     950501       0.0226       0.0252

\[SE = \sqrt{ \frac{(5.47 ^ 2 + 22.49 ^ 2 + 5.47 ^ 2) - ( 0.02 ^ 2 \times 102.13 ^ 2)}{1546} } \times 1.645 = 0.0252\]

When one.zero = TRUE:

sumacs(formula = "(b16004_004 + b16004_026 + b16004_048) / b16004_001",
            varname = "mynewvar",
            method = "prop",
            level = "tract",
            county = 1,
            tract = 950501,
            endyear = 2013,
            one.zero = TRUE)

## [1] "Extracting data from: acs 2013"
## [1] ". . . . . .  ACS/Census variables : 4"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 1"
## [1] ". . . . . .  Getting tract data"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"

##    sumlevel       geoid st_fips cnty_fips tract_fips mynewvar_est mynewvar_moe
## 1:      140 55001950501      55         1     950501       0.0226       0.0245

\[SE_{\text{ one.zero}} \sqrt{ \frac{(5.47 ^ 2 + 22.49 ^ 2) - ( 0.02 ^ 2 \times 102.13 ^ 2)}{1546} } \times 1.645 = 0.0245\]

When the square root value in the standard error formula doesn’t exist (e.g., the square root of a negative number), the ratio formula is instead used. The ratio adjustment is done variable by variable .

It can also be that the one.zero option makes the square root undefinable. In those cases, the function uses again the ratio formula to compute standard errors. There is also a possibility that the standard error estimates using the ratio formula are higher than the proportion estimates without the one.zero option.

Decennial Data from the US Census

Let’s get the African American and Hispanic population by state. In this case, we don’t have any estimation of margin of error.

sumacs(formula = c("p0080004", "p0090002"),
            method = "variable",
            dataset = "sf1",
            level = "state",
            state = "*",
            endyear = 2010)

## [1] "Extracting data from: sf1 2010"
## [1] ". . . . . .  ACS/Census variables : 2"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 2"
## [1] ". . . . . .  Getting state data"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  50%"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"

##     sumlevel geoid st_fips p0080004 p0090002
##  1:      040    01      01  1251311   185602
##  2:      040    02      02    23263    39249
##  3:      040    04      04   259008  1895149
##  4:      040    05      05   449895   186050
##  5:      040    06      06  2299072 14013719
##  6:      040    08      08   201737  1038687
##  7:      040    09      09   362296   479087
##  8:      040    10      10   191814    73221
##  9:      040    11      11   305125    54749
## 10:      040    12      12  2999862  4223806
## 11:      040    13      13  2950435   853689
## 12:      040    15      15    21424   120842
## 13:      040    16      16     9810   175901
## 14:      040    17      17  1866414  2027578
## 15:      040    18      18   591397   389707
## 16:      040    19      19    89148   151544
## 17:      040    20      20   167864   300042
## 18:      040    21      21   337520   132836
## 19:      040    22      22  1452396   192560
## 20:      040    23      23    15707    16935
## 21:      040    24      24  1700298   470632
## 22:      040    25      25   434398   627654
## 23:      040    26      26  1400362   436358
## 24:      040    27      27   274412   250258
## 25:      040    28      28  1098385    81481
## 26:      040    29      29   693391   212470
## 27:      040    30      30     4027    28565
## 28:      040    31      31    82885   167405
## 29:      040    32      32   218626   716501
## 30:      040    33      33    15035    36704
## 31:      040    34      34  1204826  1555144
## 32:      040    35      35    42550   953403
## 33:      040    36      36  3073800  3416922
## 34:      040    37      37  2048628   800120
## 35:      040    38      38     7960    13467
## 36:      040    39      39  1407681   354674
## 37:      040    40      40   277644   332007
## 38:      040    41      41    69206   450062
## 39:      040    42      42  1377689   719660
## 40:      040    44      44    60189   130655
## 41:      040    45      45  1290684   235682
## 42:      040    46      46    10207    22119
## 43:      040    47      47  1057315   290059
## 44:      040    48      48  2979598  9460921
## 45:      040    49      49    29287   358340
## 46:      040    50      50     6277     9208
## 47:      040    51      51  1551399   631825
## 48:      040    53      53   240042   755790
## 49:      040    54      54    63124    22268
## 50:      040    55      55   359148   336056
## 51:      040    56      56     4748    50231
## 52:      040    72      72   461498  3688455
##     sumlevel geoid st_fips p0080004 p0090002

Output

The output can be formatted using a wide or long format:

sumacs(formula = "(b16004_004 + b16004_026 + b16004_048 / b16004_001)",
            varname = "mynewvar",
            method = "prop",
            level = "division",
            format.out = "long")

## [1] "Extracting data from: acs 2014"
## [1] ". . . . . .  ACS/Census variables : 4"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 1"
## [1] ". . . . . .  Getting division data"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"

##    geoid sumlevel division var_name    est      moe
## 1:    NA      030        1 mynewvar 0.0762 0.000347
## 2:    NA      030        2 mynewvar 0.1182 0.000278
## 3:    NA      030        3 mynewvar 0.0599 0.000196
## 4:    NA      030        4 mynewvar 0.0411 0.000277
## 5:    NA      030        5 mynewvar 0.1108 0.000246
## 6:    NA      030        6 mynewvar 0.0320 0.000265
## 7:    NA      030        7 mynewvar 0.2203 0.000469
## 8:    NA      030        8 mynewvar 0.1582 0.000602
## 9:    NA      030        9 mynewvar 0.2335 0.000501

And it can also be exported to a csv file:

sumacs(formula = "(b16004_004 + b16004_026 + b16004_048 / b16004_001)",
            varname = "mynewvar",
            method = "prop",
            level = "division",
            file = "myfile.out")

## [1] "Extracting data from: acs 2014"
## [1] ". . . . . .  ACS/Census variables : 4"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 1"
## [1] ". . . . . .  Getting division data"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"
## [1] "Data exported to a CSV file! "

Combining geographic levels

We can combine geographic levels using two methods: (1) sumacs and (2) combine.output. The first one allows only single combinations, the second multiple ones.

If I want to combine two states (e.g., Wisconsin and Minnesota) I can use:

sumacs("(b16004_004 + b16004_026 + b16004_048 / b16004_001)",
    varname = "mynewvar",
    method = "prop",
    level = "state",
    state = list("WI", "MN"),
    combine = TRUE,
    print.levels = FALSE)

## [1] "Extracting data from: acs 2014"
## [1] ". . . . . .  ACS/Census variables : 4"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 1"
## [1] ". . . . . .  Getting combined data"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"

##    geoid combined_group mynewvar_est mynewvar_moe
## 1:    NA      aggregate        0.042     0.000331

If I want to put together multiple combinations (e.g., groups of states):

combine.output("(b16004_004 + b16004_026 + b16004_048 / b16004_001)",
    varname = "mynewvar",
    method = "prop",
    level = list("state", "state"),
    state = list( list("WI", "MN"), list("CA", "OR")), # nested list
    combine.names = c("WI+MN", "CA+OR"),
    print.levels = FALSE)

## [1] ". . . . . .  Defining WI+MN"
## [1] "Extracting data from: acs 2014"
## [1] ". . . . . .  ACS/Census variables : 4"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 1"
## [1] ". . . . . .  Getting combined data"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"
## [1] ". . . . . .  Defining CA+OR"
## [1] "Extracting data from: acs 2014"
## [1] ". . . . . .  ACS/Census variables : 4"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 1"
## [1] ". . . . . .  Getting combined data"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"

##    combined_group mynewvar_est mynewvar_moe
## 1:          WI+MN        0.042     0.000331
## 2:          CA+OR        0.269     0.000565

A map?

Let’s color a map using poverty by county:

pov <- sumacs(formula = "b17001_002 / b17001_001 * 100",
        varname = c("pov"),
        method = c("prop"),
        level = c("county"),
        state = "*")

## [1] "Extracting data from: acs 2014"
## [1] ". . . . . .  ACS/Census variables : 2"
## [1] ". . . . . .  Levels : 1"
## [1] ". . . . . .  New variables : 1"
## [1] ". . . . . .  Getting county data"
## [1] ". . . . . .  Creating variables"
## [1] ". . . . . .  100%"
## [1] ". . . . . .  Formatting output"

library(choroplethr)
library(choroplethrMaps)
pov[, region := as.numeric(geoid)]
setnames(pov, "pov_est", "value")
county_choropleth(pov, num_colors = 5)

center

In sum, the acsr package:

Reads formulas directly and extracts any ACS/Census variable
Provides an automatized and tailored way to obtain indicators and MOEs
Allows different outputs’ formats (wide and long, csv)
Provides an easy way to adjust MOEs to different confidence levels
Includes a variable-by-variable ratio adjustment of standard errors
Includes the zero-option when computing standard errors for proportions, ratios, and aggregations
Combines geographic levels flexibly

Last Update: 02/07/2016