urbnindicators aims to provide users with analysis-ready data from the American Community Survey (ACS).
With a single function call, you get:
- Access to hundreds of standardized variables, such as percentages and the raw count variables used to produce them.
- Margins of error for all variables, both those returned directly by the API and those for derived variables.
- Meaningful, consistent variable names.
- A codebook that describes how each variable is calculated.
- The built-in capacity to pull data for multiple years and multiple states.
- Supplemental measures, such as population density, that aren't available from the ACS.
- Built-in quality checks to help ensure that calculated variables and measures of error are accurate, plus some good, old-fashioned manual QC. That said, use at your own risk: we cannot and do not guarantee there aren't bugs.
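For context on how margins of error for derived percentages can be computed, the Census Bureau's standard approximation for the MOE of a proportion is sketched below. This is an illustration of the underlying arithmetic, not the package's internal code, and `moe_prop` is a hypothetical helper name:

``` r
# Census Bureau approximation for the MOE of a derived proportion p = x / y:
#   MOE_p = sqrt(MOE_x^2 - p^2 * MOE_y^2) / y
# If the radicand is negative, the ratio formula (a "+" under the root)
# is used instead, per standard ACS guidance.
moe_prop <- function(x, y, moe_x, moe_y) {
  p <- x / y
  radicand <- moe_x^2 - p^2 * moe_y^2
  if (radicand < 0) radicand <- moe_x^2 + p^2 * moe_y^2  # ratio fallback
  sqrt(radicand) / y
}

# e.g., 120 +/- 30 SNAP recipients out of 1,000 +/- 50 households:
moe_prop(120, 1000, 30, 50)
#> [1] 0.02939388
```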
Install the development version of urbnindicators from GitHub with:

``` r
# install.packages("renv")
renv::install("UI-Research/urbnindicators")
```

You'll want a Census API key (request one here). Set it once with:

``` r
tidycensus::census_api_key("YOUR_KEY", install = TRUE)
```

Note that this package is under active development with frequent updates; check to ensure you have the most recent version installed!
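With `install = TRUE`, tidycensus writes the key to your `.Renviron` as the `CENSUS_API_KEY` environment variable, so you can confirm it is available in a fresh session:

``` r
# After restarting R, the installed key should be readable from the environment
# (an empty string means no key has been installed yet):
Sys.getenv("CENSUS_API_KEY")
```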
``` r
list_tables() |> head(10)
#>  [1] "age"                    "computing_devices"      "cost_burden"
#>  [4] "disability"             "educational_attainment" "employment"
#>  [7] "gini"                   "health_insurance"       "household_size"
#> [10] "income_quintiles"
```

A single call to `compile_acs_data()` returns analysis-ready data with pre-computed percentages, meaningful variable names, and margins of error:
``` r
df = compile_acs_data(
  tables = "race",
  years = c(2019, 2024),
  geography = "county",
  states = "NJ")

df %>%
  select(1:10) %>%
  glimpse()
#> Rows: 42
#> Columns: 10
#> $ data_source_year             <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019,…
#> $ GEOID                        <chr> "34025", "34037", "34013", "34015", "3403…
#> $ NAME                         <chr> "Monmouth County, New Jersey", "Sussex Co…
#> $ total_population_universe    <dbl> 621659, 141483, 795404, 291165, 503637, 9…
#> $ race_universe                <dbl> 621659, 141483, 795404, 291165, 503637, 9…
#> $ race_nonhispanic_allraces    <dbl> 554491, 129866, 612222, 273106, 294434, 7…
#> $ race_nonhispanic_white_alone <dbl> 467752, 122081, 242965, 228576, 208005, 5…
#> $ race_nonhispanic_black_alone <dbl> 41697, 2991, 305796, 28452, 52523, 49249,…
#> $ race_nonhispanic_aian_alone  <dbl> 440, 16, 1107, 204, 651, 1000, 123, 191, …
#> $ race_nonhispanic_asian_alone <dbl> 33451, 2887, 41976, 9002, 25732, 151090, …
```

`compile_acs_data()` makes it easy to pull multiple years and produce publication-ready visualizations:
``` r
plot_data = df %>%
  transmute(
    county_name = NAME %>% str_remove(" County, New Jersey"),
    race_personofcolor_percent,
    race_personofcolor_percent_M,
    data_source_year = factor(data_source_year))

state_averages = plot_data %>%
  summarize(
    .by = data_source_year,
    mean_pct = mean(race_personofcolor_percent)) %>%
  arrange(data_source_year) %>%
  pull(mean_pct)

## order counties by 2019 value for the dumbbell plot
county_order = plot_data %>%
  filter(data_source_year == "2019") %>%
  arrange(race_personofcolor_percent) %>%
  pull(county_name)

plot_data = plot_data %>%
  mutate(county_name = factor(county_name, levels = county_order))

dumbbell_data = plot_data %>%
  pivot_wider(
    id_cols = county_name,
    names_from = data_source_year,
    values_from = race_personofcolor_percent,
    names_prefix = "year_")

ggplot() +
  geom_segment(
    data = dumbbell_data,
    aes(
      x = county_name,
      y = year_2019,
      yend = year_2024),
    color = palette_urbn_main[7],
    linewidth = 1) +
  ggdist::stat_gradientinterval(
    data = plot_data,
    aes(
      x = county_name,
      ## convert the 90% MOE to a standard error to parameterize the normal
      ydist = distributional::dist_normal(
        race_personofcolor_percent,
        race_personofcolor_percent_M / 1.645),
      color = data_source_year),
    point_size = 2,
    .width = .95) +
  geom_hline(
    yintercept = state_averages[1],
    linetype = "dashed",
    color = palette_urbn_main[1]) +
  geom_hline(
    yintercept = state_averages[2],
    linetype = "dashed",
    color = palette_urbn_main[2]) +
  annotate(
    "text",
    y = state_averages[1] - .15,
    x = 21.5,
    label = "State mean (2019)",
    fontface = "bold.italic",
    color = palette_urbn_main[1],
    size = 9 / .pt,
    hjust = 0) +
  annotate(
    "text",
    y = state_averages[2] + .01,
    x = 21.5,
    label = "State mean (2024)",
    fontface = "bold.italic",
    color = palette_urbn_main[2],
    size = 9 / .pt,
    hjust = 0) +
  labs(
    title = "All NJ Counties Experienced Racial Diversification from 2019 to 2024",
    subtitle = "Share of population who are people of color, by county, 2019-2024
Confidence intervals are presented around each point but are extremely small",
    x = "",
    y = "Share of population who are people of color") +
  scale_x_discrete(expand = expansion(mult = c(.03, .04))) +
  scale_y_continuous(
    breaks = c(0, .25, .50, .75, 1.0),
    limits = c(0, .75),
    labels = scales::percent) +
  coord_flip() +
  theme_urbn_print()
```

ACS data are available for standard geographies (tracts, counties, states, etc.), but many analyses require non-standard areas like neighborhoods, school zones, or planning districts. `calculate_custom_geographies()` aggregates tract-level data to any user-defined geography, properly re-deriving percentages and propagating margins of error:
``` r
dc_tracts = compile_acs_data(
  tables = "snap",
  years = 2024,
  geography = "tract",
  states = "DC",
  spatial = TRUE)

## assign each tract to a quadrant based on its centroid
dc_tracts = dc_tracts %>%
  mutate(
    centroid = sf::st_centroid(geometry),
    lon = sf::st_coordinates(centroid)[, 1],
    lat = sf::st_coordinates(centroid)[, 2],
    quadrant = case_when(
      lon < median(lon) & lat >= median(lat) ~ "NW",
      lon >= median(lon) & lat >= median(lat) ~ "NE",
      lon < median(lon) & lat < median(lat) ~ "SW",
      lon >= median(lon) & lat < median(lat) ~ "SE")) %>%
  select(-centroid, -lon, -lat)

## aggregate tracts to quadrants
dc_quadrants = calculate_custom_geographies(
  .data = dc_tracts,
  group_id = "quadrant",
  spatial = TRUE)

dc_quadrants %>%
  sf::st_drop_geometry() %>%
  select(GEOID, snap_received_percent, snap_received_percent_M)
#>   GEOID snap_received_percent snap_received_percent_M
#> 1    NE            0.15951925             0.019448994
#> 2    NW            0.07036185             0.006889427
#> 3    SE            0.24445974             0.012073306
#> 4    SW            0.06525691             0.012003668
```

See `vignette("custom-geographies")` for more.
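When counts are aggregated across tracts, the standard approximation for the margin of error of a sum is the root-sum-of-squares of the component MOEs. The sketch below illustrates that step under the standard ACS approximation; it is not the package's actual internals, and `aggregate_moe` is a hypothetical function name:

``` r
# MOE of a sum of independent estimates: sqrt of the sum of squared MOEs.
aggregate_moe <- function(estimates, moes) {
  c(estimate = sum(estimates), moe = sqrt(sum(moes^2)))
}

# e.g., three tracts with counts 100, 250, 80 and MOEs 20, 35, 15:
aggregate_moe(c(100, 250, 80), c(20, 35, 15))
#> estimate      moe
#> 430.00000 43.01163
```

Note that this approximation can overstate precision when component estimates are correlated, which is one reason packages re-derive percentage MOEs after aggregation rather than averaging tract-level percentages.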
Beyond the package's built-in tables, you can define your own derived variables using the `define_*()` helpers and pass them directly to `compile_acs_data()`. Your custom variables automatically get codebook entries and margins of error:
``` r
df = compile_acs_data(
  tables = list(
    "snap",
    define_percent(
      "snap_not_received_percent",
      numerator_variables = c("snap_universe"),
      numerator_subtract_variables = c("snap_received"),
      denominator_variables = c("snap_universe")),
    define_one_minus(
      "snap_received_complement",
      source_variable = "snap_received_percent")),
  years = 2024,
  geography = "county",
  states = "DC")

df %>%
  select(matches("snap.*percent")) %>%
  glimpse()
#> Rows: 1
#> Columns: 4
#> $ snap_received_percent       <dbl> 0.143
#> $ snap_not_received_percent   <dbl> 0.857
#> $ snap_received_percent_M     <dbl> 0.0064
#> $ snap_not_received_percent_M <dbl> 0.0071
```

The available helpers are:
| Helper | Use case |
|---|---|
| `define_percent()` | Ratio of a numerator to a denominator |
| `define_across_percent()` | Percentages for every column matching a regex |
| `define_across_sum()` | Sum paired columns (e.g., male + female counts) |
| `define_one_minus()` | Complement of an existing percentage (1 - x) |
| `define_metadata()` | Codebook-only entry for a non-computed variable |
See vignette("custom-derived-variables") for detailed examples of each
helper.
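As a mental model, the percentage that `define_percent()` specifies is the sum of the numerator variables, minus any subtracted variables, divided by the sum of the denominator variables (the helper itself also generates the codebook entry and margin of error). A plain-R sketch of that arithmetic, using illustrative counts and a hypothetical function name:

``` r
# The arithmetic encoded by a define_percent() specification:
# (sum(numerators) - sum(subtracted)) / sum(denominators)
percent_from_parts <- function(numerators, denominators, subtract = 0) {
  (sum(numerators) - sum(subtract)) / sum(denominators)
}

# Reproducing the snap_not_received_percent logic from the example above,
# with made-up counts (1,000 households in the universe, 143 receiving SNAP):
percent_from_parts(1000, 1000, subtract = 143)
#> [1] 0.857
```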
Check out the vignettes for additional details:
- A package overview to help users Get Started.
- An interactive version of the package's Codebook so that prospective users can know what to expect.
- A brief description of the package's Design Philosophy to clarify the use-cases that urbnindicators is built to support.
- An illustration of how Quantifying Survey Error can improve inference-making.
- You can re-create your indicators and their measures of error for Custom Geographies. Neighborhoods? Unincorporated counties? Start here.
- A guide to defining Custom Derived Variables using the `define_*()` helpers.
This package is built on top of and enormously indebted to
library(tidycensus), which provides the core functionality for
accessing the Census Bureau API. For users who want additional
variables, library(tidycensus) exposes the entire range of
pre-tabulated variables available from the ACS and provides access to
ACS microdata and other Census Bureau datasets.
Learn more here: https://walker-data.com/tidycensus/index.html.
