Skip to content

UI-Research/urbnindicators

Repository files navigation

urbnindicators

Lifecycle: experimental License: GPL v3 Codecov test coverage

Overview

urbnindicators aims to provide users with analysis-ready data from the American Community Survey (ACS).

With a single function call, you get:

  • Access to hundreds of standardized variables, such as percentages and the raw count variables used to produce them.

  • Margins of error for all variables–those direct from the API as well as derived variables.

  • Meaningful, consistent variable names.

  • A codebook that describes how each variable is calculated.

  • The built-in capacity to pull data for multiple years and multiple states.

  • Supplemental measures, such as population density, that aren’t available from the ACS.

  • Built-in quality checks to help ensure that calculated variables and measures of error are accurate. Plus some good, old-fashioned manual QC. That said–use at your own risk. We cannot and do not guarantee there aren’t bugs.

Installation

Install the development version of urbnindicators from GitHub with:

# install.packages("renv")
renv::install("UI-Research/urbnindicators")

You’ll want a Census API key (request one here). Set it once with:

tidycensus::census_api_key("YOUR_KEY", install = TRUE)

Note that this package is under active development with frequent updates–check to ensure you have the most recent version installed!

Use

Discover Available Data

list_tables() |> head(10)
#>  [1] "age"                    "computing_devices"      "cost_burden"           
#>  [4] "disability"             "educational_attainment" "employment"            
#>  [7] "gini"                   "health_insurance"       "household_size"        
#> [10] "income_quintiles"

Obtain Data

A single call to compile_acs_data() returns analysis-ready data with pre-computed percentages, meaningful variable names, and margins of error:

df = compile_acs_data(
  tables = "race",
  years = c(2019, 2024),
  geography = "county",
  states = "NJ")

df %>%
  select(1:10) %>%
  glimpse()
#> Rows: 42
#> Columns: 10
#> $ data_source_year             <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019,…
#> $ GEOID                        <chr> "34025", "34037", "34013", "34015", "3403…
#> $ NAME                         <chr> "Monmouth County, New Jersey", "Sussex Co…
#> $ total_population_universe    <dbl> 621659, 141483, 795404, 291165, 503637, 9…
#> $ race_universe                <dbl> 621659, 141483, 795404, 291165, 503637, 9…
#> $ race_nonhispanic_allraces    <dbl> 554491, 129866, 612222, 273106, 294434, 7…
#> $ race_nonhispanic_white_alone <dbl> 467752, 122081, 242965, 228576, 208005, 5…
#> $ race_nonhispanic_black_alone <dbl> 41697, 2991, 305796, 28452, 52523, 49249,…
#> $ race_nonhispanic_aian_alone  <dbl> 440, 16, 1107, 204, 651, 1000, 123, 191, …
#> $ race_nonhispanic_asian_alone <dbl> 33451, 2887, 41976, 9002, 25732, 151090, …

Visualize Data

compile_acs_data() makes it easy to pull multiple years and produce publication-ready visualizations:

plot_data = df %>%
  transmute(
    county_name = NAME %>% str_remove(" County, New Jersey"),
    race_personofcolor_percent,
    race_personofcolor_percent_M,
    data_source_year = factor(data_source_year))

state_averages = plot_data %>%
  summarize(
    .by = data_source_year,
    mean_pct = mean(race_personofcolor_percent)) %>%
  arrange(data_source_year) %>%
  pull(mean_pct)

## order counties by 2019 value for the dumbbell plot
county_order = plot_data %>%
  filter(data_source_year == "2019") %>%
  arrange(race_personofcolor_percent) %>%
  pull(county_name)

plot_data = plot_data %>%
  mutate(county_name = factor(county_name, levels = county_order))

dumbbell_data = plot_data %>%
  pivot_wider(
    id_cols = county_name,
    names_from = data_source_year,
    values_from = race_personofcolor_percent,
    names_prefix = "year_")

ggplot() +
  geom_segment(
    data = dumbbell_data,
    aes(
      x = county_name,
      y = year_2019,
      yend = year_2024),
    color = palette_urbn_main[7],
    linewidth = 1) +
  ggdist::stat_gradientinterval(
    data = plot_data,
    aes(
      x = county_name,
      ydist = distributional::dist_normal(
        race_personofcolor_percent,
        race_personofcolor_percent_M / 1.645),
      color = data_source_year),
    point_size = 2,
    .width = .95) +
  geom_hline(
    yintercept = state_averages[1],
    linetype = "dashed",
    color = palette_urbn_main[1]) +
  geom_hline(
    yintercept = state_averages[2],
    linetype = "dashed",
    color = palette_urbn_main[2]) +
  annotate(
    "text",
    y = state_averages[1] - .15,
    x = 21.5,
    label = "State mean (2019)",
    fontface = "bold.italic",
    color = palette_urbn_main[1],
    size = 9 / .pt,
    hjust = 0,
    nudge_y = .01) +
  annotate(
    "text",
    y = state_averages[2] + .01,
    x = 21.5,
    label = "State mean (2024)",
    fontface = "bold.italic",
    color = palette_urbn_main[2],
    size = 9 / .pt,
    hjust = 0,
    nudge_y = .01) +
  labs(
    title = "All NJ Counties Experienced Racial Diversification from 2019 to 2024",
    subtitle = paste0("Share of population who are people of color, by county, 2019-2024
Confidence intervals are presented around each point but are extremely small"),
    x = "",
    y = "Share of population who are people of color") +
  scale_x_discrete(expand = expansion(mult = c(.03, .04))) +
  scale_y_continuous(
    breaks = c(0, .25, .50, .75, 1.0),
    limits = c(0, .75),
    labels = scales::percent) +
  coord_flip() +
  theme_urbn_print()

Custom Geographies

ACS data are available for standard geographies (tracts, counties, states, etc.), but many analyses require non-standard areas like neighborhoods, school zones, or planning districts. calculate_custom_geographies() aggregates tract-level data to any user-defined geography, properly re-deriving percentages and propagating margins of error:

dc_tracts = compile_acs_data(
  tables = "snap",
  years = 2024,
  geography = "tract",
  states = "DC",
  spatial = TRUE)

## assign each tract to a quadrant based on its centroid
dc_tracts = dc_tracts %>%
  mutate(
    centroid = sf::st_centroid(geometry),
    lon = sf::st_coordinates(centroid)[, 1],
    lat = sf::st_coordinates(centroid)[, 2],
    quadrant = case_when(
      lon <  median(lon) & lat >= median(lat) ~ "NW",
      lon >= median(lon) & lat >= median(lat) ~ "NE",
      lon <  median(lon) & lat <  median(lat) ~ "SW",
      lon >= median(lon) & lat <  median(lat) ~ "SE")) %>%
  select(-centroid, -lon, -lat)

## aggregate tracts to quadrants
dc_quadrants = calculate_custom_geographies(
  .data = dc_tracts,
  group_id = "quadrant",
  spatial = TRUE)

dc_quadrants %>%
  sf::st_drop_geometry() %>%
  select(GEOID, snap_received_percent, snap_received_percent_M)
#>   GEOID snap_received_percent snap_received_percent_M
#> 1    NE            0.15951925             0.019448994
#> 2    NW            0.07036185             0.006889427
#> 3    SE            0.24445974             0.012073306
#> 4    SW            0.06525691             0.012003668

See vignette("custom-geographies") for more.

Custom Derived Variables

Beyond the package’s built-in tables, you can define your own derived variables using the define_*() helpers and pass them directly to compile_acs_data(). Your custom variables automatically get codebook entries and margins of error:

df = compile_acs_data(
  tables = list(
    "snap",
    define_percent(
      "snap_not_received_percent",
      numerator_variables = c("snap_universe"),
      numerator_subtract_variables = c("snap_received"),
      denominator_variables = c("snap_universe")),
    define_one_minus(
      "snap_received_complement",
      source_variable = "snap_received_percent")),
  years = 2024,
  geography = "county",
  states = "DC")

df %>%
  select(matches("snap.*percent")) %>%
  glimpse()
#> Rows: 1
#> Columns: 4
#> $ snap_received_percent       <dbl> 0.143
#> $ snap_not_received_percent   <dbl> 0.857
#> $ snap_received_percent_M     <dbl> 0.0064
#> $ snap_not_received_percent_M <dbl> 0.0071

The available helpers are:

Helper Use case
define_percent() Ratio of a numerator to a denominator
define_across_percent() Percentages for every column matching a regex
define_across_sum() Sum paired columns (e.g., male + female counts)
define_one_minus() Complement of an existing percentage (1 - x)
define_metadata() Codebook-only entry for a non-computed variable

See vignette("custom-derived-variables") for detailed examples of each helper.

Learn More

Check out the vignettes for additional details:

  • A package overview to help users Get Started.

  • An interactive version of the package’s Codebook so that prospective users can know what to expect.

  • A brief description of the package’s Design Philosophy to clarify the use-cases that urbnindicators is built to support.

  • An illustration of how Quantifying Survey Error can improve inference making.

  • You can re-create your indicators and their measures of error for Custom Geographies. Neighborhoods? Unincorporated counties? Start here.

  • A guide to defining Custom Derived Variables using the define_*() helpers.

Credits

This package is built on top of and enormously indebted to library(tidycensus), which provides the core functionality for accessing the Census Bureau API. For users who want additional variables, library(tidycensus) exposes the entire range of pre-tabulated variables available from the ACS and provides access to ACS microdata and other Census Bureau datasets.

Learn more here: https://walker-data.com/tidycensus/index.html.

About

Out of the Box Social Science Indicators from the American Community Survey (ACS)

Resources

License

Unknown, GPL-3.0 licenses found

Licenses found

Unknown
LICENSE
GPL-3.0
LICENSE.md

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages