Skip to contents

Introduction

The below example shows how one can use cepumd to generate annual, weighted expenditure means and quantiles for the different categories of a grouping variable. This example will focus on getting these estimates for used cars and trucks using 2021 Interview data since reports of vehicle purchases are only taken from the Interview survey. It’s also useful to show how to calculate weighted quantiles of expenditures since this is only statistically valid when using a single survey instrument.

I’ll need the following data and metadata sources:

  • Interview survey data zip files for 2020 and 2021
  • Hierarchical grouping zip files
  • CE Data Dictionary

Data gathering

Data files can be downloaded from the CE PUMD Data Files page (it’s easiest to use CSV) and hierarchical grouping files can be downloaded from the CE PUMD Documentation page

To get the files I’ll first create a temporary directory and download all of the files that I need from the BLS website into this directory. You might choose to store your files differently, but this convention will keep the files organized, it will keep the code simple, and everything will be in a folder that will be easy to clean up after. Since the BLS blocks third party applications I’ll add a user-agent to identify myself in the download function that is stored in a variable called cepumd_ua (not shown).

ce_data_dir <- tempdir()

download.file(
  "https://www.bls.gov/cex/pumd/data/comma/intrvw21.zip",
  fs::path(ce_data_dir, "intrvw21.zip"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

download.file(
  "https://www.bls.gov/cex/pumd/data/comma/intrvw20.zip",
  fs::path(ce_data_dir, "intrvw20.zip"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

download.file(
  "https://www.bls.gov/cex/pumd/stubs.zip",
  fs::path(ce_data_dir, "stubs.zip"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

download.file(
  "https://www.bls.gov/cex/pumd/ce-pumd-interview-diary-dictionary.xlsx",
  fs::path(ce_data_dir, "ce-data-dict.xlsx"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

Calculate CE weighted mean estimate by urbanicity

Now that I have the files I’ll start the analysis. First I’ll load the hierarchical grouping file and filter it for categories of used cars or used trucks.

int21_hg <- ce_hg(
  2021,
  interview,
  hg_zip_path = file.path(ce_data_dir, "stubs.zip")
)

int21_hg |>
  filter(str_detect(title, "Used (cars|trucks)")) |>
  gt()
level title ucc survey factor
5 Used cars 460110 I 1
5 Used trucks 460901 I 1

These are the UCCs for “Used cars and trucks. I’ll use the code above to set the uccs argument in ce_prepdata(). I’ll also include the bls_urbn variable in the ... argument to get estimates by urbanicity.

car_data <- ce_prepdata(
  2021,
  interview,
  int21_hg,
  uccs = int21_hg |>
    filter(str_detect(title, "Used (cars|trucks)"), !is.na(as.numeric(ucc))) |>
    pull(ucc),
  bls_urbn,  # <------- this is the variable for urbanicity
  int_zp = c(
    file.path(ce_data_dir, "intrvw20.zip"),
    file.path(ce_data_dir, "intrvw21.zip")
  ),
  recode_variables = TRUE,
  dict_path = file.path(ce_data_dir, "ce-data-dict.xlsx")
)

gt(head(car_data))
newid finlwt21 wtrep01 wtrep02 wtrep03 wtrep04 wtrep05 wtrep06 wtrep07 wtrep08 wtrep09 wtrep10 wtrep11 wtrep12 wtrep13 wtrep14 wtrep15 wtrep16 wtrep17 wtrep18 wtrep19 wtrep20 wtrep21 wtrep22 wtrep23 wtrep24 wtrep25 wtrep26 wtrep27 wtrep28 wtrep29 wtrep30 wtrep31 wtrep32 wtrep33 wtrep34 wtrep35 wtrep36 wtrep37 wtrep38 wtrep39 wtrep40 wtrep41 wtrep42 wtrep43 wtrep44 bls_urbn mo_scope popwt ucc ref_yr ref_mo cost survey
04378034 15009.70 26582.38 30220.87 28559.33 0.00 28263.97 0.00 26801.74 30392.73 0.00 0.00 35877.18 0.00 0.00 31583.15 0.00 28192.04 0.00 0.00 27742.65 32402.03 32193.33 0.00 33288.04 33269.07 28259.38 31891.03 32534.43 0.00 0.00 0.00 26770.84 0.00 24617.21 31914.74 31437.74 0.00 0.00 0.00 0.00 0.00 33248.90 0.00 0.00 0.00 Urban 0 0 NA NA NA 0 I
04378044 14659.23 0.00 24636.01 25415.50 0.00 29842.23 26240.00 31559.69 0.00 0.00 32690.11 0.00 35219.62 0.00 0.00 32209.17 28073.98 0.00 30630.13 28545.87 0.00 22419.07 0.00 29752.76 35818.35 0.00 0.00 0.00 37168.17 0.00 0.00 0.00 0.00 0.00 32609.68 28785.20 32743.62 0.00 30564.90 0.00 0.00 0.00 33512.77 26508.03 28093.74 Urban 0 0 NA NA NA 0 I
04378144 13807.65 0.00 28979.32 27083.40 26693.86 27992.45 29585.03 0.00 29732.86 29316.57 27431.31 0.00 0.00 28173.63 0.00 22879.56 0.00 0.00 25564.56 26377.60 0.00 26336.40 27843.16 0.00 26624.50 0.00 26128.58 27506.85 0.00 0.00 0.00 27602.16 0.00 0.00 0.00 0.00 0.00 28507.23 31304.32 25729.67 0.00 32459.95 0.00 0.00 0.00 Urban 0 0 NA NA NA 0 I
04378174 34366.94 0.00 56197.54 0.00 68378.57 0.00 0.00 64913.65 51360.36 0.00 68007.76 68969.61 0.00 61009.71 0.00 73813.78 67483.74 0.00 0.00 0.00 61308.65 0.00 0.00 0.00 0.00 0.00 64421.14 80364.05 63777.95 0.00 68784.96 0.00 0.00 0.00 66196.82 69101.02 68260.70 69157.21 59350.62 0.00 56558.47 61551.35 63870.17 0.00 0.00 Urban 0 0 NA NA NA 0 I
04378194 26543.42 48347.47 0.00 52400.57 0.00 0.00 0.00 54452.42 56357.94 0.00 49279.82 0.00 49634.22 50323.44 0.00 0.00 51177.03 0.00 0.00 54998.37 0.00 50254.05 0.00 0.00 49948.46 50022.11 50772.96 0.00 48660.00 49470.47 48256.39 56070.84 51642.24 0.00 0.00 0.00 51286.78 0.00 49888.79 47868.12 55679.25 0.00 0.00 0.00 0.00 Urban 0 0 NA NA NA 0 I
04378264 15160.89 0.00 26527.27 0.00 35462.68 28120.91 0.00 0.00 0.00 31427.12 0.00 0.00 0.00 0.00 0.00 38128.59 34121.49 33596.86 0.00 29419.94 0.00 0.00 0.00 37406.35 34331.83 35157.93 32163.44 27623.50 0.00 26467.51 30457.72 32260.50 0.00 0.00 29017.90 0.00 34166.82 0.00 0.00 33442.30 34997.28 0.00 26329.75 25890.87 0.00 Urban 0 0 NA NA NA 0 I

With the table output of ce_prepdata() one can now split up the data by the demographic variable of interest and operate on each of the subsets independently to calculate estimates. Below I’ll show the workflow for calculating estimated mean expenditures only for urban consumers.

car_data_urban <- filter(car_data, bls_urbn %in% "Urban")

ce_mean(car_data_urban) |>
  gt()
agg_exp mean_exp se cv
310383802180 2472.104 152.2238 6.157661

Since I’m intersted in seeing this mean for all categories of urbanicity, in the next step I’ll nest the car data by urbanicity in order to be able to operate on subsets of the data independently. For this I’ll use the nest() function from the tidyr package.

car_data_nested <- nest(car_data, .by = "bls_urbn")

car_data_nested
#> # A tibble: 2 × 2
#>   bls_urbn data                  
#>   <fct>    <list>                
#> 1 Urban    <tibble [23,945 × 53]>
#> 2 Rural    <tibble [1,600 × 53]>

With the data now grouped and subset by urbanicity I’ll now use map() from the purrr package to apply ce_mean() to the subsets of data.

car_data_nested |>
  mutate(ce_ann_means = map(data, ce_mean)) |>
  select(-data) |>
  unnest(ce_ann_means) |>
  gt()
bls_urbn agg_exp mean_exp se cv
Urban 310383802180 2472.104 152.2238 6.157661
Rural 30912654497 3844.600 683.8189 17.786476

Getting the annual, weighted estimate of the median (or another quantile) would be just as easy. Here I’ll calculate the first percentile the median, the 95th percentile, and the 99.1 percentile through 99.9 percentile for the overall sample.

ce_quantiles(
  car_data,
  probs = c(0.01, 0.5, 0.95, seq(0.99, 0.999, by = 0.001))
) |>
  gt()
probs quantile
1.0% 0
50.0% 0
95.0% 0
99.0% 20000
99.1% 21694
99.2% 23271
99.3% 25000
99.4% 27005
99.5% 29953
99.6% 32000
99.7% 35000
99.8% 40000
99.9% 47000

Calculate CE weighted quantile estimates by urbanicity

Performing the operation by urbanicity would be similar to the way it’s done with nested means. Here I’ll get the same quantiles by urbanicity and pivot the table to compare the categories side-by-side using the nested data.

car_data_nested |>
  mutate(
    ce_ann_quantiles = map(
      data,
      \(x) ce_quantiles(
        x,
        probs = c(0.01, 0.50, 0.95, seq(0.99, 0.999, by = 0.001))
      )
    )
  ) |>
  select(-data) |>
  unnest(ce_ann_quantiles) |>
  pivot_wider(values_from = quantile, names_from = bls_urbn) |>
  gt()
probs Urban Rural
1.0% 0 0
50.0% 0 0
95.0% 0 0
99.0% 20000 26100
99.1% 21000 28000
99.2% 23000 28000
99.3% 25000 32000
99.4% 26780 33536
99.5% 29000 37500
99.6% 31500 45000
99.7% 35000 45000
99.8% 39000 52430
99.9% 45678 54639

Clean-up

Finally, now that the analysis is done, I’ll delete the temporary directory that contains all the CE data.

unlink(ce_data_dir, recursive = TRUE, force = TRUE)