CE Estimates by Demographic Category • cepumd

Introduction

The below example shows how one can use cepumd to generate annual, weighted expenditure means and quantiles for the different categories of a grouping variable. This example will focus on getting these estimates for used cars and trucks using 2021 Interview data since reports of vehicle purchases are only taken from the Interview survey. It’s also useful to show how to calculate weighted quantiles of expenditures since this is only statistically valid when using a single survey instrument.

I’ll need the following data and metadata sources:

Interview survey data zip files for 2020 and 2021
Hierarchical grouping zip files
CE Data Dictionary

Data gathering

Data files can be downloaded from the CE PUMD Data Files page (it’s easiest to use CSV) and hierarchical grouping files can be downloaded from the CE PUMD Documentation page

To get the files I’ll first create a temporary directory and download all of the files that I need from the BLS website into this directory. You might choose to store your files differently, but this convention will keep the files organized, it will keep the code simple, and everything will be in a folder that will be easy to clean up after. Since the BLS blocks third party applications I’ll add a user-agent to identify myself in the download function that is stored in a variable called cepumd_ua (not shown).

ce_data_dir <- tempdir()

download.file(
  "https://www.bls.gov/cex/pumd/data/comma/intrvw21.zip",
  fs::path(ce_data_dir, "intrvw21.zip"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

download.file(
  "https://www.bls.gov/cex/pumd/data/comma/intrvw20.zip",
  fs::path(ce_data_dir, "intrvw20.zip"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

download.file(
  "https://www.bls.gov/cex/pumd/stubs.zip",
  fs::path(ce_data_dir, "stubs.zip"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

download.file(
  "https://www.bls.gov/cex/pumd/ce-pumd-interview-diary-dictionary.xlsx",
  fs::path(ce_data_dir, "ce-data-dict.xlsx"),
  mode = "wb",
  headers = list(
    "User-Agent" = cepumd_ua
  )
)

Calculate CE weighted mean estimate by urbanicity

Now that I have the files I’ll start the analysis. First I’ll load the hierarchical grouping file and filter it for categories of used cars or used trucks.

int21_hg <- ce_hg(
  2021,
  interview,
  hg_zip_path = file.path(ce_data_dir, "stubs.zip")
)

int21_hg |>
  filter(str_detect(title, "Used (cars|trucks)")) |>
  gt()

level	title	ucc	survey	factor
5	Used cars	460110	I	1
5	Used trucks	460901	I	1

These are the UCCs for “Used cars and trucks. I’ll use the code above to set the uccs argument in ce_prepdata(). I’ll also include the bls_urbn variable in the ... argument to get estimates by urbanicity.

car_data <- ce_prepdata(
  2021,
  interview,
  int21_hg,
  uccs = int21_hg |>
    filter(str_detect(title, "Used (cars|trucks)"), !is.na(as.numeric(ucc))) |>
    pull(ucc),
  bls_urbn,  # <------- this is the variable for urbanicity
  int_zp = c(
    file.path(ce_data_dir, "intrvw20.zip"),
    file.path(ce_data_dir, "intrvw21.zip")
  ),
  recode_variables = TRUE,
  dict_path = file.path(ce_data_dir, "ce-data-dict.xlsx")
)

gt(head(car_data))

newid	finlwt21	wtrep01	wtrep02	wtrep03	wtrep04	wtrep05	wtrep06	wtrep07	wtrep08	wtrep09	wtrep10	wtrep11	wtrep12	wtrep13	wtrep14	wtrep15	wtrep16	wtrep17	wtrep18	wtrep19	wtrep20	wtrep21	wtrep22	wtrep23	wtrep24	wtrep25	wtrep26	wtrep27	wtrep28	wtrep29	wtrep30	wtrep31	wtrep32	wtrep33	wtrep34	wtrep35	wtrep36	wtrep37	wtrep38	wtrep39	wtrep40	wtrep41	wtrep42	wtrep43	wtrep44	bls_urbn	ucc	ref_yr	ref_mo	survey
04378034	15009.70	26582.38	30220.87	28559.33	0.00	28263.97	0.00	26801.74	30392.73	0.00	0.00	35877.18	0.00	0.00	31583.15	0.00	28192.04	0.00	0.00	27742.65	32402.03	32193.33	0.00	33288.04	33269.07	28259.38	31891.03	32534.43	0.00	0.00	0.00	26770.84	0.00	24617.21	31914.74	31437.74	0.00	0.00	0.00	0.00	0.00	33248.90	0.00	0.00	0.00	Urban	NA	NA	NA	I
04378044	14659.23	0.00	24636.01	25415.50	0.00	29842.23	26240.00	31559.69	0.00	0.00	32690.11	0.00	35219.62	0.00	0.00	32209.17	28073.98	0.00	30630.13	28545.87	0.00	22419.07	0.00	29752.76	35818.35	0.00	0.00	0.00	37168.17	0.00	0.00	0.00	0.00	0.00	32609.68	28785.20	32743.62	0.00	30564.90	0.00	0.00	0.00	33512.77	26508.03	28093.74	Urban	NA	NA	NA	I
04378144	13807.65	0.00	28979.32	27083.40	26693.86	27992.45	29585.03	0.00	29732.86	29316.57	27431.31	0.00	0.00	28173.63	0.00	22879.56	0.00	0.00	25564.56	26377.60	0.00	26336.40	27843.16	0.00	26624.50	0.00	26128.58	27506.85	0.00	0.00	0.00	27602.16	0.00	0.00	0.00	0.00	0.00	28507.23	31304.32	25729.67	0.00	32459.95	0.00	0.00	0.00	Urban	NA	NA	NA	I
04378174	34366.94	0.00	56197.54	0.00	68378.57	0.00	0.00	64913.65	51360.36	0.00	68007.76	68969.61	0.00	61009.71	0.00	73813.78	67483.74	0.00	0.00	0.00	61308.65	0.00	0.00	0.00	0.00	0.00	64421.14	80364.05	63777.95	0.00	68784.96	0.00	0.00	0.00	66196.82	69101.02	68260.70	69157.21	59350.62	0.00	56558.47	61551.35	63870.17	0.00	0.00	Urban	NA	NA	NA	I
04378194	26543.42	48347.47	0.00	52400.57	0.00	0.00	0.00	54452.42	56357.94	0.00	49279.82	0.00	49634.22	50323.44	0.00	0.00	51177.03	0.00	0.00	54998.37	0.00	50254.05	0.00	0.00	49948.46	50022.11	50772.96	0.00	48660.00	49470.47	48256.39	56070.84	51642.24	0.00	0.00	0.00	51286.78	0.00	49888.79	47868.12	55679.25	0.00	0.00	0.00	0.00	Urban	NA	NA	NA	I
04378264	15160.89	0.00	26527.27	0.00	35462.68	28120.91	0.00	0.00	0.00	31427.12	0.00	0.00	0.00	0.00	0.00	38128.59	34121.49	33596.86	0.00	29419.94	0.00	0.00	0.00	37406.35	34331.83	35157.93	32163.44	27623.50	0.00	26467.51	30457.72	32260.50	0.00	0.00	29017.90	0.00	34166.82	0.00	0.00	33442.30	34997.28	0.00	26329.75	25890.87	0.00	Urban	NA	NA	NA	I

With the table output of ce_prepdata() one can now split up the data by the demographic variable of interest and operate on each of the subsets independently to calculate estimates. Below I’ll show the workflow for calculating estimated mean expenditures only for urban consumers.

car_data_urban <- filter(car_data, bls_urbn %in% "Urban")

ce_mean(car_data_urban) |>
  gt()

agg_exp	mean_exp	se	cv
310383802180	2472.104	152.2238	6.157661

Since I’m intersted in seeing this mean for all categories of urbanicity, in the next step I’ll nest the car data by urbanicity in order to be able to operate on subsets of the data independently. For this I’ll use the nest() function from the tidyr package.

car_data_nested <- nest(car_data, .by = "bls_urbn")

car_data_nested
#> # A tibble: 2 × 2
#>   bls_urbn data                  
#>   <fct>    <list>                
#> 1 Urban    <tibble [23,945 × 53]>
#> 2 Rural    <tibble [1,600 × 53]>

With the data now grouped and subset by urbanicity I’ll now use map() from the purrr package to apply ce_mean() to the subsets of data.

car_data_nested |>
  mutate(ce_ann_means = map(data, ce_mean)) |>
  select(-data) |>
  unnest(ce_ann_means) |>
  gt()

bls_urbn	agg_exp	mean_exp	se	cv
Urban	310383802180	2472.104	152.2238	6.157661
Rural	30912654497	3844.600	683.8189	17.786476

Getting the annual, weighted estimate of the median (or another quantile) would be just as easy. Here I’ll calculate the first percentile the median, the 95th percentile, and the 99.1 percentile through 99.9 percentile for the overall sample.

ce_quantiles(
  car_data,
  probs = c(0.01, 0.5, 0.95, seq(0.99, 0.999, by = 0.001))
) |>
  gt()

probs	quantile
1.0%	0
50.0%	0
95.0%	0
99.0%	20000
99.1%	21694
99.2%	23271
99.3%	25000
99.4%	27005
99.5%	29953
99.6%	32000
99.7%	35000
99.8%	40000
99.9%	47000

Calculate CE weighted quantile estimates by urbanicity

Performing the operation by urbanicity would be similar to the way it’s done with nested means. Here I’ll get the same quantiles by urbanicity and pivot the table to compare the categories side-by-side using the nested data.

car_data_nested |>
  mutate(
    ce_ann_quantiles = map(
      data,
      \(x) ce_quantiles(
        x,
        probs = c(0.01, 0.50, 0.95, seq(0.99, 0.999, by = 0.001))
      )
    )
  ) |>
  select(-data) |>
  unnest(ce_ann_quantiles) |>
  pivot_wider(values_from = quantile, names_from = bls_urbn) |>
  gt()

probs	Urban	Rural
1.0%	0	0
50.0%	0	0
95.0%	0	0
99.0%	20000	26100
99.1%	21000	28000
99.2%	23000	28000
99.3%	25000	32000
99.4%	26780	33536
99.5%	29000	37500
99.6%	31500	45000
99.7%	35000	45000
99.8%	39000	52430
99.9%	45678	54639

Clean-up

Finally, now that the analysis is done, I’ll delete the temporary directory that contains all the CE data.

unlink(ce_data_dir, recursive = TRUE, force = TRUE)