CE Estimates by Demographic Category
Source:vignettes/articles/ce-estimates-by-demographics.Rmd
ce-estimates-by-demographics.Rmd
Introduction
The below example shows how one can use cepumd
to
generate annual, weighted expenditure means and quantiles for the
different categories of a grouping variable. This example will focus on
getting these estimates for used cars and trucks using 2021 Interview
data since reports of vehicle purchases are only taken from the
Interview survey. It’s also useful to show how to calculate weighted
quantiles of expenditures since this is only statistically valid when
using a single survey instrument.
I’ll need the following data and metadata sources:
- Interview survey data zip files for 2020 and 2021
- Hierarchical grouping zip files
- CE Data Dictionary
Data gathering
Data files can be downloaded from the CE PUMD Data Files page (it’s easiest to use CSV) and hierarchical grouping files can be downloaded from the CE PUMD Documentation page
To get the files I’ll first create a temporary directory and download
all of the files that I need from the BLS website into this directory.
You might choose to store your files differently, but this convention
will keep the files organized, it will keep the code simple, and
everything will be in a folder that will be easy to clean up after.
Since the BLS blocks third party applications I’ll add a user-agent to
identify myself in the download function that is stored in a variable
called cepumd_ua
(not shown).
ce_data_dir <- tempdir()
download.file(
"https://www.bls.gov/cex/pumd/data/comma/intrvw21.zip",
fs::path(ce_data_dir, "intrvw21.zip"),
mode = "wb",
headers = list(
"User-Agent" = cepumd_ua
)
)
download.file(
"https://www.bls.gov/cex/pumd/data/comma/intrvw20.zip",
fs::path(ce_data_dir, "intrvw20.zip"),
mode = "wb",
headers = list(
"User-Agent" = cepumd_ua
)
)
download.file(
"https://www.bls.gov/cex/pumd/stubs.zip",
fs::path(ce_data_dir, "stubs.zip"),
mode = "wb",
headers = list(
"User-Agent" = cepumd_ua
)
)
download.file(
"https://www.bls.gov/cex/pumd/ce-pumd-interview-diary-dictionary.xlsx",
fs::path(ce_data_dir, "ce-data-dict.xlsx"),
mode = "wb",
headers = list(
"User-Agent" = cepumd_ua
)
)
Calculate CE weighted mean estimate by urbanicity
Now that I have the files I’ll start the analysis. First I’ll load the hierarchical grouping file and filter it for categories of used cars or used trucks.
int21_hg <- ce_hg(
2021,
interview,
hg_zip_path = file.path(ce_data_dir, "stubs.zip")
)
int21_hg |>
filter(str_detect(title, "Used (cars|trucks)")) |>
gt()
level | title | ucc | survey | factor |
---|---|---|---|---|
5 | Used cars | 460110 | I | 1 |
5 | Used trucks | 460901 | I | 1 |
These are the UCCs for “Used cars and trucks. I’ll use the code above
to set the uccs
argument in ce_prepdata()
.
I’ll also include the bls_urbn
variable in the
...
argument to get estimates by urbanicity.
car_data <- ce_prepdata(
2021,
interview,
int21_hg,
uccs = int21_hg |>
filter(str_detect(title, "Used (cars|trucks)"), !is.na(as.numeric(ucc))) |>
pull(ucc),
bls_urbn, # <------- this is the variable for urbanicity
int_zp = c(
file.path(ce_data_dir, "intrvw20.zip"),
file.path(ce_data_dir, "intrvw21.zip")
),
recode_variables = TRUE,
dict_path = file.path(ce_data_dir, "ce-data-dict.xlsx")
)
gt(head(car_data))
newid | finlwt21 | wtrep01 | wtrep02 | wtrep03 | wtrep04 | wtrep05 | wtrep06 | wtrep07 | wtrep08 | wtrep09 | wtrep10 | wtrep11 | wtrep12 | wtrep13 | wtrep14 | wtrep15 | wtrep16 | wtrep17 | wtrep18 | wtrep19 | wtrep20 | wtrep21 | wtrep22 | wtrep23 | wtrep24 | wtrep25 | wtrep26 | wtrep27 | wtrep28 | wtrep29 | wtrep30 | wtrep31 | wtrep32 | wtrep33 | wtrep34 | wtrep35 | wtrep36 | wtrep37 | wtrep38 | wtrep39 | wtrep40 | wtrep41 | wtrep42 | wtrep43 | wtrep44 | bls_urbn | mo_scope | popwt | ucc | ref_yr | ref_mo | cost | survey |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
04378034 | 15009.70 | 26582.38 | 30220.87 | 28559.33 | 0.00 | 28263.97 | 0.00 | 26801.74 | 30392.73 | 0.00 | 0.00 | 35877.18 | 0.00 | 0.00 | 31583.15 | 0.00 | 28192.04 | 0.00 | 0.00 | 27742.65 | 32402.03 | 32193.33 | 0.00 | 33288.04 | 33269.07 | 28259.38 | 31891.03 | 32534.43 | 0.00 | 0.00 | 0.00 | 26770.84 | 0.00 | 24617.21 | 31914.74 | 31437.74 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 33248.90 | 0.00 | 0.00 | 0.00 | Urban | 0 | 0 | NA | NA | NA | 0 | I |
04378044 | 14659.23 | 0.00 | 24636.01 | 25415.50 | 0.00 | 29842.23 | 26240.00 | 31559.69 | 0.00 | 0.00 | 32690.11 | 0.00 | 35219.62 | 0.00 | 0.00 | 32209.17 | 28073.98 | 0.00 | 30630.13 | 28545.87 | 0.00 | 22419.07 | 0.00 | 29752.76 | 35818.35 | 0.00 | 0.00 | 0.00 | 37168.17 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 32609.68 | 28785.20 | 32743.62 | 0.00 | 30564.90 | 0.00 | 0.00 | 0.00 | 33512.77 | 26508.03 | 28093.74 | Urban | 0 | 0 | NA | NA | NA | 0 | I |
04378144 | 13807.65 | 0.00 | 28979.32 | 27083.40 | 26693.86 | 27992.45 | 29585.03 | 0.00 | 29732.86 | 29316.57 | 27431.31 | 0.00 | 0.00 | 28173.63 | 0.00 | 22879.56 | 0.00 | 0.00 | 25564.56 | 26377.60 | 0.00 | 26336.40 | 27843.16 | 0.00 | 26624.50 | 0.00 | 26128.58 | 27506.85 | 0.00 | 0.00 | 0.00 | 27602.16 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 28507.23 | 31304.32 | 25729.67 | 0.00 | 32459.95 | 0.00 | 0.00 | 0.00 | Urban | 0 | 0 | NA | NA | NA | 0 | I |
04378174 | 34366.94 | 0.00 | 56197.54 | 0.00 | 68378.57 | 0.00 | 0.00 | 64913.65 | 51360.36 | 0.00 | 68007.76 | 68969.61 | 0.00 | 61009.71 | 0.00 | 73813.78 | 67483.74 | 0.00 | 0.00 | 0.00 | 61308.65 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 64421.14 | 80364.05 | 63777.95 | 0.00 | 68784.96 | 0.00 | 0.00 | 0.00 | 66196.82 | 69101.02 | 68260.70 | 69157.21 | 59350.62 | 0.00 | 56558.47 | 61551.35 | 63870.17 | 0.00 | 0.00 | Urban | 0 | 0 | NA | NA | NA | 0 | I |
04378194 | 26543.42 | 48347.47 | 0.00 | 52400.57 | 0.00 | 0.00 | 0.00 | 54452.42 | 56357.94 | 0.00 | 49279.82 | 0.00 | 49634.22 | 50323.44 | 0.00 | 0.00 | 51177.03 | 0.00 | 0.00 | 54998.37 | 0.00 | 50254.05 | 0.00 | 0.00 | 49948.46 | 50022.11 | 50772.96 | 0.00 | 48660.00 | 49470.47 | 48256.39 | 56070.84 | 51642.24 | 0.00 | 0.00 | 0.00 | 51286.78 | 0.00 | 49888.79 | 47868.12 | 55679.25 | 0.00 | 0.00 | 0.00 | 0.00 | Urban | 0 | 0 | NA | NA | NA | 0 | I |
04378264 | 15160.89 | 0.00 | 26527.27 | 0.00 | 35462.68 | 28120.91 | 0.00 | 0.00 | 0.00 | 31427.12 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 38128.59 | 34121.49 | 33596.86 | 0.00 | 29419.94 | 0.00 | 0.00 | 0.00 | 37406.35 | 34331.83 | 35157.93 | 32163.44 | 27623.50 | 0.00 | 26467.51 | 30457.72 | 32260.50 | 0.00 | 0.00 | 29017.90 | 0.00 | 34166.82 | 0.00 | 0.00 | 33442.30 | 34997.28 | 0.00 | 26329.75 | 25890.87 | 0.00 | Urban | 0 | 0 | NA | NA | NA | 0 | I |
With the table output of ce_prepdata()
one can now split
up the data by the demographic variable of interest and operate on each
of the subsets independently to calculate estimates. Below I’ll show the
workflow for calculating estimated mean expenditures only for urban
consumers.
agg_exp | mean_exp | se | cv |
---|---|---|---|
310383802180 | 2472.104 | 152.2238 | 6.157661 |
Since I’m intersted in seeing this mean for all categories of
urbanicity, in the next step I’ll nest the car data by urbanicity in
order to be able to operate on subsets of the data independently. For
this I’ll use the nest()
function from the
tidyr
package.
car_data_nested <- nest(car_data, .by = "bls_urbn")
car_data_nested
#> # A tibble: 2 × 2
#> bls_urbn data
#> <fct> <list>
#> 1 Urban <tibble [23,945 × 53]>
#> 2 Rural <tibble [1,600 × 53]>
With the data now grouped and subset by urbanicity I’ll now use
map()
from the purrr
package to apply
ce_mean()
to the subsets of data.
car_data_nested |>
mutate(ce_ann_means = map(data, ce_mean)) |>
select(-data) |>
unnest(ce_ann_means) |>
gt()
bls_urbn | agg_exp | mean_exp | se | cv |
---|---|---|---|---|
Urban | 310383802180 | 2472.104 | 152.2238 | 6.157661 |
Rural | 30912654497 | 3844.600 | 683.8189 | 17.786476 |
Getting the annual, weighted estimate of the median (or another quantile) would be just as easy. Here I’ll calculate the first percentile the median, the 95th percentile, and the 99.1 percentile through 99.9 percentile for the overall sample.
ce_quantiles(
car_data,
probs = c(0.01, 0.5, 0.95, seq(0.99, 0.999, by = 0.001))
) |>
gt()
probs | quantile |
---|---|
1.0% | 0 |
50.0% | 0 |
95.0% | 0 |
99.0% | 20000 |
99.1% | 21694 |
99.2% | 23271 |
99.3% | 25000 |
99.4% | 27005 |
99.5% | 29953 |
99.6% | 32000 |
99.7% | 35000 |
99.8% | 40000 |
99.9% | 47000 |
Calculate CE weighted quantile estimates by urbanicity
Performing the operation by urbanicity would be similar to the way it’s done with nested means. Here I’ll get the same quantiles by urbanicity and pivot the table to compare the categories side-by-side using the nested data.
car_data_nested |>
mutate(
ce_ann_quantiles = map(
data,
\(x) ce_quantiles(
x,
probs = c(0.01, 0.50, 0.95, seq(0.99, 0.999, by = 0.001))
)
)
) |>
select(-data) |>
unnest(ce_ann_quantiles) |>
pivot_wider(values_from = quantile, names_from = bls_urbn) |>
gt()
probs | Urban | Rural |
---|---|---|
1.0% | 0 | 0 |
50.0% | 0 | 0 |
95.0% | 0 | 0 |
99.0% | 20000 | 26100 |
99.1% | 21000 | 28000 |
99.2% | 23000 | 28000 |
99.3% | 25000 | 32000 |
99.4% | 26780 | 33536 |
99.5% | 29000 | 37500 |
99.6% | 31500 | 45000 |
99.7% | 35000 | 45000 |
99.8% | 39000 | 52430 |
99.9% | 45678 | 54639 |
Clean-up
Finally, now that the analysis is done, I’ll delete the temporary directory that contains all the CE data.
unlink(ce_data_dir, recursive = TRUE, force = TRUE)