| Field | Value |
|---|---|
| Source | hospital-admissions |
| Signal | smoothed_covid19_from_claims |
| Label | Smoothed COVID-19 from Claims |
| Geo resolution | state |
| Temporal resolution (reference date) | day |
| Reference date range | 2023-01-01 to 2025-08-15 |
| Report date range | 2023-01-05 to 2026-06-05 |
| Minimum revision history | 365 days |
| Number of locations | 51 |
| Number of reports | 11,991,605 |
Revision Analysis: Smoothed COVID-19 from Claims
This notebook analyses the revision behaviour of a single time series stored as an epi_archive. The central question is:
At what lag does a series become reliable enough to use?
Results at a glance
The time-series plot below compares the Report value available at selected lags using all reference dates in the analysis period.
For the diagnostics below, reference dates after 2025-06-15 are excluded because they have fewer than 365 days of possible revision history before the render date (2026-06-15). This helps the latest-report comparison reflect a more stable revision pattern.
The next diagnostic summarizes revision size using Relative Error. For each current report, Relative Error compares the Report value with the latest Report value available for the same location and reference date:
\[ \text{Relative Error} = \frac{|r_c - r_l|}{|r_l|} \]
Here r_c is the Report value in the current report and r_l is the Report value in the latest report for the same location and reference date.
A Relative Error of 0% means the current report matches the latest report; a Relative Error of 10% means it is still 10% away from the latest report.
This heatmap shows the Relative Error surface over daily reference date and lag. Each cell summarizes the mean Relative Error across all states for reports with the same reference date and lag. Epi seasons are shown one page at a time, with the most recent epi season first. White cells mean that no report was available for that reference-date and lag combination, so they should be read as missing data rather than zero Relative Error.
1 Is the first published Report value biased?
Here we ask whether the first published Report value tends to be below or above the latest available Report value.
For each location-reference-date pair we compute Initial Bias:
\[\text{Initial Bias} = \frac{r_f - r_l}{r_l}\]
Here r_f is the Report value in the first report and r_l is the Report value in the latest report for the same location and reference date.
A negative Initial Bias means the first published Report value was below the latest Report value (under-reporting); a positive value means it was above the latest Report value (over-reporting).
The boxplot below shows the distribution of Initial Bias across reference dates for each location. The box spans the 25th-75th percentile range, the middle line is the median, and the whiskers use the standard 1.5 IQR rule. The color is based on each location’s median Initial Bias direction, so individual pairs inside a colored box can still fall on either side of zero.
Code
bias_df <- pair_summary %>%
transmute(
geo_value,
time_value,
value_first = first_report_value,
value_latest = latest_report_value,
initial_bias
)
bias_by_loc <- bias_df %>%
group_by(geo_value) %>%
summarize(
median_bias = median(initial_bias, na.rm = TRUE),
pct_under = mean(initial_bias < 0, na.rm = TRUE),
pct_over = mean(initial_bias > 0, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(median_bias)Code
bias_df %>%
left_join(bias_by_loc %>% select(geo_value, median_bias), by = "geo_value") %>%
mutate(
geo_value = factor(geo_value, levels = bias_by_loc$geo_value),
direction = case_when(
median_bias < 0 ~ "Under-reporting",
median_bias > 0 ~ "Over-reporting",
TRUE ~ "No Initial Bias"
)
) %>%
ggplot(aes(x = geo_value, y = initial_bias, fill = direction)) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
geom_hline(yintercept = 0, linewidth = 0.4, color = "gray40") +
scale_x_discrete(labels = toupper) +
scale_y_continuous(labels = label_percent(), breaks = breaks_pretty()) +
coord_cartesian(ylim = c(
min(0, quantile(bias_df$initial_bias, 0.025, na.rm = TRUE)),
max(0, quantile(bias_df$initial_bias, 0.975, na.rm = TRUE))
)) +
labs(
title = "Initial Bias distribution by location",
subtitle = "Distribution of Initial Bias by location; color indicates median direction.",
x = NULL, y = "Initial Bias", fill = NULL
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 11, face = "bold"))The table below summarizes the median Initial Bias by location, along with the share of location-reference-date pairs that were initially below or above the latest Report value.
2 How soon can we trust the signal?
This section asks which lag is needed before Report values are close enough to their latest Report value. The thresholds use Relative Error, defined in Results at a glance, and describe persistent stability: once a location-reference-date pair enters the threshold, it must stay there.
Code
thresholds <- c(0.5, 0.2, 0.1, 0.05, 0.01, 0.005)
threshold_label <- function(x) {
ifelse(x < 0.01, label_percent(accuracy = 0.1)(x), label_percent(accuracy = 1)(x))
}
stability_multi <- convergence_df %>%
arrange(geo_value, time_value, publication_lag) %>%
group_by(geo_value, time_value) %>%
group_modify(function(df, key) {
map_dfr(thresholds, function(tau) {
over <- which(df$abs_rel_diff > tau)
lag_stable <- if (length(over) == 0) {
min(df$publication_lag, na.rm = TRUE)
} else {
df$publication_lag[min(max(over) + 1, nrow(df))]
}
tibble(
lag_stable = lag_stable,
threshold = threshold_label(tau),
threshold_num = tau
)
})
}) %>%
ungroup()
rev_window_df <- rev_beh %>%
filter(n_revisions > 0) %>%
transmute(
geo_value, time_value,
lag_stable = as_days(max_lag),
threshold = "0%",
threshold_num = 0
)
stability_combined <- bind_rows(stability_multi, rev_window_df) %>%
mutate(threshold = factor(threshold, levels = threshold_levels))
invisible(gc())2.1 Cumulative stabilization by Relative Error threshold
The curve answers: as lag increases, what fraction of location-reference-date pairs have settled within the specified Relative Error threshold and never left it? Higher thresholds are easier to satisfy, so the curves and stacked bands are nested: the 50% threshold should mature earlier than stricter thresholds such as 5%, 0.5%, or 0%.
Code
max_ecdf_day <- max(stability_combined$lag_stable, na.rm = TRUE)
stability_ecdf <- stability_combined %>%
mutate(
threshold = factor(threshold, levels = threshold_levels)
) %>%
group_by(threshold) %>%
group_modify(function(df, key) {
days <- seq(0, max_ecdf_day)
tibble(
lag_stable = days,
cumulative_fraction = map_dbl(days, ~ mean(df$lag_stable <= .x, na.rm = TRUE))
)
}) %>%
ungroup()
stability_wide <- stability_ecdf %>%
pivot_wider(names_from = threshold, values_from = cumulative_fraction)
stability_bands <- bind_rows(
stability_wide %>% transmute(lag_stable, band = "50% to 100%", ymin = `50%`, ymax = 1),
stability_wide %>% transmute(lag_stable, band = "50% to 20%", ymin = `20%`, ymax = `50%`),
stability_wide %>% transmute(lag_stable, band = "20% to 10%", ymin = `10%`, ymax = `20%`),
stability_wide %>% transmute(lag_stable, band = "10% to 5%", ymin = `5%`, ymax = `10%`),
stability_wide %>% transmute(lag_stable, band = "5% to 1%", ymin = `1%`, ymax = `5%`),
stability_wide %>% transmute(lag_stable, band = "1% to 0.5%", ymin = `0.5%`, ymax = `1%`),
stability_wide %>% transmute(lag_stable, band = "Within 0.5%", ymin = 0, ymax = `0.5%`)
) %>%
mutate(
lower = pmin(ymin, ymax, na.rm = TRUE),
upper = pmax(ymin, ymax, na.rm = TRUE),
ymin = lower,
ymax = upper,
band = factor(
band,
levels = c(
"Within 0.5%", "1% to 0.5%", "5% to 1%",
"10% to 5%", "20% to 10%", "50% to 20%",
"50% to 100%"
)
)
) %>%
select(-lower, -upper)
final_revision_ecdf <- stability_ecdf %>%
filter(threshold == "0%")
max_day <- max(stability_ecdf$lag_stable, na.rm = TRUE)
right_start <- if (max_day > tail_start) tail_start else focus_days
stability_plot <- function(x_limits, show_y = TRUE) {
ggplot() +
geom_ribbon(
data = stability_bands,
aes(x = lag_stable, ymin = ymin, ymax = ymax, fill = band),
alpha = 0.52,
show.legend = FALSE
) +
geom_ribbon(
data = final_revision_ecdf,
aes(x = lag_stable, ymin = 0, ymax = cumulative_fraction),
fill = final_revision_fill,
alpha = 0.45,
show.legend = FALSE
) +
geom_line(
data = stability_ecdf,
aes(x = lag_stable, y = cumulative_fraction, color = threshold),
linewidth = 0.9
) +
geom_hline(yintercept = 1, color = "black", linewidth = 0.7) +
scale_x_continuous(limits = x_limits, breaks = focus_day_breaks(x_limits), expand = expansion(mult = 0)) +
scale_y_continuous(labels = label_percent(), limits = c(0, 1), expand = expansion(mult = 0)) +
scale_color_manual(values = threshold_cols) +
scale_fill_manual(values = band_cols) +
guides(color = guide_legend(byrow = TRUE)) +
labs(
x = "Lag (days)",
y = if (show_y) "Cumulative fraction of pairs" else NULL,
color = "Threshold"
) +
theme(
axis.title.y = if (show_y) element_text() else element_blank(),
axis.text.y = if (show_y) element_text() else element_blank(),
axis.ticks.y = if (show_y) element_line() else element_blank()
)
}
make_cumulative_stabilization_plot <- function(data, title = NULL, show_legend = FALSE) {
data <- data %>%
filter(!is.na(lag_stable), !is.na(threshold)) %>%
mutate(threshold = factor(threshold, levels = threshold_levels))
if (nrow(data) == 0) {
return(ggplot() + labs(title = title) + theme_void())
}
cumulative_data <- data %>%
group_by(threshold) %>%
group_modify(function(df, key) {
tibble(
lag_stable = seq(0, focus_days),
cumulative_fraction = map_dbl(lag_stable, ~ mean(df$lag_stable <= .x, na.rm = TRUE))
)
}) %>%
ungroup()
cumulative_wide <- cumulative_data %>%
pivot_wider(names_from = threshold, values_from = cumulative_fraction)
cumulative_bands <- bind_rows(
cumulative_wide %>% transmute(lag_stable, band = "50% to 100%", ymin = `50%`, ymax = 1),
cumulative_wide %>% transmute(lag_stable, band = "50% to 20%", ymin = `20%`, ymax = `50%`),
cumulative_wide %>% transmute(lag_stable, band = "20% to 10%", ymin = `10%`, ymax = `20%`),
cumulative_wide %>% transmute(lag_stable, band = "10% to 5%", ymin = `5%`, ymax = `10%`),
cumulative_wide %>% transmute(lag_stable, band = "5% to 1%", ymin = `1%`, ymax = `5%`),
cumulative_wide %>% transmute(lag_stable, band = "1% to 0.5%", ymin = `0.5%`, ymax = `1%`),
cumulative_wide %>% transmute(lag_stable, band = "Within 0.5%", ymin = 0, ymax = `0.5%`)
) %>%
mutate(
lower = pmin(ymin, ymax, na.rm = TRUE),
upper = pmax(ymin, ymax, na.rm = TRUE),
ymin = lower,
ymax = upper,
band = factor(
band,
levels = c(
"Within 0.5%", "1% to 0.5%", "5% to 1%",
"10% to 5%", "20% to 10%", "50% to 20%",
"50% to 100%"
)
)
) %>%
select(-lower, -upper)
cumulative_final <- cumulative_data %>%
filter(threshold == "0%")
ggplot() +
geom_ribbon(
data = cumulative_bands,
aes(x = lag_stable, ymin = ymin, ymax = ymax, fill = band),
alpha = 0.52,
show.legend = FALSE
) +
geom_ribbon(
data = cumulative_final,
aes(x = lag_stable, ymin = 0, ymax = cumulative_fraction),
fill = final_revision_fill,
alpha = 0.45,
show.legend = FALSE
) +
geom_line(
data = cumulative_data,
aes(x = lag_stable, y = cumulative_fraction, color = threshold),
linewidth = 0.8
) +
geom_hline(yintercept = 1, color = "black", linewidth = 0.5) +
scale_x_continuous(limits = c(0, focus_days), breaks = focus_day_breaks(c(0, focus_days)), expand = expansion(mult = 0)) +
scale_y_continuous(labels = label_percent(), limits = c(0, 1), expand = expansion(mult = 0)) +
scale_color_manual(values = threshold_cols) +
scale_fill_manual(values = band_cols) +
guides(color = if (show_legend) guide_legend(byrow = TRUE) else "none") +
labs(
title = title,
x = "Lag (days)",
y = "Cumulative fraction",
color = "Threshold"
) +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 11),
legend.title = element_text(size = 11, face = "bold"),
legend.text = element_text(size = 11)
)
}
left_plot <- stability_plot(c(0, min(focus_days, max_day)), show_y = TRUE)
right_plot <- stability_plot(c(right_start, max_day), show_y = FALSE)
(left_plot + right_plot + plot_layout(widths = c(3, 1), guides = "collect")) +
plot_annotation(
title = "Cumulative stabilization by Relative Error threshold",
subtitle = sprintf(
"Axis break after %d days; right panel shows the long tail from day %d onward.",
focus_days, right_start
)
)The table below gives the key readouts from the cumulative curves. For each threshold, the columns answer how many days are needed before a given fraction of location-reference-date pairs have reached persistent stability.
2.2 Relative Error by lag
This view summarizes Relative Error at each lag. The line is the median across reports, and the band shows the 25th-75th percentile range. The x-axis is split: the left panel shows early lags, and the right panel shows the long tail. Days between the two panels are intentionally omitted and should not be read as continuous spacing.
Code
overall_conv <- convergence_df %>%
group_by(publication_lag) %>%
summarize(
q25_rel_diff = quantile(abs_rel_diff, 0.25, na.rm = TRUE),
median_rel_diff = median(abs_rel_diff, na.rm = TRUE),
q75_rel_diff = quantile(abs_rel_diff, 0.75, na.rm = TRUE),
n_reports = n(),
.groups = "drop"
) %>%
filter(n_reports > 5)
convergence_focus_days <- focus_days
max_conv_day <- max(overall_conv$publication_lag, na.rm = TRUE)
right_conv_start <- if (max_conv_day > tail_start) tail_start else convergence_focus_days
revision_distance_plot <- function(x_limits, show_y = TRUE) {
ggplot(overall_conv, aes(x = publication_lag)) +
geom_ribbon(
aes(ymin = q25_rel_diff, ymax = q75_rel_diff),
fill = dis_pal[1],
alpha = 0.22
) +
geom_line(aes(y = median_rel_diff), color = dis_pal[1], linewidth = 1) +
scale_x_continuous(
limits = x_limits,
breaks = focus_day_breaks(x_limits, convergence_focus_days),
expand = expansion(mult = 0)
) +
scale_y_continuous(labels = label_percent(), breaks = breaks_pretty()) +
labs(
x = "Lag (days)",
y = if (show_y) "Relative Error" else NULL
) +
theme(
axis.title.y = if (show_y) element_text() else element_blank(),
axis.text.y = if (show_y) element_text() else element_blank(),
axis.ticks.y = if (show_y) element_line() else element_blank()
)
}
revision_distance_left <- revision_distance_plot(c(0, min(convergence_focus_days, max_conv_day)), show_y = TRUE)
revision_distance_right <- revision_distance_plot(c(right_conv_start, max_conv_day), show_y = FALSE)
(revision_distance_left + revision_distance_right + plot_layout(widths = c(3, 1))) +
plot_annotation(
title = "Relative Error by lag",
subtitle = sprintf(
"Axis break: days %d-%d are omitted. Line: median; band: 25th-75th percentile range.",
convergence_focus_days, right_conv_start
)
)2.3 Stabilization time distribution by Relative Error threshold
This plot shows the shape of the wait-time distribution behind the cumulative curves above. Bars show the empirical distribution within each threshold; the smooth line is a density guide on the same approximate proportion scale. This is the overall stabilization-time distribution. Later sections reuse the same stabilization-time quantity, first split by location and then by reference date.
Code
stability_distribution <- stability_combined %>%
filter(!is.na(lag_stable)) %>%
mutate(threshold = factor(threshold, levels = threshold_levels))
distribution_bin_width <- 14
distribution_max_day <- ceiling(max(stability_distribution$lag_stable, na.rm = TRUE) / distribution_bin_width) *
distribution_bin_width
distribution_bin_mids <- seq(
distribution_bin_width / 2,
distribution_max_day + distribution_bin_width / 2,
by = distribution_bin_width
)
stability_distribution_bars <- stability_distribution %>%
mutate(
bin_start = floor(lag_stable / distribution_bin_width) * distribution_bin_width,
bin_mid = bin_start + distribution_bin_width / 2
) %>%
count(threshold, bin_mid, name = "n") %>%
complete(threshold, bin_mid = distribution_bin_mids, fill = list(n = 0)) %>%
left_join(count(stability_distribution, threshold, name = "threshold_total"), by = "threshold") %>%
mutate(proportion = n / threshold_total)
stability_distribution_density <- stability_distribution %>%
group_by(threshold) %>%
group_modify(function(df, key) {
if (n_distinct(df$lag_stable) < 2) {
return(tibble(x = numeric(), y = numeric()))
}
dens <- density(
df$lag_stable,
from = 0,
to = distribution_max_day,
adjust = 1.1,
n = 512
)
tibble(
x = dens$x,
y = dens$y * distribution_bin_width
)
}) %>%
ungroup()
ggplot() +
geom_col(
data = stability_distribution_bars,
aes(x = bin_mid, y = proportion, fill = threshold),
width = distribution_bin_width * 0.95,
color = "white",
linewidth = 0.15,
alpha = 0.72
) +
geom_line(
data = stability_distribution_density,
aes(x = x, y = y, color = threshold),
linewidth = 0.9,
na.rm = TRUE
) +
facet_wrap(~threshold, ncol = 1, scales = "free_y") +
scale_x_continuous(breaks = days_since_breaks(c(0, distribution_max_day)), expand = expansion(mult = 0)) +
scale_y_continuous(labels = label_percent(), breaks = breaks_pretty()) +
scale_fill_manual(values = threshold_cols, guide = "none") +
scale_color_manual(values = threshold_cols) +
guides(color = "none") +
labs(
title = "Stabilization time distribution by Relative Error threshold",
subtitle = sprintf(
"Bars show binned proportions; smoothed lines show density scaled to the same %d-day bin width.",
distribution_bin_width
),
x = "Lag at stabilization (days)",
y = "Proportion of pairs"
)3 Is revision behavior different by location?
Revision behavior can vary by geography. This section first compares the total amount of revision activity and lag by location, then compares how long each location takes to reach the stability thresholds used above. Locations are grouped by HHS region (U.S. Department of Health and Human Services region).
Code
hhs_geo_order <- hhs_region_map %>%
mutate(
hhs_region = factor(hhs_region, levels = paste("HHS", 1:10)),
geo_value = tolower(geo_value)
) %>%
filter(geo_value %in% unique(tolower(stability_combined$geo_value))) %>%
arrange(hhs_region, geo_value) %>%
mutate(
x_pos = row_number(),
geo_label = toupper(geo_value),
hhs_label = str_replace(as.character(hhs_region), "HHS ", "HHS\nRegion ")
)
unmapped_geos <- setdiff(unique(tolower(stability_combined$geo_value)), hhs_geo_order$geo_value)
if (length(unmapped_geos) > 0) {
hhs_geo_order <- bind_rows(
hhs_geo_order,
tibble(
geo_value = sort(unmapped_geos),
hhs_region = factor("Other", levels = c(paste("HHS", 1:10), "Other")),
x_pos = seq(nrow(hhs_geo_order) + 1, nrow(hhs_geo_order) + length(unmapped_geos)),
geo_label = toupper(sort(unmapped_geos)),
hhs_label = "Other"
)
)
}
geo_sorting <- hhs_geo_order$geo_value
hhs_region_bounds <- hhs_geo_order %>%
group_by(hhs_region, hhs_label) %>%
summarize(
xmin = min(x_pos),
xmax = max(x_pos),
xmid = mean(x_pos),
.groups = "drop"
)
hhs_bracket_plot <- function() {
ggplot(hhs_region_bounds) +
geom_segment(
aes(x = xmin, xend = xmax, y = 0.72, yend = 0.72),
color = "gray45",
linewidth = 0.55
) +
geom_segment(
aes(x = xmin, xend = xmin, y = 0.72, yend = 0.95),
color = "gray45",
linewidth = 0.55
) +
geom_segment(
aes(x = xmax, xend = xmax, y = 0.72, yend = 0.95),
color = "gray45",
linewidth = 0.55
) +
geom_text(
aes(x = xmid, y = 0.22, label = hhs_label),
size = 3.5,
fontface = "bold",
color = "gray20"
) +
scale_x_continuous(
limits = range(hhs_geo_order$x_pos) + c(-0.5, 0.5),
expand = expansion(mult = c(0.01, 0.01))
) +
coord_cartesian(ylim = c(0, 1), clip = "off") +
labs(x = "Location ordered by HHS region") +
theme_void() +
theme(
axis.title.x = element_text(size = 13, face = "bold", margin = margin(t = 14)),
plot.margin = margin(-4, 5.5, 16, 5.5)
)
}
rev_by_loc <- rev_beh %>%
mutate(geo_value = tolower(geo_value)) %>%
group_by(geo_value) %>%
summarize(
n_pairs = n(),
total_revisions = sum(n_revisions, na.rm = TRUE),
avg_revisions = mean(n_revisions, na.rm = TRUE),
median_window_days = median(as_days(max_lag), na.rm = TRUE),
p90_window_days = quantile(as_days(max_lag), 0.9, na.rm = TRUE),
avg_rel_spread = mean(rel_spread, na.rm = TRUE),
median_conv_days = median(as_days(lag_near_latest), na.rm = TRUE),
.groups = "drop"
)3.1 Cumulative stabilization by Relative Error threshold by HHS region
For each state, the curve shows the fraction of location-reference-date pairs that have settled within each Relative Error threshold by a given lag and never left it. The view is paged by HHS region (U.S. Department of Health and Human Services region). Each small panel is one state. Lines and stacked bands use the same nested threshold colors as Section 2.1; compare panels to see whether states in the same HHS region stabilize at similar lags.
3.2 Relative Error by HHS region
This view shows how Relative Error changes with lag within each HHS region. Each panel is one HHS region, each line is a state median, and each shaded band is the 25th-75th percentile range for that state.
Code
region_conv <- convergence_df %>%
mutate(geo_value = tolower(geo_value)) %>%
inner_join(hhs_region_map, by = "geo_value") %>%
group_by(hhs_region, geo_value, publication_lag) %>%
summarize(
q25_rel_diff = quantile(abs_rel_diff, 0.25, na.rm = TRUE),
median_rel_diff = median(abs_rel_diff, na.rm = TRUE),
q75_rel_diff = quantile(abs_rel_diff, 0.75, na.rm = TRUE),
n_reports = n(),
.groups = "drop"
) %>%
filter(n_reports > 1) %>%
filter(publication_lag <= convergence_focus_days) %>%
mutate(
hhs_region = factor(hhs_region, levels = paste("HHS", 1:10)),
geo_value = factor(geo_value, levels = sort(unique(geo_value)))
)
plot_hhs_region <- function(region_name) {
region_data <- region_conv %>%
filter(hhs_region == region_name) %>%
mutate(geo_value = droplevels(geo_value))
state_levels <- levels(region_data$geo_value)
state_cols <- setNames(hue_pal()(length(state_levels)), state_levels)
state_labels <- setNames(toupper(state_levels), state_levels)
ggplot(region_data, aes(x = publication_lag, group = geo_value)) +
geom_ribbon(
aes(ymin = q25_rel_diff, ymax = q75_rel_diff, fill = geo_value),
alpha = 0.12,
color = NA
) +
geom_line(aes(y = median_rel_diff, color = geo_value), linewidth = 0.75, alpha = 0.95) +
scale_x_continuous(
limits = c(0, convergence_focus_days),
breaks = focus_day_breaks(c(0, convergence_focus_days), convergence_focus_days),
expand = expansion(mult = 0)
) +
scale_y_continuous(labels = label_percent(), breaks = breaks_pretty()) +
scale_color_manual(values = state_cols, labels = state_labels) +
scale_fill_manual(values = state_cols, guide = "none") +
labs(
title = region_name,
x = "Lag (days)",
y = "Relative Error",
color = NULL
) +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 11),
legend.position = c(0.98, 0.98),
legend.justification = c(1, 1),
legend.background = element_rect(fill = scales::alpha("white", 0.78), color = "gray80"),
legend.text = element_text(size = 11),
legend.key.height = unit(0.8, "lines"),
legend.key.width = unit(1.0, "lines")
) +
guides(color = guide_legend(nrow = 2, byrow = TRUE))
}
region_plots <- lapply(paste("HHS", 1:10), plot_hhs_region)
wrap_plots(region_plots, ncol = 2) +
plot_annotation(
title = "Relative Error by HHS region",
subtitle = sprintf(
"Relative Error across lags 0-%d days; lines show state-level medians and shaded bands show 25th-75th percentile ranges.",
convergence_focus_days
)
)3.3 Stabilization time distribution by location
This is the location-level version of the stabilization-time distribution from Section 2.2. Each point is the median time to reach a threshold for one location; each vertical line is the 25th-75th percentile range. The main reading is across states: compare whether some locations consistently need longer lags to reach the same threshold. The 0% threshold means the Report value has matched the latest Report value, while the other thresholds allow nonzero Relative Error.
Code
location_maturation_df <- stability_combined %>%
filter(!is.na(lag_stable), !is.na(threshold)) %>%
mutate(geo_value = tolower(geo_value)) %>%
inner_join(hhs_geo_order, by = "geo_value") %>%
mutate(
geo_value = factor(geo_value, levels = geo_sorting)
) %>%
group_by(geo_value, x_pos, geo_label, threshold) %>%
summarize(
median_lag = median(lag_stable, na.rm = TRUE),
q25_lag = quantile(lag_stable, 0.25, na.rm = TRUE),
q75_lag = quantile(lag_stable, 0.75, na.rm = TRUE),
.groups = "drop"
)
location_maturation_plot <- ggplot(
location_maturation_df,
aes(x = x_pos, y = median_lag, ymin = q25_lag, ymax = q75_lag, color = threshold)
) +
geom_pointrange(position = position_dodge(width = 0.7), linewidth = 0.45, size = 0.9) +
scale_x_continuous(
breaks = hhs_geo_order$x_pos,
labels = hhs_geo_order$geo_label,
expand = expansion(mult = c(0.01, 0.01))
) +
scale_y_continuous(
limits = c(0, NA),
breaks = days_since_breaks(c(0, max(location_maturation_df$q75_lag, na.rm = TRUE)))
) +
scale_color_manual(values = threshold_cols) +
guides(color = guide_legend(byrow = TRUE)) +
labs(
title = "Stabilization time distribution by location",
subtitle = "Lag needed to reach each stability threshold (dots = median, lines = IQR)",
x = NULL, y = "Lag at stabilization (days)",
color = "Threshold"
) +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 11, face = "bold")
)
location_maturation_plot / hhs_bracket_plot() + plot_layout(heights = c(12, 2.1))4 Are there patterns over reference date?
This section asks whether revision behavior changes across reference dates. It first summarizes the amount of revision activity over time, then repeats the same two-part reading pattern used above: cumulative stabilization first, then the Relative Error curve. The final plot turns back to stabilization time, matching the quantity used in Sections 2.2 and 3.3.
Rather than using calendar years, these views use epi seasons (cross-year periods used to align respiratory-virus timing). CDC respiratory virus surveillance is commonly organized by MMWR weeks and named as cross-year seasons. In this report, each season follows a flu-surveillance-style MMWR season: MMWR week 40 through MMWR week 39 of the following year. Seasons are labeled by the two calendar years they span (for example, 2022-2023).
4.1 Cumulative stabilization by Relative Error threshold and epi season
For each epi season, the curve shows the fraction of location-reference-date pairs that have settled within each Relative Error threshold by a given lag and never left it. Higher thresholds are easier to satisfy, so the nested color pattern should be interpreted the same way as in the overall cumulative stabilization plot.
Code
season_stabilization_plots <- map(season_levels, function(season_value) {
make_cumulative_stabilization_plot(
stability_season_df %>% filter(epi_season == season_value),
title = season_value,
show_legend = TRUE
)
})
season_stabilization_panels <- wrap_plots(season_stabilization_plots, ncol = 2)
(guide_area() / season_stabilization_panels) +
plot_layout(heights = c(0.22, 1), guides = "collect") +
plot_annotation(
title = "Cumulative stabilization by Relative Error threshold and epi season",
subtitle = sprintf(
"Shown separately for each MMWR week 40-39 epi season. Lags 0-%d days.",
focus_days
)
)4.2 Relative Error by epi season
This view shows how Relative Error changes with lag across epi seasons. Lines are medians, and shaded bands show the 25th-75th percentile range within each season.
Code
season_conv <- convergence_season_df %>%
group_by(epi_season, publication_lag) %>%
summarize(
q25_rel_diff = quantile(abs_rel_diff, 0.25, na.rm = TRUE),
median_rel_diff = median(abs_rel_diff, na.rm = TRUE),
q75_rel_diff = quantile(abs_rel_diff, 0.75, na.rm = TRUE),
n_reports = n(),
.groups = "drop"
) %>%
filter(n_reports > 5) %>%
filter(publication_lag <= convergence_focus_days)
ggplot(season_conv, aes(x = publication_lag, color = epi_season, fill = epi_season)) +
geom_ribbon(
aes(ymin = q25_rel_diff, ymax = q75_rel_diff),
alpha = 0.14,
color = NA
) +
geom_line(aes(y = median_rel_diff), linewidth = 0.9) +
scale_x_continuous(
limits = c(0, convergence_focus_days),
breaks = focus_day_breaks(c(0, convergence_focus_days), convergence_focus_days),
expand = expansion(mult = 0)
) +
scale_y_continuous(labels = label_percent(), breaks = breaks_pretty()) +
labs(
title = "Relative Error by epi season",
subtitle = sprintf(
"Seasons run MMWR week 40-39; lags 0-%d days.\nLines show medians and shaded bands show 25th-75th percentile ranges.",
convergence_focus_days
),
x = "Lag (days)",
y = "Relative Error",
color = "Epi season",
fill = "Epi season"
)4.3 Stabilization time distribution over reference date
This is the time-level version of the stabilization-time distribution from Sections 2.2 and 3.3. Instead of showing the full distribution or splitting by location, it asks whether the time to reach each threshold changes across reference-date periods. Lines show the median stabilization time, and shaded bands show the 25th-75th percentile range. Recent reference dates may be censored because there has not yet been enough time to observe their full revision history. The censoring boundary is not an epidemiological cutoff; it is the observable maximum lag implied by the available report dates. Apparent jumps or short discontinuities near that boundary are data-support effects rather than an additional axis break.
Code
max_v <- max(diagnostic_archive$DT$version)
date_range_days <- as.numeric(max(stability_combined$time_value) - min(stability_combined$time_value))
agg_unit <- case_when(
date_range_days > 365 * 2 ~ "quarter",
date_range_days > 365 ~ "month",
date_range_days > 100 ~ "2-week",
date_range_days > 40 ~ "week",
date_range_days > 20 ~ "3-day",
TRUE ~ "day"
)
floor_time_bucket <- function(date, unit) {
date <- as.Date(date)
if (unit == "2-week") {
origin <- min(date, na.rm = TRUE)
return(origin + 14 * floor(as.numeric(date - origin) / 14))
}
if (unit == "3-day") {
origin <- min(date, na.rm = TRUE)
return(origin + 3 * floor(as.numeric(date - origin) / 3))
}
floor_date(date, unit = unit)
}
stability_time_summary <- stability_combined %>%
filter(!is.na(lag_stable), !is.na(threshold)) %>%
mutate(
threshold = factor(threshold, levels = threshold_levels),
time_bucket = floor_time_bucket(time_value, agg_unit)
) %>%
group_by(threshold, time_bucket) %>%
summarize(
median_lag = median(lag_stable, na.rm = TRUE),
q25_lag = quantile(lag_stable, 0.25, na.rm = TRUE),
q75_lag = quantile(lag_stable, 0.75, na.rm = TRUE),
.groups = "drop"
)
ggplot(stability_time_summary, aes(x = time_bucket, color = threshold, fill = threshold)) +
geom_ribbon(
aes(ymin = q25_lag, ymax = q75_lag),
alpha = 0.22,
color = NA
) +
geom_line(aes(y = median_lag), linewidth = 0.85) +
geom_function(
fun = function(x) as.numeric(max_v - as.Date(x, origin = "1970-01-01")),
linetype = "dashed", color = "gray50", linewidth = 0.5
) +
annotate(
"text",
x = min(stability_time_summary$time_bucket),
y = as_days(max_v - min(stability_combined$time_value)),
label = "Observable maximum lag", color = "gray50",
vjust = 1, hjust = -1, size = 3
) +
scale_x_date(breaks = breaks_pretty()) +
scale_y_continuous(limits = c(0, NA), breaks = breaks_pretty()) +
scale_color_manual(values = threshold_cols) +
scale_fill_manual(values = threshold_cols) +
labs(
title = "Stabilization time distribution over reference date",
subtitle = sprintf(
"Reference dates grouped by %s. Lines show medians; shaded bands show the 25th-75th percentile range.",
agg_unit
),
x = "Reference date",
y = "Lag at stabilization (days)",
color = "Threshold",
fill = "Threshold"
)5 Are revision patterns related to other signal features?
After checking first-release behavior, maturation time, geography, and epi season, this section asks whether other observable features of the signal are related to revision behavior. These views are exploratory: they help identify where revision patterns may be driven by signal magnitude, initial lag (the lag of the first report observed for a location-reference-date pair), or finer reference-date structure.
5.1 Latest Report value vs. Relative Error spread
This plot compares signal magnitude with the range of Relative Error observed over the report history for the same location and reference date.
- Point: one location-reference-date pair.
- X-axis: the latest Report value for that pair.
- Y-axis: Relative Error spread, defined as maximum Relative Error minus minimum Relative Error across reports.
- Interpretation: a larger y-value means the pair moved through a wider range of distances from its latest Report value during the revision process.
Code
relative_error_spread_df <- convergence_df %>%
group_by(geo_value, time_value) %>%
summarize(
relative_error_spread = max(abs_rel_diff, na.rm = TRUE) - min(abs_rel_diff, na.rm = TRUE),
.groups = "drop"
) %>%
inner_join(
pair_summary %>% select(geo_value, time_value, latest_report_value),
by = c("geo_value", "time_value")
) %>%
filter(
latest_report_value > 0,
relative_error_spread > 0,
is.finite(latest_report_value),
is.finite(relative_error_spread)
)
relative_error_spread_df %>%
ggplot(aes(x = latest_report_value, y = relative_error_spread)) +
geom_point(alpha = 0.2, size = 1) +
geom_smooth() +
scale_x_log10(labels = label_comma()) +
scale_y_log10(labels = label_percent()) +
labs(
title = "Latest Report value vs. Relative Error spread",
x = "Latest Report value (log scale)",
y = "Relative Error spread (log scale)"
)5.2 Relative Error by initial lag
Early publication can be useful, but it may also leave more room for later revision. This view groups location-reference-date pairs by initial lag (the lag of the first report observed for each location-reference-date pair), then compares how Relative Error changes across lags.
Code
initial_lag_df <- convergence_df %>%
group_by(geo_value, time_value) %>%
summarize(
initial_publication_lag = min(publication_lag, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
initial_lag_group = cut(
initial_publication_lag,
breaks = c(-Inf, 7, 14, 30, Inf),
labels = c("<=7 days", "8-14 days", "15-30 days", ">30 days"),
right = TRUE
)
)
initial_lag_revision_df <- convergence_df %>%
inner_join(
initial_lag_df %>% select(geo_value, time_value, initial_lag_group),
by = c("geo_value", "time_value")
) %>%
filter(!is.na(initial_lag_group), publication_lag <= convergence_focus_days) %>%
group_by(initial_lag_group, publication_lag) %>%
summarize(
q25_rel_diff = quantile(abs_rel_diff, 0.25, na.rm = TRUE),
median_rel_diff = median(abs_rel_diff, na.rm = TRUE),
q75_rel_diff = quantile(abs_rel_diff, 0.75, na.rm = TRUE),
n_reports = n(),
.groups = "drop"
) %>%
filter(n_reports > 5)
ggplot(initial_lag_revision_df, aes(x = publication_lag, color = initial_lag_group, fill = initial_lag_group)) +
geom_ribbon(aes(ymin = q25_rel_diff, ymax = q75_rel_diff), alpha = 0.14, color = NA) +
geom_line(aes(y = median_rel_diff), linewidth = 0.95) +
scale_x_continuous(
limits = c(0, convergence_focus_days),
breaks = focus_day_breaks(c(0, convergence_focus_days), convergence_focus_days),
expand = expansion(mult = 0)
) +
scale_y_continuous(labels = label_percent(), breaks = breaks_pretty()) +
labs(
title = "Relative Error by initial lag",
subtitle = "Groups are based on the first observed lag for each reference date.",
x = "Lag (days)",
y = "Relative Error",
color = "Initial lag",
fill = "Initial lag"
)6 Additional revision diagnostics
These diagnostics are useful for checking data support and broad revision patterns, but they are secondary to the main interpretation sections above.
6.1 Overall revision diagnostics
This section collects secondary diagnostics for data support and revision activity. These quantities are useful checks, but the main interpretation should come from the Relative Error, Initial Bias, and stabilization results above.
- Time range: 2023-01-01 to 2025-06-15
- Min lag (days to first report): Min.: 4, 1st Qu.: 4, Median: 4, Mean: 5, 3rd Qu.: 4, Max.: 547
- No revisions: 0%
- Quick revisions (≤ 7 days): 0%
- Few revisions (≤ 3): 0%
- Minor Relative Error spread (< 0.1): 1%
- Major absolute revisions (> 0.453): 33%
- Days until within 20% of latest Report value: Min.: 4, 1st Qu.: 12, Median: 25, Mean: 34.6, 3rd Qu.: 38, Max.: 699
The location table summarizes average revision activity, first-report timing, latest-report lag, absolute spread in Report values, Relative Error spread, and the average lag until the Report value stays within the default convergence threshold of the latest Report value.
6.2 Number of revisions per location
This figure counts revision events by location across all reference dates. It is intended as a data-support diagnostic: locations with many revisions have deeper report histories, while locations with few revisions may provide less stable evidence about long-lag behavior.
Code
revisions_by_location_plot <- rev_by_loc %>%
inner_join(hhs_geo_order, by = "geo_value") %>%
ggplot(aes(x = x_pos, y = total_revisions, fill = hhs_region)) +
geom_col(width = 0.82, alpha = 0.88) +
scale_x_continuous(
breaks = hhs_geo_order$x_pos,
labels = hhs_geo_order$geo_label,
expand = expansion(mult = c(0.01, 0.01))
) +
scale_y_continuous(labels = label_comma(), breaks = breaks_pretty()) +
labs(
title = "Number of revisions per location",
subtitle = "Total revision events across all reference dates",
x = NULL,
y = "Number of revisions",
fill = "HHS region"
) +
guides(fill = "none") +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 11, face = "bold"),
legend.position = "none"
)
revisions_by_location_plot / hhs_bracket_plot() + plot_layout(heights = c(12, 2.1))6.3 Lag distribution by location
The diagnostic below shows the distribution of lag for all states in the selected HHS region. Here, lag is the number of days between the reference date and the report date. The split between first reports and later reports helps separate initial reporting timing from later revisions.
Code
backfill_df <- diagnostic_report_df %>%
group_by(geo_value, time_value) %>%
mutate(
backfill_gap = publication_lag,
is_initial_release = version == min(version, na.rm = TRUE)
) %>%
ungroup()6.4 Number of revisions over reference date
This figure summarizes how the number of revisions per location changes over reference dates. The line shows the median across locations, and the shaded band shows the 25th-75th percentile range.
6.5 Lag distribution by epi season
The diagnostic below shows lag by epi season. Each panel is one MMWR week 40-39 season, allowing the seasonal timing pattern to be compared directly.
Code
backfill_df %>%
mutate(epi_season = factor(epi_season_label(time_value), levels = season_levels)) %>%
filter(!is.na(epi_season)) %>%
ggplot(aes(x = backfill_gap, fill = is_initial_release)) +
geom_histogram(bins = 40, position = "identity", alpha = 0.7, color = "white") +
facet_wrap(~epi_season, ncol = 2, scales = "free_y") +
scale_fill_manual(
values = c("TRUE" = dis_pal[2], "FALSE" = dis_pal[1]),
labels = c("TRUE" = "First report", "FALSE" = "Later report"),
name = "Report type"
) +
scale_x_continuous(breaks = breaks_pretty()) +
labs(
title = "Lag distribution by epi season",
subtitle = "MMWR week 40-39 epi seasons; first-report timing vs. later revision depth",
x = "Lag (days)",
y = "Count"
) +
theme(strip.text.x = element_text(size = 12, face = "bold"))7 Summary
For Smoothed COVID-19 from Claims (hospital-admissions / smoothed_covid19_from_claims), the diagnostic reference dates covering 2023-01-01 to 2025-06-15 show the following revision and backfill profile:
- 0% of location-reference-date pairs were never revised after their first published Report value.
- The median lag to persistent stability is 658 days: half of all revised location-reference-date pairs reached their latest Report value by lag 658.
- The 90th percentile for maturity is 1004 days (~144 weeks).
- 33% of locations exhibit systematic under-reporting in their first published Report value.
- Median days to reach stability: 50%: 6 days | 20%: 25 days | 10%: 50 days | 5%: 98 days | 1%: 277 days | 0.5%: 343 days | 0%: 658 days.