There could be stark differences in posterior predictive results among models, but I suspect that in many cases those differences will be difficult to interpret on their own. In any case, they should already be reflected in the initial evaluation of models using something like multinomial logistic regression: a scenario that consistently generates summary statistics far from the observed data will be downweighted in the rejection step during model selection.
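For concreteness, here is a minimal toy sketch of that rejection step (the two models, their parameter values, the observed summary, and the tolerance are all invented for illustration): a model whose simulated summaries rarely land near the observed value ends up with a small acceptance count, and hence a small posterior probability.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical competing models, each generating a 1-D summary statistic.
def simulate_model_a(n):
    return rng.normal(0.0, 1.0, n)       # model A: summary ~ N(0, 1)

def simulate_model_b(n):
    return rng.normal(2.0, 1.0, n)       # model B: summary ~ N(2, 1)

observed = 0.1                           # pseudo-observed summary statistic
n_sims, eps = 100_000, 0.05              # simulations per model, tolerance

# Rejection step: keep simulations whose summary lies within eps of the data.
acc_a = np.abs(simulate_model_a(n_sims) - observed) < eps
acc_b = np.abs(simulate_model_b(n_sims) - observed) < eps

# Crude model posterior probabilities from acceptance counts (equal priors).
total = acc_a.sum() + acc_b.sum()
p_a = acc_a.sum() / total
p_b = acc_b.sum() / total
print(p_a, p_b)                          # model B is strongly downweighted
```

Model B's summaries are centered far from the observed value, so it contributes few acceptances and gets a small posterior probability, which is the downweighting I mean above.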

Also unsure about the borrowed frequentist terminology of ‘type I’ and ‘type II’ error for model selection. Maybe it is better to think of false positives and false negatives with respect to each ‘true’ model based on pseudo-observed data from prior predictive simulations, rather than invoking a null-hypothesis framework?

Regarding parameter estimation, graphical inspection of posterior predictive tests (comparing simulated vs. observed summary statistics) at least permits some judgment of whether a particular model can plausibly generate the observed data. But predictive coverage (e.g., the proportion of pods whose true parameter values fall within the 90% HPD interval estimated from each pod analysis) seems more helpful.
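Something like the following toy sketch is what I mean by coverage. The conjugate-normal "posterior" here is invented purely for illustration, and I use an equal-tailed 90% interval as a simple stand-in for the HPD interval:

```python
import numpy as np

rng = np.random.default_rng(2)

n_pods, n_post = 500, 2000
prior_sd, noise_sd = 1.0, 0.5

hits = 0
for _ in range(n_pods):
    theta = rng.normal(0.0, prior_sd)            # true value for this pod
    y = rng.normal(theta, noise_sd)              # pseudo-observed datum
    # Conjugate normal posterior given prior N(0, 1) and one observation y.
    post_var = 1.0 / (1.0 / prior_sd**2 + 1.0 / noise_sd**2)
    post_mean = post_var * y / noise_sd**2
    post = rng.normal(post_mean, np.sqrt(post_var), n_post)
    lo, hi = np.quantile(post, [0.05, 0.95])     # equal-tailed 90% interval
    hits += (lo <= theta <= hi)

coverage = hits / n_pods
print(coverage)      # near 0.90 for a calibrated posterior
```

For a well-calibrated posterior the estimated coverage should sit near the nominal 90%; substantially lower coverage flags overconfident intervals.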

Likewise, the uniformity of posterior predictive quantiles illustrated in Wegmann et al. 2010 etc. could facilitate more nuanced evaluation of the posterior shape. For example, one might find that posterior distributions are biased toward larger or smaller values, but still provide an adequate estimate of credible intervals (e.g., ca. 90% of pods fall within their estimated 90% HPDs)… or even a conservative estimate (e.g., ca. 95% of pods within their estimated 90% HPDs).
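As a toy sketch of that quantile check (again with an invented conjugate-normal posterior standing in for an ABC posterior): record the quantile of each pod's true value within its posterior sample. A calibrated posterior yields quantiles uniform on [0, 1], whereas a biased posterior piles them up near 0 or 1.

```python
import numpy as np

rng = np.random.default_rng(3)

n_pods, n_post = 1000, 2000
quantiles = np.empty(n_pods)
for i in range(n_pods):
    theta = rng.normal(0.0, 1.0)                 # true value from the prior
    y = rng.normal(theta, 0.5)                   # pseudo-observed datum
    post_var = 1.0 / (1.0 + 1.0 / 0.25)          # conjugate normal posterior
    post = rng.normal(post_var * y / 0.25, np.sqrt(post_var), n_post)
    quantiles[i] = np.mean(post < theta)         # quantile of the true value

# A simple Kolmogorov-Smirnov-style distance from the uniform CDF; small
# values are consistent with uniform (i.e., calibrated) quantiles.
sorted_q = np.sort(quantiles)
ks = np.max(np.abs(sorted_q - np.arange(1, n_pods + 1) / n_pods))
print(quantiles.mean(), ks)
```

A histogram of these quantiles would show the shape of any miscalibration directly: a skew toward 0 or 1 indicates directional bias, while a U- or hump-shape indicates intervals that are too narrow or too wide.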

Best,

Chris