When Sample Selection Bias Precipitates Model Collapse

ICML 2026 paper on low-resource verification regimes, sample-selection bias, model collapse, and collaborative Wasserstein-geometry proxies.

When Sample Selection Bias Precipitates Model Collapse is an ICML 2026 conference paper by Xinbao Qiao, Xianglong Du, Wei Liu, Jingqi Zhang, Peihua Mai, Meng Zhang, and Yan Pang. It examines how low-resource verification regimes in data silos can turn sample selection from a safeguard against model collapse into a mechanism that accelerates it, and how collaborative distributional proxies can reduce that failure mode.

Teaser: local selection bias narrows recursive synthetic data, while collaborative Wasserstein verification preserves diversity

Overview

The paper studies model collapse in recursive synthetic-data training. Prior work often treats data selection as a stabilizing tool: a verifier filters generated samples so that only high-quality synthetic data are reused for training. This paper makes the verifier the object of analysis. When the verifier sees only a small, fragmented, and biased slice of the target distribution, selection can repeatedly reward samples near that local slice and remove globally relevant tail modes that future generators need.

The motivating setting is a low-resource data-silo environment. A hospital consortium, bank, or proprietary institution may evaluate synthetic samples against its own limited reference data because raw data cannot be pooled. Selection then becomes a confirmation-bias mechanism: samples close to the local view are retained, while rare but valid modes are pruned away. The updated framing emphasizes why low-resource communities are especially vulnerable: tail regions are already weakly represented before synthetic augmentation begins, so local filtering can turn scarcity into persistent coverage loss.

Method

The paper first formalizes biased top-$\alpha$ selection under Gaussian modeling and connects it to variance collapse across recursive generations. It then proposes collaborative evaluation methods that replace a single local verifier with distributional proxies computed across parties without raw-data exchange. The methodological shift is from sample quality as judged by one low-resource silo to distributional fit against a proxy for the global target.

Two schemes are described:

  • Scheme I, collaborative geodesic interpolation, constructs proxy measures along Wasserstein geodesics between synthetic and local real distributions;
  • Scheme II, collaborative Wasserstein barycenter estimation, computes a reusable barycenter proxy for the collective reference distribution.

Both schemes use Wasserstein-gradient-based sample scoring, so the synthetic pool is evaluated against a multi-party distributional reference rather than one biased silo.
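The two proxy constructions can be illustrated in one dimension, where Wasserstein geometry has a closed form: the $W_2$ geodesic between two distributions interpolates their quantile functions linearly, and the $W_2$ barycenter averages them. The sketch below is a toy under that 1D assumption, not the paper's algorithm. The function names, the quantile-grid size, and the idea that each party shares only a quantile summary are hypothetical choices made for illustration, and sharing quantiles is not claimed to meet any privacy guarantee.

```python
import numpy as np

def quantile_summary(samples, n_q=101):
    """Summarize a 1D sample by its empirical quantiles on a fixed grid.

    Hypothetical stand-in for what a party might share instead of raw data.
    """
    levels = np.linspace(0.0, 1.0, n_q)
    return np.quantile(np.asarray(samples, dtype=float), levels)

def geodesic_quantiles(q_src, q_dst, t):
    """Point at time t in [0, 1] on the 1D W2 geodesic from q_src to q_dst.

    In 1D the geodesic is linear interpolation of quantile functions.
    """
    return (1.0 - t) * np.asarray(q_src) + t * np.asarray(q_dst)

def barycenter_quantiles(summaries, weights=None):
    """1D W2 barycenter: the weighted average of the quantile functions."""
    summaries = np.asarray(summaries, dtype=float)
    if weights is None:
        weights = np.full(len(summaries), 1.0 / len(summaries))
    return np.average(summaries, axis=0, weights=weights)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three silos, each seeing a different slice of the target (toy data).
    silos = [rng.normal(m, 1.0, size=5000) for m in (-2.0, 0.0, 2.0)]
    summaries = [quantile_summary(s) for s in silos]
    proxy = barycenter_quantiles(summaries)
    print("barycenter median:", proxy[len(proxy) // 2])
```

With two parties and equal weights, the barycenter coincides with the geodesic midpoint, which is one way to see how the two schemes relate in this simplified setting.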

Collaborative Wasserstein barycenter methodology

Key formula

The paper's theory links local selection, diversity decay, and Wasserstein cost. In the following summary, $R_t$ is the selected top-$\alpha$ region, $D_t$ is the filtered synthetic distribution at generation $t$, and $D^\star$ is the target distribution.

Local verifier selection is summarized as truncated sampling:

$$X_{i,t}\sim \operatorname{TN}(\mu_{t-1},\Sigma_{t-1},R_t), \qquad \Pr(X\in R_t)=\alpha .$$

The resulting diversity decay can be expressed through the covariance trace:

$$\frac{\operatorname{Tr}(\Sigma_t)}{\operatorname{Tr}(\Sigma_0)} \asymp C\,t^{-\lambda_{\min}(\Psi_\infty)} .$$
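The direction of this effect can be reproduced in a toy simulation. The sketch below assumes a one-dimensional Gaussian generator that is refit only on the retained samples, and a local verifier that keeps the fraction $\alpha$ of samples closest to its own reference mean. All names and parameter values are hypothetical. Because this toy reuses no real data between generations, its variance shrinks much faster than the rate stated above, so it illustrates the narrowing mechanism rather than the paper's asymptotics.

```python
import numpy as np

def recursive_top_alpha(generations=10, n=20000, alpha=0.5,
                        local_mean=1.0, seed=0):
    """Return the fitted variance after each round of biased selection.

    Toy model (an assumption, not the paper's setup): 1D Gaussian generator,
    verifier scores samples by distance to a local reference mean.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0                      # generation-0 generator N(0, 1)
    variances = [sigma ** 2]
    k = int(alpha * n)                        # number of samples retained
    for _ in range(generations):
        x = rng.normal(mu, sigma, size=n)     # draw the synthetic pool
        dist = np.abs(x - local_mean)         # local verifier's score
        kept = x[np.argsort(dist)[:k]]        # keep the top-alpha fraction
        mu, sigma = kept.mean(), kept.std()   # refit the generator
        variances.append(sigma ** 2)
    return np.array(variances)

if __name__ == "__main__":
    v = recursive_top_alpha()
    for t, ratio in enumerate(v / v[0]):
        print(f"generation {t}: variance ratio {ratio:.3e}")
```

The retained set in each round is an interval around the local reference, so the refit generator drifts toward the verifier's view while losing mass in the tails.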

A Wasserstein generalization bound then relates model risk under the target distribution to the filtered distribution:

$$\mathcal{R}_{D^\star}(h_t) \le \mathcal{R}_{D_t}(h_t) +2L\epsilon\,W_p(D_t,D^\star)+\delta .$$
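To see how the terms of this bound combine, the following sketch evaluates its right-hand side for one-dimensional samples. It relies on the fact that, for two empirical distributions with the same number of points on the real line, $W_p$ is obtained by matching sorted samples. The constants `L`, `eps`, and `delta` are passed in as plain numbers because this summary does not define them, and the function names and numeric values are hypothetical.

```python
import numpy as np

def wasserstein_p_1d(x, y, p=2):
    """Empirical W_p between two equal-size 1D samples via sorted matching."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

def risk_upper_bound(filtered_risk, L, eps, w_dist, delta):
    """Right-hand side of the bound: R_{D_t}(h_t) + 2*L*eps*W_p + delta."""
    return filtered_risk + 2.0 * L * eps * w_dist + delta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(0.0, 1.0, size=2000)      # stand-in for D*
    filtered = rng.normal(0.5, 0.3, size=2000)    # narrow, shifted stand-in for D_t
    w = wasserstein_p_1d(filtered, target, p=2)
    # Illustrative constants only; they are not taken from the paper.
    print("W_2:", w)
    print("bound:", risk_upper_bound(0.10, L=1.0, eps=0.5, w_dist=w, delta=0.01))
```

The only term that selection influences directly is $W_p(D_t, D^\star)$, which grows as the filtered distribution narrows or drifts away from the target.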

The collaborative scoring rule can be viewed through a dual potential $f^\star$:

$$S(x_i) = f^\star(x_i) -\frac{1}{N-1}\sum_{j\ne i} f^\star(x_j).$$
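Given the potential values $f^\star(x_1),\dots,f^\star(x_N)$, the score is a leave-one-out centering and can be computed in one vectorized step. The sketch below assumes the potentials have already been obtained by some optimal-transport solver, which is outside its scope. The function name is hypothetical, and this summary does not state whether high or low scores are retained, so the code only computes $S$.

```python
import numpy as np

def leave_one_out_scores(potentials):
    """Compute S(x_i) = f(x_i) - mean of f(x_j) over all j != i."""
    f = np.asarray(potentials, dtype=float)
    n = f.size
    if n < 2:
        raise ValueError("need at least two samples for a leave-one-out mean")
    # (f.sum() - f) is the sum over j != i for every i at once.
    return f - (f.sum() - f) / (n - 1)

if __name__ == "__main__":
    f_values = [1.0, 2.0, 3.0, 6.0]   # hypothetical potential values
    print(leave_one_out_scores(f_values))
```

Algebraically, $S(x_i) = \frac{N}{N-1}\bigl(f^\star(x_i) - \bar f\bigr)$ with $\bar f$ the mean potential, so the scores sum to zero and preserve the ranking induced by $f^\star$.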

The formulas explain the paper's main mechanism: biased selection can make the retained distribution increasingly narrow, while collaborative Wasserstein proxies try to reduce the discrepancy between filtered synthetic data and the global target distribution.

Results

The manuscript reports DDPM-style recursive image-generation experiments on CIFAR-10, STL-10, and CelebA. Baselines include Random selection, K-means, CenterMatch, and CovMatch. Under non-IID or locally skewed references, local-selection baselines can fall behind random selection, while the collaborative schemes better preserve both sample quality and mode coverage.

The main lesson is that low-resource regimes are not merely smaller versions of high-resource settings. When real-data coverage is scarce or fragmented, tail modes may already be difficult to observe. Local-reference selection can then confuse rare but valid samples with low-quality generations, systematically suppressing underrepresented regions of the target distribution. An appendix experiment with a topic-local LLM verifier makes the same point semantically: filtering against a narrow local topic can reduce held-out topic coverage rather than protect it.

FID trends under recursive synthetic-data training

Class-proportion trends under recursive selection

Placement

This work belongs to Synthetic Data, Recursive Synthetic Data Training, Data Selection, Sample Selection Bias, Data Silos, Collaborative Evaluation, and Wasserstein Geometry. It is the synthetic-data counterpart to Qiao's unlearning papers: instead of asking how to remove data after training, it asks how selection and verification shape the data stream before future training. The low-resource emphasis also connects the paper to the social side of model collapse: distributional tail loss can correspond to the loss of culturally, linguistically, or institutionally underrepresented content.