
When Sample Selection Bias Precipitates Model Collapse

ICML 2026 paper on local sample-selection bias, model collapse, and collaborative Wasserstein-geometry proxies.

When Sample Selection Bias Precipitates Model Collapse is an ICML 2026 conference paper by Xinbao Qiao, Xianglong Du, Wei Liu, Jingqi Zhang, Peihua Mai, Meng Zhang, and Yan Pang. It examines how local verification in data silos can turn recursive synthetic-data training into a diversity-loss process, and how collaborative distributional proxies can reduce that failure mode.

Teaser: local selection bias narrows recursive synthetic data, while collaborative Wasserstein verification preserves diversity

Overview

The paper studies model collapse in recursive synthetic-data training. Prior work often treats data selection as a stabilizing tool: a verifier filters generated samples so that only high-quality synthetic data are reused for training. This paper makes the verifier the object of analysis. When the verifier sees only a biased local slice of the target distribution, selection can repeatedly reward samples near that local slice and remove tail modes that future generators need.

The motivating setting is a data-silo environment. A hospital, bank, or proprietary institution may evaluate synthetic samples against its own limited reference data. Selection then becomes a confirmation-bias mechanism: samples close to the local view are retained, while distributional tails needed for generalization are pruned away.

Method

The paper first formalizes biased top-$\alpha$ selection under Gaussian modeling and connects it to variance collapse across recursive generations. It then proposes collaborative evaluation methods that replace a single local verifier with distributional proxies computed across parties without raw-data exchange. The methodological shift is from sample quality as judged by one silo to distributional fit against a proxy for the global target.
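A minimal numpy sketch of this mechanism (illustrative, not the paper's code): a hypothetical fixed reference point stands in for the local verifier's biased slice, the top-$\alpha$ fraction of generated samples closest to that point is kept, and a Gaussian is refit on the survivors each generation. The covariance trace shrinks across generations, which is the variance-collapse behavior the formalization captures.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n, alpha, generations = 2, 5000, 0.2, 10
bias = np.full(dim, 2.0)               # hypothetical biased local reference point
mu, cov = np.zeros(dim), np.eye(dim)   # generation-0 model

for t in range(1, generations + 1):
    samples = rng.multivariate_normal(mu, cov, size=n)  # "generate" from current model
    dist = np.linalg.norm(samples - bias, axis=1)       # local verifier: closeness to its slice
    kept = samples[np.argsort(dist)[: int(alpha * n)]]  # biased top-alpha selection
    mu, cov = kept.mean(axis=0), np.cov(kept.T)         # refit next generation's model
    print(t, np.trace(cov))                             # Tr(Sigma_t) shrinks over generations
```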

Two schemes are described:

  • Scheme I, collaborative geodesic interpolation, constructs proxy measures along Wasserstein geodesics between synthetic and local real distributions;
  • Scheme II, collaborative Wasserstein barycenter estimation, computes a reusable barycenter proxy for the collective reference distribution.

Both schemes use Wasserstein-gradient-based sample scoring, so the synthetic pool is evaluated against a multi-party distributional reference rather than one biased silo.
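As a toy 1-D illustration of Scheme II (a sketch under simplifying assumptions, not the paper's multi-party protocol): in one dimension, the $W_2$ barycenter of empirical distributions is the distribution whose quantile function is the weighted average of the parties' quantile functions. The party data, weights, and quantile grid below are hypothetical, and unlike the paper's method this sketch does not model the no-raw-data-exchange constraint.

```python
import numpy as np

def w2_barycenter_1d(party_samples, weights, grid=np.linspace(0.01, 0.99, 99)):
    """1-D W2 barycenter: weighted average of the parties' quantile functions."""
    quantiles = np.stack([np.quantile(s, grid) for s in party_samples])
    return weights @ quantiles  # barycenter's quantile function on the grid

# Three hypothetical silos with locally skewed references
rng = np.random.default_rng(1)
parties = [rng.normal(m, 1.0, 2000) for m in (-2.0, 0.0, 3.0)]
proxy_q = w2_barycenter_1d(parties, np.array([1 / 3, 1 / 3, 1 / 3]))
```

Scheme I has the same 1-D reading: the $W_2$ geodesic between two one-dimensional distributions linearly interpolates their quantile functions, so the geodesic-interpolation proxy sits between the synthetic and local real quantile curves.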

Collaborative Wasserstein barycenter methodology

Key formulas

The paper's theory links local selection, diversity decay, and Wasserstein cost. In the following summary, $R_t$ is the selected top-$\alpha$ region, $D_t$ is the filtered synthetic distribution at generation $t$, and $D^\star$ is the target distribution.

Local verifier selection is summarized as truncated sampling:

$$X_{i,t}\sim \operatorname{TN}(\mu_{t-1},\Sigma_{t-1},R_t), \qquad \Pr(X\in R_t)=\alpha .$$

The resulting diversity decay can be expressed through the covariance trace:

$$\frac{\operatorname{Tr}(\Sigma_t)}{\operatorname{Tr}(\Sigma_0)} \asymp C\,t^{-\lambda_{\min}(\Psi_\infty)} .$$

A Wasserstein generalization bound then relates model risk under the target distribution to risk under the filtered distribution:

$$\mathcal{R}_{D^\star}(h_t) \le \mathcal{R}_{D_t}(h_t) + 2L\epsilon\,W_p(D_t,D^\star) + \delta .$$

The collaborative scoring rule can be viewed through a dual potential $f^\star$:

$$S(x_i) = f^\star(x_i) - \frac{1}{N-1}\sum_{j\ne i} f^\star(x_j).$$

The formulas explain the paper's main mechanism: biased selection can make the retained distribution increasingly narrow, while collaborative Wasserstein proxies try to reduce the discrepancy between filtered synthetic data and the global target distribution.
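A minimal sketch of this centered scoring rule (illustrative, not the paper's implementation): the dual potential is approximated with entropic optimal transport via log-domain Sinkhorn iterations between the synthetic pool and a sample standing in for the collaborative proxy. The squared-Euclidean cost, the regularization strength, and the keep-low-score convention are all assumptions; the potential is the first variation of the transport cost in the source measure, so under this reading samples with large potentials are costlier to transport onto the proxy.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_dual_potential(X, Y, eps=0.1, iters=500):
    """Entropic-OT dual potential f on samples X, against reference Y.

    Assumed stand-in for the paper's Wasserstein-gradient scoring; eps and the
    squared-Euclidean cost are illustrative choices, not the paper's.
    """
    n, m = len(X), len(Y)
    log_a, log_b = np.full(n, -np.log(n)), np.full(m, -np.log(m))
    M = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    f, g = np.zeros(n), np.zeros(m)
    for _ in range(iters):  # log-domain Sinkhorn updates on the dual potentials
        f = -eps * logsumexp((g[None, :] - M) / eps + log_b[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - M) / eps + log_a[:, None], axis=0)
    return f

def centered_scores(f):
    """S(x_i) = f(x_i) - mean over j != i of f(x_j), the centered scoring rule."""
    return f - (f.sum() - f) / (len(f) - 1)

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, (256, 2))  # synthetic pool
Y = rng.normal(0.5, 1.0, (256, 2))  # hypothetical sample from the collaborative proxy
S = centered_scores(sinkhorn_dual_potential(X, Y))
keep = X[np.argsort(S)[:128]]       # low score ~ cheaper to transport onto the proxy
```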

Results

The manuscript reports DDPM-style recursive image-generation experiments on CIFAR-10, STL-10, and CelebA. Baselines include Random selection, K-means, CenterMatch, and CovMatch. Under non-IID or locally skewed references, local-selection baselines can fall behind random selection, while the collaborative schemes better preserve both sample quality and mode coverage.

FID trends under recursive synthetic-data training

Class-proportion trends under recursive selection

Placement

This work belongs to Synthetic Data, Recursive Synthetic Data Training, Data Selection, Sample Selection Bias, Data Silos, Collaborative Evaluation, and Wasserstein Geometry. It is the synthetic-data counterpart to Qiao's unlearning papers: instead of asking how to remove data after training, it asks how selection and verification shape the data stream before future training.