Synthetic Data

Research topic on synthetic data, recursive training, low-resource verification, selection bias, and model collapse.

Synthetic Data is the short research-topic label for Qiao's work on generated data, recursive training, and model collapse. The full cluster remains broader than the label: it includes recursive synthetic-data training, data selection, sample selection bias, model collapse, data silos, and Wasserstein geometry.¹

Introduction

The topic treats synthetic data as both a resource and a risk. Generated samples can reduce data-access costs and support privacy-preserving workflows, but recursive use of selected synthetic data can also narrow the training distribution. This page records that tension in the specific setting of low-resource verification, biased local selection, and collaborative evaluation.

Role in this wiki

This page keeps the biography readable by giving the long technical background its own location. On the main page, "Synthetic Data" is enough to signal the topic. Here, the topic is unpacked as a research problem: generated samples can improve coverage or reduce access costs, but recursive use of generated data can amplify bias, erase modes, or distort the target distribution. The newer emphasis is that low-resource communities are not only short on data; they are also more exposed to tail loss when local verifiers mistake rare but valid samples for low-quality generations.
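
The narrowing effect described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the method of any paper listed on this page: the model is a one-dimensional Gaussian, and the "local verifier" is a hypothetical rule that rejects any sample farther than `keep_sigma` standard deviations from the current mean. The names `recursive_fit` and `keep_sigma` are illustrative only.

```python
import numpy as np

def recursive_fit(generations=10, n=2000, keep_sigma=1.5, seed=0):
    """Refit a 1-D Gaussian on its own filtered samples and track its std."""
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the target N(0, 1)
    history = [sigma]
    for _ in range(generations):
        # Generate synthetic data from the current model.
        samples = rng.normal(mu, sigma, size=n)
        # Hypothetical verifier: discards anything far from the current mean,
        # so rare but valid tail samples are treated as low quality.
        kept = samples[np.abs(samples - mu) <= keep_sigma * sigma]
        # Train the next generation only on the accepted samples.
        mu, sigma = kept.mean(), kept.std()
        history.append(sigma)
    return history

stds = recursive_fit()
print([round(s, 3) for s in stds])
```

Under these assumptions each generation keeps only the central mass of the previous one, so the fitted standard deviation shrinks geometrically and the tails of the original distribution disappear after a few rounds.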

Publications

Paper	Venue/status
When Sample Selection Bias Precipitates Model Collapse	ICML 2026, 6-11 July 2026, Seoul.

Connection to Qiao's work

When Sample Selection Bias Precipitates Model Collapse studies how local selection bias can trigger collapse in low-resource, siloed recursive training, then uses collaborative Wasserstein-style signals to diagnose the problem. This connects synthetic-data reliability to AI and networks because the key difficulty is not only generation quality, but also distributed access to evidence about the data distribution.
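
As a rough illustration of what a Wasserstein-style diagnostic signal could look like, the sketch below compares a silo's generated samples against a reference sample using the one-dimensional Wasserstein-1 distance. It is an assumption-laden toy, not the paper's procedure: the data are one-dimensional Gaussians, the "reference" stands in for evidence pooled from collaborators, and `wasserstein_1d` is a hypothetical helper that relies on the fact that, for equal-size 1-D samples, the distance reduces to the mean absolute difference of the sorted values.

```python
import numpy as np

def wasserstein_1d(a, b):
    """W1 between two equal-size 1-D samples via sorted (quantile) matching."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(a - b)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)  # assumed pooled collaborator evidence
healthy = rng.normal(0.0, 1.0, size=5000)    # silo whose model still matches the target
collapsed = rng.normal(0.0, 0.3, size=5000)  # silo whose model has lost its tails

d_healthy = wasserstein_1d(reference, healthy)
d_collapsed = wasserstein_1d(reference, collapsed)
print(f"healthy silo:   {d_healthy:.3f}")
print(f"collapsed silo: {d_collapsed:.3f}")
```

In this toy setting the collapsed silo sits much farther from the reference than the healthy one, which is the kind of gap a collaborative check could flag even when a silo's own local verifier sees nothing wrong.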

Footnotes

1. Shumailov et al., "AI models collapse when trained on recursively generated data", Nature 631, 755-759 (2024), is a widely cited reference for the recursive model-collapse framing.
