
Recursive Synthetic Data Training

Concept page for training models repeatedly on data generated by earlier models.

Recursive Synthetic Data Training is a training process in which data generated by one model generation becomes part of the training set for a later generation. The process can be intentional, as in self-training or synthetic-data bootstrapping, or incidental, as when generated content enters future training corpora.1
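The loop can be illustrated with a deliberately minimal toy model: each "generation" fits a Gaussian to its training data, then the next generation trains only on samples drawn from that fit. This is an illustrative sketch, not any production training pipeline; the model, sample sizes, and number of generations are arbitrary choices.

```python
import random
import statistics

def fit(data):
    # "Train" a toy model: estimate the mean and standard deviation.
    return statistics.mean(data), statistics.pstdev(data)

def generate(mu, sigma, n, rng):
    # Sample a synthetic corpus from the fitted model.
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]  # real data, generation 0

sigmas = []
for generation in range(10):
    mu, sigma = fit(data)
    sigmas.append(sigma)
    # The next generation sees only synthetic data from the previous fit.
    data = generate(mu, sigma, 200, rng)

# In expectation the estimated spread drifts downward across generations,
# because each finite synthetic sample under-represents the tails.
print(sigmas[0], sigmas[-1])
```

Any single run is noisy, but the downward bias compounds over generations, which is the distributional-error amplification this page is about.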

Role in this wiki

This page explains the process behind model collapse. It is distinct from synthetic data in general: a one-time synthetic augmentation may be useful, while repeated reuse can amplify distributional errors. The wiki uses this page to separate the mechanism from the outcome. Recursive training is the loop; collapse is one possible degenerative result.

Connection to Qiao's work

The paper "When Sample Selection Bias Precipitates Model Collapse" studies recursive training under local sample-selection bias. Its setting is especially relevant to AI and networks because the data process is distributed: parties may see different data, select different samples, and share only limited signals. Recursive synthetic-data training therefore becomes a cross-silo reliability problem, not only a generative-model problem.
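Extending the toy Gaussian model above, a sketch of how sample-selection bias interacts with the recursive loop: if each generation keeps only a biased subset of the synthetic data (here, an illustrative rule that discards the lower tail, not the paper's actual selection mechanism), the fitted distribution drifts systematically rather than just losing variance.

```python
import random
import statistics

def fit(data):
    # Toy "training": estimate mean and standard deviation.
    return statistics.mean(data), statistics.pstdev(data)

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]  # real data, generation 0

means = []
for generation in range(8):
    mu, sigma = fit(data)
    means.append(mu)
    synthetic = [rng.gauss(mu, sigma) for _ in range(300)]
    # Illustrative local selection bias: keep only the top 80% of samples,
    # discarding the lower tail before the next generation trains.
    synthetic.sort()
    data = synthetic[len(synthetic) // 5:]

# The fitted mean ratchets upward generation after generation:
# selection bias turns sampling noise into a systematic drift.
print(means)
```

Without the selection step the mean would merely random-walk around its starting value; the biased filter is what converts the recursive loop into directional drift.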

See also

Footnotes

  1. The 2024 Nature paper "AI models collapse when trained on recursively generated data" helped popularize this recursive framing for generative models and synthetic corpora.