Data Selection
Concept page for choosing training or evaluation data under reliability constraints.
Data Selection is the process of choosing which examples are used for training, pruning, evaluation, or synthetic-data reuse. In this wiki it is treated as a central data-centric operation: selection can reduce cost and improve quality, but biased selection can also distort a model's view of the target distribution.
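The distortion risk can be made concrete with a toy sketch. The scoring rule below (keeping the largest values as a proxy for "high-quality" examples) is a hypothetical illustration, not any specific selection method: uniform subsampling preserves the population mean, while the biased rule shifts it.

```python
import random
import statistics

random.seed(0)

# True data: a symmetric distribution centered at 0.
population = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Unbiased selection: uniform subsampling preserves the mean.
uniform_sample = random.sample(population, 1_000)

# Biased selection (hypothetical scoring rule): keep only the top-scoring
# examples, here proxied by the largest values -- the selected set no
# longer reflects the target distribution.
biased_sample = sorted(population, reverse=True)[:1_000]

print(round(statistics.mean(population), 2))      # close to 0
print(round(statistics.mean(uniform_sample), 2))  # close to 0
print(round(statistics.mean(biased_sample), 2))   # shifted well above 0
```

Both selected sets are the same size and the same cost; only the biased one changes what a downstream model would learn the distribution looks like.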
Role in this wiki
This page links Data Centric ML to both AI and networks and Synthetic Data. In decentralized or siloed settings, selection is often local: each participant sees only part of the data and chooses examples according to its own goals or constraints. That makes selection a networked problem rather than a purely statistical preprocessing step.
Connection to Qiao's work
Data selection appears in When Sample Selection Bias Precipitates Model Collapse, where biased local selection can worsen recursive synthetic-data training. In the unlearning papers, selection reappears as removal or reweighting: the model is changed by changing which examples count.
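A toy Gaussian simulation (an illustrative assumption, not the paper's actual model) shows how biased selection can compound across generations of synthetic-data training: each generation samples from the previous generation's fitted model, selects, and refits. The `next_generation` helper and the keep-the-upper-half selection rule are hypothetical.

```python
import random
import statistics

random.seed(1)

def next_generation(mu, sigma, n=2000, biased=False):
    """Sample from the current model, select, and refit a Gaussian."""
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    if biased:
        # Biased local selection (toy rule): keep only the upper half,
        # systematically shifting what the next model is trained on.
        samples = sorted(samples)[n // 2:]
    return statistics.mean(samples), statistics.stdev(samples)

mu_u, sig_u = 0.0, 1.0   # unbiased lineage
mu_b, sig_b = 0.0, 1.0   # biased lineage

for _ in range(10):
    mu_u, sig_u = next_generation(mu_u, sig_u)
    mu_b, sig_b = next_generation(mu_b, sig_b, biased=True)

print(round(mu_u, 2), round(sig_u, 2))  # stays near (0, 1) up to sampling noise
print(round(mu_b, 2), round(sig_b, 2))  # mean drifts upward, spread collapses
```

The unbiased lineage only accumulates sampling noise, while the biased lineage both drifts away from the original mean and loses variance each round: a small, repeated selection bias is enough to precipitate collapse in this sketch.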