Data Silos

Concept page for learning and evaluation when data are distributed across separate holders.

Data silos are organizational, legal, technical, or geographic separations that prevent training data from being pooled in one place. In this wiki, the term covers institutions, devices, or clients that each hold only a partial view of the target distribution.

Role in this wiki

Data silos are a key reason why work on AI and networks differs from ordinary centralized machine learning. When each party sees only its local data, model training and evaluation must work under communication, privacy, and representation constraints. A silo is useful because it protects data ownership, but it also makes global diagnosis harder: bias may be invisible locally and become obvious only when evidence is compared across parties. This matters most for low-resource holders, whose local data may underrepresent tail regions of the distribution from the start.
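The point that bias can be invisible locally but visible once evidence is compared across parties can be illustrated with a minimal sketch. It assumes each silo is willing to share only aggregate label counts, never raw records, and that a coordinator measures how far each silo's label distribution sits from the pooled one using total variation distance. The silo names, the labels, and the functions `normalize`, `total_variation`, and `compare_silos` are all hypothetical and chosen for illustration; they are not taken from any specific method described in this wiki.

```python
from collections import Counter


def normalize(counts):
    """Turn raw label counts into relative frequencies."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


def compare_silos(silo_labels):
    """Compare each silo's label distribution with the pooled one.

    Only per-silo label counts are combined here, which stands in for
    sharing aggregates instead of raw data (an assumption of this sketch).
    """
    counts = {name: Counter(labels) for name, labels in silo_labels.items()}
    pooled = Counter()
    for c in counts.values():
        pooled.update(c)
    pooled_dist = normalize(pooled)
    return {
        name: total_variation(normalize(c), pooled_dist)
        for name, c in counts.items()
    }


# Hypothetical silos: clinic_c is low-resource and never observes the tail label.
silos = {
    "hospital_a": ["common"] * 90 + ["rare"] * 10,
    "hospital_b": ["common"] * 50 + ["rare"] * 50,
    "clinic_c": ["common"] * 20,
}

gaps = compare_silos(silos)
most_skewed = max(gaps, key=gaps.get)
print(most_skewed, round(gaps[most_skewed], 3))
```

In this toy setup, `clinic_c` has no local signal that a `rare` label exists at all; the gap only appears once its counts are set against those of the other parties. A real deployment would also need to consider whether sharing even aggregate counts is acceptable under the silo's privacy constraints.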

Connection to Qiao's work

Data silos are central to When Sample Selection Bias Precipitates Model Collapse, which studies recursive synthetic-data training under low-resource local sample-selection bias. In that setting, the research question is not only how accurate the model is, but also how distributed parties can coordinate without assuming complete access to the data.
