How to Audit a Computer Vision Dataset
A useful dataset audit answers four questions: What's in here? Is it labeled well? What's missing? And what should we do next? Most teams can answer the first; very few can answer the last three with evidence.
Key takeaways:
- Audit composition, label quality, and coverage — with evidence.
- Duplication and label inconsistency are among the most common silent killers.
- The real output is a ranked data roadmap, not a grade.
Composition first: class distribution, duplication rate, near-duplicate clusters, source diversity, and metadata completeness. Imbalance and duplication are among the most common silent killers: a dataset in which 40% of the images are redundant near-duplicates has only about 60% of the effective size you think it has.
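A first pass at the composition numbers can be sketched in a few lines. This is a minimal illustration, not a full audit tool: the `composition_report` function and the `(image_bytes, label)` input shape are assumptions made for the example, and it only catches byte-identical duplicates. Near-duplicate clusters (resized, re-encoded, or slightly cropped copies) would need a perceptual hash or embedding similarity on top of this.

```python
import hashlib
from collections import Counter


def composition_report(records):
    """Summarize class balance and exact-duplicate rate.

    `records` is an iterable of (image_bytes, label) pairs. That input
    shape is a hypothetical choice for this sketch.
    """
    records = list(records)
    total = len(records)
    class_counts = Counter(label for _, label in records)
    # Identical bytes hash to the same digest, so distinct digests
    # count the distinct files in the dataset.
    digests = Counter(hashlib.sha256(data).hexdigest() for data, _ in records)
    unique = len(digests)
    return {
        "total": total,
        "unique": unique,
        "duplicate_rate": (total - unique) / total if total else 0.0,
        "class_share": {c: n / total for c, n in class_counts.items()},
    }


# Toy data: three of the five records are the same file.
sample = [
    (b"img-a", "cat"),
    (b"img-a", "cat"),
    (b"img-b", "dog"),
    (b"img-c", "cat"),
    (b"img-a", "cat"),
]
report = composition_report(sample)
print(report)
```

On the toy data the report shows 5 records but only 3 unique files, which is the gap between the dataset you think you have and the one you actually have.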
Label quality next: sample-based re-annotation by independent annotators, inter-annotator agreement on ambiguous classes, and a taxonomy review — many 'model errors' turn out to be definitional inconsistencies in the labels themselves.
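One common way to put a number on inter-annotator agreement is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below assumes exactly two annotators labeling the same sample in the same order. The `cohens_kappa` function and the `scratch`/`dent` labels are illustrative, and other statistics (for example Fleiss' kappa for more than two annotators) may suit a given audit better.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same class if each
    # annotator labeled at random with their own class frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical re-annotation sample for an ambiguous class pair.
annotator_1 = ["scratch", "scratch", "dent", "dent", "scratch", "dent"]
annotator_2 = ["scratch", "dent", "dent", "dent", "scratch", "scratch"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 4 of 6 items, but half of that agreement is expected by chance, so kappa comes out at about 0.33. A low value on a specific class pair is a hint that the taxonomy definition, not the annotators, needs attention.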
Then coverage: map the dataset against the conditions of deployment — lighting, occlusion, object states, environments — and against your model's actual failure clusters. The gaps between those maps are your collection roadmap, ranked by failure cost. That ranked roadmap, not a score, is the real output of an audit.
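The ranking step can be sketched as a simple scoring pass. Everything here is an assumption for illustration: the `rank_coverage_gaps` function, the condition names, the shares, and the cost weights are hypothetical, and scoring a gap as shortfall times failure cost is one plausible rule rather than a prescribed formula.

```python
def rank_coverage_gaps(deployment_share, dataset_share, failure_cost):
    """Rank conditions by (coverage shortfall x failure cost).

    Each argument maps a condition name to a number. Shares are
    fractions of deployment traffic or dataset images. Costs are
    relative weights chosen by the team (assumed, not measured here).
    """
    rows = []
    for condition, expected in deployment_share.items():
        actual = dataset_share.get(condition, 0.0)
        # Only under-representation counts as a gap to collect for.
        gap = max(0.0, expected - actual)
        priority = gap * failure_cost.get(condition, 1.0)
        rows.append((condition, gap, priority))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical numbers for an outdoor camera deployment.
deployment = {"daylight": 0.5, "night": 0.3, "rain": 0.2}
dataset = {"daylight": 0.8, "night": 0.15, "rain": 0.05}
cost = {"daylight": 1.0, "night": 5.0, "rain": 2.0}

roadmap = rank_coverage_gaps(deployment, dataset, cost)
for condition, gap, priority in roadmap:
    print(f"{condition}: gap={gap:.2f} priority={priority:.2f}")
```

With these numbers, night and rain have the same shortfall, but night ranks first because its failures are weighted as more costly. In practice the cost weights would come from the model's observed failure clusters.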