
How to Audit a Computer Vision Dataset

Datafy Lab Insights · 5 min read

A useful dataset audit answers four questions: What's in here? Is it labeled well? What's missing? And what should we do next? Most teams can answer the first; very few can answer the last three with evidence.

// key_takeaways

Audit composition, label quality, and coverage — with evidence.
Duplication and label inconsistency are the most common silent killers.
The real output is a ranked data roadmap, not a grade.

Composition first: class distribution, duplication rate, near-duplicate clusters, source diversity, and metadata completeness. Imbalance and duplication are the most common silent killers at this stage: a dataset that is 40% near-duplicates holds only about 60% of the unique examples its size suggests.
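As a rough illustration of the composition checks, here is a minimal sketch. It assumes each image already has a 64-bit perceptual hash stored as an integer (how you compute that hash is up to you), and it treats two images as near-duplicates when their hashes differ by at most `max_distance` bits. The function names, the threshold, and the sample data are all hypothetical, and the pairwise comparison is quadratic, so it only suits small samples.

```python
from collections import Counter


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def near_duplicate_clusters(hashes, max_distance=4):
    """Group indices whose hashes are within max_distance bits (union-find)."""
    parent = list(range(len(hashes)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(hashes)):
        for j in range(i + 1, len(hashes)):
            if hamming(hashes[i], hashes[j]) <= max_distance:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(hashes)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def composition_report(labels, hashes, max_distance=4):
    """Class counts plus effective size after collapsing near-duplicates."""
    clusters = near_duplicate_clusters(hashes, max_distance)
    total = len(hashes)
    return {
        "class_counts": dict(Counter(labels)),
        "effective_size": len(clusters),
        "duplication_rate": (total - len(clusters)) / total if total else 0.0,
    }


# Hypothetical sample: five images, two of which are near-copies of others.
labels = ["cat", "cat", "dog", "dog", "dog"]
hashes = [0b0000, 0b0001, 0xFF00, 0xFF01, 0x0FF0FF]
report = composition_report(labels, hashes, max_distance=2)
print(report)
```

In this toy sample, five images collapse to three clusters, so the effective dataset is smaller than the file count suggests. On a real dataset you would likely swap the pairwise loop for an approximate nearest-neighbor index.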

Label quality next: sample-based re-annotation by independent annotators, inter-annotator agreement on ambiguous classes, and a taxonomy review — many 'model errors' turn out to be definitional inconsistencies in the labels themselves.
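One way to put a number on agreement between the original labels and an independent re-annotation is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is one possible approach, not a prescribed method; the class names and sample labels are made up for illustration. It also counts which class pairs are confused most often, since those pairs are natural candidates for a taxonomy review.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same class
    # if each labeled independently according to their own class frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class
    return (observed - expected) / (1 - expected)


def disagreement_pairs(labels_a, labels_b):
    """Count unordered class pairs on which the annotators disagreed."""
    pairs = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs


# Hypothetical audit sample: original labels vs. an independent re-annotation.
original = ["scratch", "scratch", "dent", "dent", "ok", "ok", "ok", "ok"]
relabeled = ["scratch", "dent", "dent", "scratch", "ok", "ok", "ok", "ok"]

print(round(cohens_kappa(original, relabeled), 3))
print(disagreement_pairs(original, relabeled).most_common(3))
```

Here every disagreement falls on the same two classes, which hints at an unclear definition rather than careless annotation.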

Then coverage: map the dataset against the conditions of deployment — lighting, occlusion, object states, environments — and against your model's actual failure clusters. The gaps between those maps are your collection roadmap, ranked by failure cost. That ranked roadmap, not a score, is the real output of an audit.
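The ranking step can be sketched in a few lines. This example assumes you have estimated, for each deployment condition, its share of real-world traffic, its share of the dataset, and a relative cost of failure; it scores each condition as the coverage shortfall multiplied by that cost. The scoring formula, the field names, and the numbers are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    name: str
    deployment_share: float  # estimated fraction of deployment traffic
    dataset_share: float     # fraction of the dataset covering this condition
    failure_cost: float      # relative cost of a failure (assumed scale)


def rank_gaps(conditions):
    """Return (name, priority) pairs for under-covered conditions, highest first."""
    scored = []
    for c in conditions:
        # Only under-representation counts as a gap; surplus coverage scores 0.
        gap = max(0.0, c.deployment_share - c.dataset_share)
        priority = gap * c.failure_cost
        if priority > 0:
            scored.append((c.name, priority))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical conditions for illustration only.
conditions = [
    Condition("daylight", deployment_share=0.50, dataset_share=0.85, failure_cost=1.0),
    Condition("occluded", deployment_share=0.20, dataset_share=0.10, failure_cost=2.0),
    Condition("low_light", deployment_share=0.30, dataset_share=0.05, failure_cost=4.0),
]

for name, priority in rank_gaps(conditions):
    print(f"{name}: {priority:.2f}")
```

The output is an ordered list of conditions to collect for, not a single score. In practice you might also weight each condition by the size of the model's observed failure cluster.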
