Abstract
Model performance evaluation is a critical and expensive task in machine learning and computer vision. Without clear guidelines, practitioners often estimate model accuracy using a one-time completely random selection of the data. However, by employing tailored sampling and estimation strategies, one can obtain more precise estimates and reduce annotation costs. In this paper, we propose a statistical framework for model evaluation that includes stratification, sampling, and estimation components. We examine the statistical properties of each component and evaluate their efficiency (precision). One key result of our work is that stratification via \(k\)-means clustering based on accurate predictions of model performance yields efficient estimators. Our experiments on computer vision datasets show that this method consistently provides more precise accuracy estimates than the traditional simple random sampling, even with substantial efficiency gains of 10x. We also find that model-assisted estimators, which leverage predictions of model accuracy on the unlabeled portion of the dataset, are generally more efficient than the traditional estimates based solely on the labeled data.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Angelopoulos, A.N., Bates, S., Fannjiang, C., Jordan, M.I., Zrnic, T.: Prediction-powered inference. Science 382(6671), 669–674 (2023)
Angelopoulos, A.N., Duchi, J.C., Zrnic, T.: Ppi++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453 (2023)
Ash, J.T., Zhang, C., Krishnamurthy, A., Langford, J., Agarwal, A.: Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv preprint arXiv:1906.03671 (2019)
Baek, C., Jiang, Y., Raghunathan, A., Kolter, J.Z.: Agreement-on-the-line: predicting the performance of neural networks under distribution shift. Adv. Neural. Inf. Process. Syst. 35, 19274–19289 (2022)
Barbu, A., et al.: Objectnet: a large-scale bias-controlled dataset for pushing the limits of object recognition models. Adv. Neural Inform. Process. Syst. 32 (2019)
Beery, S., Cole, E., Gjoka, A.: The iwildcam 2020 competition dataset. arXiv preprint arXiv:2004.10340 (2020)
Breidt, F.J., Claeskens, G., Opsomer, J.: Model-assisted estimation for complex surveys using penalised splines. Biometrika 92(4), 831–846 (2005)
Breidt, F.J., Opsomer, J.D.: Model-assisted survey estimation with modern prediction techniques. Stat. Sci. 32(2), 190–205 (2017). https://doi.org/10.1214/16-STS589
Brus, D.J.: Spatial sampling with R. CRC Press (2022)
Chen, M., Goel, K., Sohoni, N.S., Poms, F., Fatahalian, K., Ré, C.: Mandoline: Model evaluation under distribution shift. In: International Conference on Machine Learning, pp. 1617–1629. PMLR (2021)
Chen, T., Lumley, T.: Optimal multiwave sampling for regression modeling in two-phase designs. Stat. Med. 39(30), 4912–4921 (2020)
Chen, T., Lumley, T.: Optimal sampling for design-based estimators of regression models. Stat. Med. 41(8), 1482–1497 (2022)
Chen, Y., Zhang, S., Song, R.: Scoring your prediction on unseen data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 3279–3288 (June 2023)
Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: benchmark and state of the art. Proc. IEEE 105(10), 1865–1883 (2017)
Chouldechova, A., Deng, S., Wang, Y., Xia, W., Perona, P.: Unsupervised and semi-supervised bias benchmarking in face recognition. In: European Conference on Computer Vision, pp. 289–306. Springer (2022). https://doi.org/10.1007/978-3-031-19778-9_17
Chu, W., Zinkevich, M., Li, L., Thomas, A., Tseng, B.: Unbiased online active learning in data streams. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery And Data Mining, pp. 195–203 (2011)
Chuang, C.Y., Torralba, A., Jegelka, S.: Estimating generalization under distribution shifts via domain-invariant representations. arXiv preprint arXiv:2007.03511 (2020)
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., , Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
Clark, R.G., Steel, D.G.: Sample design for analysis using high-influence probability sampling. J. R. Stat. Soc. Ser. A Stat. Soc. 185(4), 1733–1756 (2022)
Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223. JMLR Workshop and Conference Proceedings (2011)
Cochran, W.G.: Sampling Techniques. John Wiley & Sons (1977)
Cohn, D.A., Ghahramani, Z., Jordan, M.I.: Active learning with statistical models. J. Artif. Intell. Res. 4, 129–145 (1996)
Deng, L.: The mnist database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29(6), 141–142 (2012)
Deng, W., Gould, S., Zheng, L.: What does rotation prediction tell us about classifier accuracy under varying testing environments? In: International Conference on Machine Learning, pp. 2579–2589. PMLR (2021)
Deng, W., Zheng, L.: Are labels always necessary for classifier accuracy evaluation? In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15069–15078 (2021)
Emma, D., Jared, J., Cukierski, W.: Diabetic retinopathy detection (2015). https://kaggle.com/competitions/diabetic-retinopathy-detection
Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2007 (VOC 2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
Farquhar, S., Gal, Y., Rainforth, T.: On statistical bias in active learning: How and when to fix it. arXiv preprint arXiv:2101.11665 (2021)
Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In: 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 178–178. IEEE (2004)
Fuller, W.A.: Sampling Statistics. John Wiley & Sons (2011)
Gal, Y., Islam, R., Ghahramani, Z.: Deep bayesian active learning with image data. In: International Conference on Machine Learning, pp. 1183–1192. PMLR (2017)
Ganti, R., Gray, A.: Upal: Unbiased pool based active learning. In: Artificial Intelligence and Statistics, pp. 422–431. PMLR (2012)
Garg, S., Balakrishnan, S., Lipton, Z.C., Neyshabur, B., Sedghi, H.: Leveraging unlabeled data to predict out-of-distribution performance. arXiv preprint arXiv:2201.04234 (2022)
Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the kitti dataset. Inter. J. Robot. Res. (IJRR) (2013)
Graubardand, B.I., Korn, E.L.: Inference for superpopulation parameters using sample surveys. Stat. Sci. 17(1), 73–96 (2002)
Guillory, D., Shankar, V., Ebrahimi, S., Darrell, T., Schmidt, L.: Predicting with confidence on unseen distributions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1134–1144 (2021)
Hájek, J.: Optimal strategy and other problems in probability sampling. Časopis pro pěstování matematiky 84(4), 387–423 (1959)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hébert-Johnson, U., Kim, M., Reingold, O., Rothblum, G.: Multicalibration: calibration for the (computationally-identifiable) masses. In: International Conference on Machine Learning, pp. 1939–1948. PMLR (2018)
Helber, P., Bischke, B., Dengel, A., Borth, D.: Eurosat: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Selected Topics Appli. Earth Observations Remote Sensing (2019)
Hendrycks, D., et al.: The many faces of robustness: a critical analysis of out-of-distribution generalization. In: ICCV (2021)
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., Song, D.: Natural adversarial examples. In: CVPR (2021)
Horvitz, D.G., Thompson, D.J.: A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc. 47(260), 663–685 (1952)
Ilharco, G., et al.: Openclip (2021). https://doi.org/10.5281/zenodo.5143773
Imberg, H., Axelson-Fisk, M., Jonasson, J.: Optimal subsampling designs. arXiv preprint arXiv:2304.03019 (2023)
Imberg, H., Jonasson, J., Axelson-Fisk, M.: Optimal sampling in unbiased active learning. In: International Conference on Artificial Intelligence and Statistics, pp. 559–569. PMLR (2020)
Imberg, H., Yang, X., Flannagan, C., Bärgman, J.: Active sampling: A machine-learning-assisted framework for finite population inference with optimal subsamples. arXiv preprint arXiv:2212.10024 (2022)
Isaki, C.T., Fuller, W.A.: Survey design under the regression superpopulation model. J. Am. Stat. Assoc. 77(377), 89–96 (1982)
Jiang, Y., Nagarajan, V., Baek, C., Kolter, J.Z.: Assessing generalization of sgd via disagreement. arXiv preprint arXiv:2106.13799 (2021)
Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910 (2017)
Kim, M.P., Kern, C., Goldwasser, S., Kreuter, F., Reingold, O.: Universal adaptability: target-independent inference that competes with propensity scoring. Proc. Nat. Acad. Sci. 119(4), e2108097119 (2022)
Kirsch, A., Van Amersfoort, J., Gal, Y.: Batchbald: efficient and diverse batch acquisition for deep bayesian active learning. Adv. Neural Inform. Process. Syst. 32 (2019)
Koh, P.W., et al.: Wilds: a benchmark of in-the-wild distribution shifts. In: International Conference on Machine Learning, pp. 5637–5664. PMLR (2021)
Kossen, J., Farquhar, S., Gal, Y., Rainforth, T.: Active surrogate estimators: an active learning approach to label-efficient model evaluation. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems, vol. 35, pp. 24557–24570. Curran Associates, Inc. (2022)
Kossen, J., Farquhar, S., Gal, Y., Rainforth, T.: Active testing: sample-efficient model evaluation. In: International Conference on Machine Learning, pp. 5753–5763. PMLR (2021)
Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia (2013)
Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
Kull, M., Flach, P.: Novel decompositions of proper scoring rules for classification: score adjustment as precursor to calibration. In: Appice, A., Rodrigues, P.P., Santos Costa, V., Soares, C., Gama, J., Jorge, A. (eds.) ECML PKDD 2015. LNCS (LNAI), vol. 9284, pp. 68–85. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23528-8_5
LAION AI: Clip benchmark. https://github.com/LAION-AI/CLIP_benchmark
LeCun, Y., Huang, F.J., Bottou, L.: Learning methods for generic object recognition with invariance to pose and lighting. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2004. vol. 2, pp. II–104. IEEE (2004)
Lewis, D.D.: A sequential algorithm for training text classifiers: corrigendum and additional data. In: ACM SIGIR Forum, vol. 29, pp. 13–19. ACM New York (1995)
Lewis, D.D., Catlett, J.: Heterogeneous uncertainty sampling for supervised learning. In: Machine Learning Proceedings 1994, pp. 148–156. Elsevier (1994)
Li, Z., Ma, X., Xu, C., Cao, C., Xu, J., Lü, J.: Boosting operational dnn testing efficiency through conditioning 10(1145/3338906), 3338930 (2019)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference On Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Lohr, S.L.: Sampling: design and analysis. CRC press (2021)
Lumley, T., Shaw, P.A., Dai, J.Y.: Connections between survey calibration estimators and semiparametric models for incomplete data. Int. Stat. Rev. 79(2), 200–220 (2011)
Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
Matthey, L., Higgins, I., Hassabis, D., Lerchner, A.: dsprites: Disentanglement testing sprites dataset (2017). https://github.com/deepmind/dsprites-dataset/
McConville, K.S., Breidt, F.J., Lee, T.C., Moisen, G.G.: Model-assisted survey regression estimation with the lasso. J. Surv. Statist. Methodol. 5(2), 131–158 (2017)
Miller, B.A., Vila, J., Kirn, M., Zipkin, J.R.: Classifier performance estimation with unbalanced, partially labeled data. In: Torgo, L., Matwin, S., Weiss, G., Moniz, N., Branco, P. (eds.) Proceedings of The International Workshop on Cost-Sensitive Learning. Proceedings of Machine Learning Research, 05 May, vol. 88, pp. 4–16. PMLR (2018)
Miller, J.P., et al.: Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In: International Conference on Machine Learning, pp. 7721–7735. PMLR (2021)
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning (2011)
Neyman, J.: On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. In: Breakthroughs in Statistics: Methodology and Distribution, pp. 123–150. Springer (1992). https://doi.org/10.1007/978-1-4612-4380-9_12
Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729. IEEE (2008)
Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: 2012 IEEE conference on Computer Vision and Pattern Recognition, pp. 3498–3505. IEEE (2012)
Poms, F., et al.: Low-shot validation: active importance sampling for estimating classifier performance on rare categories. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10705–10714 (October 2021)
Radford, A., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, pp. 8748–8763. PMLR (2021)
Recht, B., Roelofs, R., Schmidt, L., Shankar, V.: Do imagenet classifiers generalize to imagenet? In: International Conference on Machine Learning, pp. 5389–5400. PMLR (2019)
Ren, P., et al.: A survey of deep active learning. ACM Comput. Surv. (CSUR) 54(9), 1–40 (2021)
Roth, A.: Uncertain: Modern topics in uncertainty estimation (2022)
Russakovsky, O., et al.: ImageNet Large Scale Visual Recognition Challenge. Inter. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
Särndal, C.E.: The calibration approach in survey theory and practice. Surv. Pract. 33(2), 99–119 (2007)
Särndal, C.E., Swensson, B., Wretman, J.: Model assisted survey sampling. Springer Science & Business Media (2003)
Sawade, C., Landwehr, N., Bickel, S., Scheffer, T.: Active risk estimation. In: Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML 2010. pp. 951-958. Omnipress, Madison, WI, USA (2010)
Sawade, C., Landwehr, N., Scheffer, T.: Active estimation of f-measures. In: Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., Culotta, A. (eds.) Advances in Neural Information Processing Systems, vol. 23. Curran Associates, Inc. (2010)
Scheffer, T., Decomain, C., Wrobel, S.: Active hidden markov models for information extraction. In: Hoffmann, F., Hand, D.J., Adams, N., Fisher, D., Guimaraes, G. (eds.) IDA 2001. LNCS, vol. 2189, pp. 309–318. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44816-0_31
Schuhmann, C., et al.: LAION-5b: an open large-scale dataset for training next generation image-text models. In: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022). https://openreview.net/forum?id=M3Y74vmsMcY
Sener, O., Savarese, S.: Active learning for convolutional neural networks: A core-set approach. arXiv preprint arXiv:1708.00489 (2017)
Settles, B.: Active learning literature survey (2009)
Siddhant, A., Lipton, Z.C.: Deep bayesian active learning for natural language processing: Results of a large-scale empirical study. arXiv preprint arXiv:1808.05697 (2018)
Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013)
Stallkamp, J., Schlipsing, M., Salmen, J., Igel, C.: The german traffic sign recognition benchmark: a multi-class classification competition. In: The 2011 International Joint Conference on Neural Networks, pp. 1453–1460. IEEE (2011)
Taylor, J., Earnshaw, B., Mabey, B., Victors, M., Yosinski, J.: Rxrx1: an image set for cellular morphological variation across many experimental batches. In: International Conference on Learning Representations (ICLR) (2019)
Tillé, Y.: Sampling and estimation from finite populations. John Wiley & Sons (2020)
Veeling, B.S., Linmans, J., Winkens, J., Cohen, T., Welling, M.: Rotation equivariant CNNs for digital pathology. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 210–218. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_24
Wald, Y., Feder, A., Greenfeld, D., Shalit, U.: On calibration and out-of-domain generalization. Adv. Neural. Inf. Process. Syst. 34, 2215–2227 (2021)
Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. Adv. Neural Inform. Process. Syst., 10506–10518 (2019)
Welinder, P., Welling, M., Perona, P.: A lazy man’s approach to benchmarking: Semisupervised classifier evaluation and recalibration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2013)
Wenzel, F., et al.: Assaying out-of-distribution generalization in transfer learning. Adv. Neural. Inf. Process. Syst. 35, 7181–7198 (2022)
Wu, C., Sitter, R.R.: A model-calibration approach to using complete auxiliary information from survey data. J. Am. Stat. Assoc. 96(453), 185–193 (2001)
Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: large-scale scene recognition from abbey to zoo. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492 (June 2010). https://doi.org/10.1109/CVPR.2010.5539970
Yu, Y., Bates, S., Ma, Y., Jordan, M.: Robust calibration with multi-domain temperature scaling. Adv. Neural. Inf. Process. Syst. 35, 27510–27523 (2022)
Yu, Y., Yang, Z., Wei, A., Ma, Y., Steinhardt, J.: Predicting out-of-distribution error with the projection norm. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S. (eds.) Proceedings of the 39th International Conference on Machine Learning, 17–23 Jul. Proceedings of Machine Learning Research, vol. 162, pp. 25721–25746. PMLR (2022)
Zhai, Xet al.: The visual task adaptation benchmark (2020). https://openreview.net/forum?id=BJena3VtwS
Zrnic, T., Candès, E.J.: Active statistical inference. arXiv preprint arXiv:2403.03208 (2024)
Zrnic, T., Candès, E.J.: Cross-prediction-powered inference. Proc. Nat. Acad. Sci. 121(15), e2322083121 (2024)
Acknowledgments
We thank the anonymous reviewers for their encouraging comments and valuable suggestions that have improved our manuscript. We also thank Tijana Zrnic for highlighting the connections between prediction-powered inference and our work, as well as Georgy Noarov for pointing out the link between our results and decompositions for proper scoring rules.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Fogliato, R., Patil, P., Monfort, M., Perona, P. (2025). A Framework for Efficient Model Evaluation Through Stratification, Sampling, and Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15146. Springer, Cham. https://doi.org/10.1007/978-3-031-73223-2_9
Download citation
DOI: https://doi.org/10.1007/978-3-031-73223-2_9
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73222-5
Online ISBN: 978-3-031-73223-2
eBook Packages: Computer ScienceComputer Science (R0)