Abstract
The utility of an anonymized data set can be operationalized as the amount of information lost during anonymization. To investigate the possible degradation of the relationships between variables after anonymization, and hence to measure that loss, we perform an a posteriori analysis of variance in which several anonymized scenarios are compared with the original data. Differential privacy is applied as the data anonymization process. We assess data utility based on the agreement between the original data structure and the anonymized structures. Data quality and utility are quantified by standard metrics and by characteristics of the groups obtained. In addition, we use analysis of variance to show how the estimates change. For illustration, we apply this approach to Brazilian Higher Education data, focusing on the main effects and interaction terms involving gender. The findings indicate that blindly using anonymized data for scientific purposes could undermine the validity of the conclusions.
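The workflow described in the abstract can be sketched in miniature: anonymize a variable with a differentially private mechanism, then run the same analysis of variance on both versions and compare the estimates. The sketch below uses the Laplace mechanism and synthetic scores for two hypothetical groups; the group means, sensitivity, and epsilon values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: between-group vs. within-group variance."""
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    k, n = len(groups), all_vals.size
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Record-level Laplace mechanism: noise scale = sensitivity / epsilon."""
    return values + rng.laplace(0.0, sensitivity / epsilon, size=values.shape)

rng = np.random.default_rng(42)

# Synthetic exam scores for two groups (e.g., a gender split) -- illustrative only.
group_a = rng.normal(62.0, 10.0, 500)
group_b = rng.normal(58.0, 10.0, 500)

# ANOVA on the original data.
f_orig = one_way_anova_f(group_a, group_b)

# ANOVA after anonymization; scores range over [0, 100], so sensitivity = 100.
# A small epsilon (strong privacy) injects heavy noise.
eps = 0.1
noisy_a = laplace_mechanism(group_a, sensitivity=100.0, epsilon=eps, rng=rng)
noisy_b = laplace_mechanism(group_b, sensitivity=100.0, epsilon=eps, rng=rng)
f_anon = one_way_anova_f(noisy_a, noisy_b)

print(f"F (original):   {f_orig:.2f}")
print(f"F (eps={eps}):  {f_anon:.2f}")
```

With strong privacy settings, the between-group signal that was clearly visible in the original data can be drowned out by the injected noise, which is precisely the kind of estimate degradation the a posteriori ANOVA comparison is meant to expose.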
Acknowledgements
This work was partially funded by FCT - Fundação para a Ciência e a Tecnologia through project number CEMAPRE/REM - UIDB/05069/2020 and by FCT/MCTES through national funds, co-funded by EU funds where applicable, under project UIDB/50008/2020.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ferrão, M.E., Prata, P., Fazendeiro, P. (2023). Anonymized Data Assessment via Analysis of Variance: An Application to Higher Education Evaluation. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023 Workshops. ICCSA 2023. Lecture Notes in Computer Science, vol 14105. Springer, Cham. https://doi.org/10.1007/978-3-031-37108-0_9
DOI: https://doi.org/10.1007/978-3-031-37108-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37107-3
Online ISBN: 978-3-031-37108-0