Abstract
The utility of an anonymized data set can be operationalized as the amount of information lost during anonymization. To investigate the possible degradation of the relationships between variables after anonymization, and hence to measure that loss, we perform an a posteriori analysis of variance in which several anonymized scenarios are compared with the original data. Differential privacy is applied as the data anonymization process. We assess data utility based on the agreement between the original data structure and the anonymized structures. Data quality and utility are quantified by standard metrics and by characteristics of the groups obtained. In addition, we use analysis of variance to show how the estimates change. For illustration, we apply this approach to Brazilian Higher Education data, focusing on the main effects and interaction terms involving gender. The findings indicate that blindly using anonymized data for scientific purposes could undermine the validity of the conclusions.
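The workflow described in the abstract can be sketched in miniature: anonymize a variable with a differentially private mechanism, then run the same analysis of variance on both versions and compare the estimates. The sketch below uses the Laplace mechanism and synthetic scores for two hypothetical groups; the group means, sensitivity, and epsilon values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def one_way_anova_f(*groups):
    """One-way ANOVA F statistic: between-group vs. within-group variance."""
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    k, n = len(groups), all_vals.size
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def laplace_mechanism(values, sensitivity, epsilon, rng):
    """Record-level Laplace mechanism: noise scale = sensitivity / epsilon."""
    return values + rng.laplace(0.0, sensitivity / epsilon, size=values.shape)

rng = np.random.default_rng(42)

# Synthetic exam scores for two groups (e.g., a gender split) -- illustrative only.
group_a = rng.normal(62.0, 10.0, 500)
group_b = rng.normal(58.0, 10.0, 500)

# ANOVA on the original data.
f_orig = one_way_anova_f(group_a, group_b)

# ANOVA after anonymization; scores range over [0, 100], so sensitivity = 100.
# A small epsilon (strong privacy) injects heavy noise.
eps = 0.1
noisy_a = laplace_mechanism(group_a, sensitivity=100.0, epsilon=eps, rng=rng)
noisy_b = laplace_mechanism(group_b, sensitivity=100.0, epsilon=eps, rng=rng)
f_anon = one_way_anova_f(noisy_a, noisy_b)

print(f"F (original):   {f_orig:.2f}")
print(f"F (eps={eps}):  {f_anon:.2f}")
```

With strong privacy settings, the between-group signal that was clearly visible in the original data can be drowned out by the injected noise, which is precisely the kind of estimate degradation the a posteriori ANOVA comparison is meant to expose.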
Acknowledgements
This work was partially funded by FCT - Fundação para a Ciência e a Tecnologia through project number CEMAPRE/REM - UIDB/05069/2020 and by FCT/MCTES through national funds, co-funded by EU funds where applicable, under project UIDB/50008/2020.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Ferrão, M.E., Prata, P., Fazendeiro, P. (2023). Anonymized Data Assessment via Analysis of Variance: An Application to Higher Education Evaluation. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023 Workshops. ICCSA 2023. Lecture Notes in Computer Science, vol 14105. Springer, Cham. https://doi.org/10.1007/978-3-031-37108-0_9
DOI: https://doi.org/10.1007/978-3-031-37108-0_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-37107-3
Online ISBN: 978-3-031-37108-0