research-article

A Novel Hybrid Approach for Multi-Dimensional Data Anonymization for Apache Spark

Authors:

Sibghat Ullah Bazai,

Julian Jang-Jaccard,

Hooman Alavizadeh

ACM Transactions on Privacy and Security, Volume 25, Issue 1

Article No.: 5, Pages 1 - 25

https://doi.org/10.1145/3484945

Published: 23 November 2021

Abstract

Multi-dimensional data anonymization approaches (e.g., Mondrian) provide more fine-grained data privacy by applying a different anonymization strategy to each attribute. Many variations of multi-dimensional anonymization have been implemented on distributed processing platforms (e.g., MapReduce, Spark) to take advantage of their support for scalability and parallelism. According to our critical analysis of overheads, neither the existing iteration-based nor the recursion-based approaches provide effective mechanisms for creating the optimal number and relative size of resilient distributed datasets (RDDs), and thus they suffer heavily from performance overheads. To solve this issue, we propose a novel hybrid approach for effectively implementing a multi-dimensional data anonymization strategy (e.g., Mondrian) that is scalable and provides high performance. Our hybrid approach provides a mechanism that creates far fewer RDDs, with smaller partitions attached to each RDD, than existing approaches. This optimal approach to RDD creation and operations is critical for many multi-dimensional data anonymization applications that create tremendous execution complexity. The new mechanism in our proposed hybrid approach can dramatically reduce the critical overheads involved in re-computation cost, shuffle operations, message exchange, and cache management.
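For readers unfamiliar with Mondrian, the following is a minimal single-machine sketch of the recursive median-split idea behind multi-dimensional k-anonymization. It is not the hybrid approach proposed in the article and does not use Spark or RDDs. The function names (`mondrian`, `generalize`), the restriction to numeric quasi-identifiers, the use of raw (unnormalized) value ranges to pick the split attribute, and the sample records are all assumptions made for illustration.

```python
from typing import List, Tuple

# A record holds only numeric quasi-identifier values (illustrative assumption).
Record = Tuple[int, ...]


def mondrian(records: List[Record], k: int) -> List[List[Record]]:
    """Recursively split records so that every partition keeps at least k records."""
    if not records:
        return []
    dims = range(len(records[0]))
    # Try attributes from the widest to the narrowest value range.
    # Raw ranges are used here for simplicity; normalized ranges are a common refinement.
    by_width = sorted(
        dims,
        key=lambda d: max(r[d] for r in records) - min(r[d] for r in records),
        reverse=True,
    )
    for d in by_width:
        values = sorted(r[d] for r in records)
        split = values[len(values) // 2]  # median value of attribute d
        left = [r for r in records if r[d] < split]
        right = [r for r in records if r[d] >= split]
        # A cut is allowable only if both sides still satisfy k-anonymity.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    # No allowable cut on any attribute: this partition is final.
    return [records]


def generalize(partition: List[Record]) -> Tuple[Tuple[int, int], ...]:
    """Replace each attribute by the (min, max) range observed in the partition."""
    dims = range(len(partition[0]))
    return tuple(
        (min(r[d] for r in partition), max(r[d] for r in partition)) for d in dims
    )


# Hypothetical sample data: (age, income) pairs.
sample = [(25, 1000), (27, 1000), (31, 2000), (35, 2000), (40, 3000), (45, 3000)]
partitions = mondrian(sample, k=2)
ranges = [generalize(p) for p in partitions]
```

Each recursive call in this sketch corresponds to one more partition to materialize; in a distributed setting that maps onto additional intermediate datasets, which is the kind of overhead the abstract refers to.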


Index Terms

A Novel Hybrid Approach for Multi-Dimensional Data Anonymization for Apache Spark
1. Security and privacy
  1. Human and societal aspects of security and privacy
    1. Privacy protections

Information & Contributors

Information

Published In

ACM Transactions on Privacy and Security Volume 25, Issue 1

February 2022

219 pages

ISSN: 2471-2566

EISSN: 2471-2574

DOI: 10.1145/3485162

Editor:
Ninghui Li
Purdue University, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 November 2021

Accepted: 01 September 2021

Revised: 01 August 2021

Received: 01 October 2020

Published in TOPS Volume 25, Issue 1

Qualifiers

Research-article
Refereed

Cited By

Wang W, Xue T, Li S, Wang Z, Zhang B, Wen Y. (2024). Interpretable Risk-aware Access Control for Spark: Blocking Attack Purpose Behind Actions. 2024 IEEE 42nd International Conference on Computer Design (ICCD), 122-129. Online publication date: 18-Nov-2024.
https://doi.org/10.1109/ICCD63220.2024.00028
Li B, Pei Z, Zhang C, Hao F. (2023). Efficient Associate Rules Mining Based on Topology for Items of Transactional Data. Mathematics 11:2 (401). Online publication date: 12-Jan-2023.
https://doi.org/10.3390/math11020401
Bao Y, Lu R, Zhang S, Cheng X, Lian H, Guan Y, Qiu W. (2023). Lightweight and Bilateral Controllable Data Sharing for Secure Autonomous Vehicles Platooning Service. IEEE Transactions on Vehicular Technology, 1-16. Online publication date: 2023.
https://doi.org/10.1109/TVT.2023.3285306
Tabassum I, Bazai S, Zaland Z, Marjan S, Khan M, Ghafoor M. (2022). Cyber Security's Silver Bullet - A Systematic Literature Review of AI-Powered Security. 2022 3rd International Informatics and Software Engineering Conference (IISEC), 1-7. Online publication date: 15-Dec-2022.
https://doi.org/10.1109/IISEC56263.2022.9998305
Tareen S, Bazai S, Ullah S, Ullah R, Marjan S, Ghafoor M. (2022). Phishing and Intrusion Attacks: An Overview of Classification Mechanisms. 2022 3rd International Informatics and Software Engineering Conference (IISEC), 1-5. Online publication date: 15-Dec-2022.
https://doi.org/10.1109/IISEC56263.2022.9998205
Majeed A, Khan S, Hwang S. (2022). Toward Privacy Preservation Using Clustering Based Anonymization: Recent Advances and Future Research Outlook. IEEE Access 10, 53066-53097. Online publication date: 2022.
https://doi.org/10.1109/ACCESS.2022.3175219
Bazai S, Ghafoor M, Aqeel M, Roomi M. (2021). Kernel Virtual Machine based High Performance Environment for Grid and Jungle Computing. 2021 2nd International Informatics and Software Engineering Conference (IISEC), 1-6. Online publication date: 16-Dec-2021.
https://doi.org/10.1109/IISEC54230.2021.9672355
