
A Novel Hybrid Approach for Multi-Dimensional Data Anonymization for Apache Spark

Published: 23 November 2021

Abstract

Multi-dimensional data anonymization approaches (e.g., Mondrian) provide more fine-grained data privacy by applying a different anonymization strategy to each attribute. Many variations of multi-dimensional anonymization have been implemented on distributed processing platforms (e.g., MapReduce, Spark) to take advantage of their scalability and parallelism. According to our critical analysis of overheads, neither the existing iteration-based nor the recursion-based approaches provide effective mechanisms for creating an optimal number and relative size of resilient distributed datasets (RDDs), and they therefore suffer heavily from performance overheads. To solve this issue, we propose a novel hybrid approach for effectively implementing a multi-dimensional data anonymization strategy (e.g., Mondrian) that is scalable and provides high performance. Our hybrid approach provides a mechanism that creates far fewer RDDs, with smaller partitions attached to each RDD, than existing approaches. This optimized approach to RDD creation and operations is critical for the many multi-dimensional data anonymization applications that incur tremendous execution complexity. The new mechanism in our proposed hybrid approach can dramatically reduce the critical overheads involved in re-computation cost, shuffle operations, message exchange, and cache management.
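
For readers unfamiliar with Mondrian, the following is a minimal single-machine sketch of the classic recursive median-split idea behind multi-dimensional k-anonymity. It is not the hybrid approach proposed in the article and uses no Spark API. The function names (`mondrian`, `generalize`), the sample records, and the choice of numeric-only quasi-identifiers are assumptions made purely for illustration.

```python
from statistics import median_low


def mondrian(records, k, qi_indices):
    """Recursively split records until no cut leaves both halves with >= k rows."""

    def span(dim):
        values = [r[dim] for r in records]
        return max(values) - min(values)

    # Try quasi-identifier dimensions from widest to narrowest value range.
    for dim in sorted(qi_indices, key=span, reverse=True):
        split = median_low([r[dim] for r in records])
        left = [r for r in records if r[dim] <= split]
        right = [r for r in records if r[dim] > split]
        # A cut is allowed only if both halves still hold at least k records.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k, qi_indices) + mondrian(right, k, qi_indices)
    return [records]  # no allowable cut: this group is a final partition


def generalize(partition, qi_indices):
    """Replace each quasi-identifier value with the partition's (min, max) range."""
    ranges = {d: (min(r[d] for r in partition), max(r[d] for r in partition))
              for d in qi_indices}
    return [tuple(ranges[i] if i in ranges else v for i, v in enumerate(r))
            for r in partition]


# Hypothetical sample data: (age, zip code, diagnosis); the first two are quasi-identifiers.
records = [
    (25, 53711, "flu"), (27, 53712, "cold"), (30, 53710, "flu"),
    (41, 53715, "asthma"), (45, 53703, "flu"), (52, 53706, "cold"),
]
partitions = mondrian(records, k=2, qi_indices=[0, 1])
anonymized = [row for p in partitions for row in generalize(p, [0, 1])]
```

Each recursive call here works on an in-memory list. The abstract's concern is what happens when such splits are carried out on Spark, where the number and size of the RDDs created along the way drive re-computation, shuffle, and caching costs.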

    Published In

    ACM Transactions on Privacy and Security, Volume 25, Issue 1
    February 2022
    219 pages
    ISSN: 2471-2566
    EISSN: 2471-2574
    DOI: 10.1145/3485162

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 November 2021
    Accepted: 01 September 2021
    Revised: 01 August 2021
    Received: 01 October 2020
    Published in TOPS Volume 25, Issue 1

    Author Tags

    1. Spark
    2. Mondrian
    3. data anonymization
    4. multi-dimensional data
    5. resilient distributed dataset (RDD)

    Qualifiers

    • Research-article
    • Refereed
