[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3267809.3267833acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Uncertainty Propagation in Data Processing Systems

Published: 11 October 2018 Publication History

Abstract

We are seeing an explosion of uncertain data---i.e., data that is more properly represented by probability distributions or estimated values with error bounds rather than exact values---from sensors in IoT, sampling-based approximate computations and machine learning algorithms. In many cases, performing computations on uncertain data as if it were exact leads to incorrect results. Unfortunately, developing applications for processing uncertain data is a major challenge from both the mathematical and performance perspectives. This paper proposes and evaluates an approach for tackling this challenge in DAG-based data processing systems. We present a framework for uncertainty propagation (UP) that allows developers to modify precise implementations of DAG nodes to process uncertain inputs with modest effort. We implement this framework in a system called UP-MapReduce, and use it to modify ten applications, including AI/ML, image processing and trend analysis applications to process uncertain data. Our evaluation shows that UP-MapReduce propagates uncertainties with high accuracy and, in many cases, low performance overheads. For example, a social network trend analysis application that combines data sampling with UP can reduce execution time by 2.3x when the user can tolerate a maximum relative error of 5% in the final answer. These results demonstrate that our UP framework presents a compelling approach for handling uncertain data in DAG processing.

References

[1]
Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, and Ion Stoica. 2014. Knowing when you're wrong: building fast and reliable approximate query processing systems. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 481--492.
[2]
Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica. 2013. BlinkDB: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 29--42.
[3]
Kai O Arras. 1998. An Introduction To Error Propagation: Derivation, Meaning and Examples of Equation Cy= Fx Cx FxT. Technical Report EPFL-ASL-TR-98-01 R3. ETH Zurich.
[4]
James Bornholt, Todd Mytkowicz, and Kathryn S McKinley. 2014. Uncertain< T>: A first-order type for uncertain data. ACM SIGPLAN Notices 49, 4 (2014), 51--66.
[5]
Brett Boston, Adrian Sampson, Dan Grossman, and Luis Ceze. 2015. Probability type inference for flexible approximate programming. In ACM SIGPLAN Notices, Vol. 50. ACM, 470--487.
[6]
Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin C Rinard. 2012. Proving acceptability properties of relaxed nondeterministic approximate programs. ACM SIGPLAN Notices 47, 6 (2012), 169--180.
[7]
Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment 1, 2 (2008), 1265--1276.
[8]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113.
[9]
Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. 2012. Architecture support for disciplined approximate programming. In ACM SIGPLAN Notices, Vol. 47. ACM, 301--312.
[10]
Ludwig Fahrmeir and Gerhard Tutz. 2013. Multivariate statistical modelling based on generalized linear models. Springer Science & Business Media.
[11]
Íñigo Goiri, Ricardo Bianchini, Santosh Nagarakatte, and Thu D Nguyen. 2015. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 383--397.
[12]
Peter Hall, Byeong U Park, and Richard J Samworth. 2008. Choice of neighbor order in nearest-neighbor classification. The Annals of Statistics (2008), 2135--2152.
[13]
Jon C Helton and Freddie Joe Davis. 2003. Latin hypercube sampling and the propagation of uncertainty in analyses of complex systems. Reliability Engineering & System Safety 81, 1 (2003), 23--69.
[14]
Henry Hoffmann. 2015. JouleGuard: energy guarantees for approximate applications. In Proceedings of the 25th Symposium on Operating Systems Principles. ACM, 198--214.
[15]
Anil K Jain. 2010. Data clustering: 50 years beyond K-means. Pattern recognition letters 31, 8 (2010), 651--666.
[16]
Shantanu Joshi and Christopher Jermaine. 2009. Sampling-based estimators for subset-based queries. The VLDB Journal---The International Journal on Very Large Data Bases 18, 1 (2009), 181--202.
[17]
Samuel Karlin and William J Studden. 1966. Tchebycheff systems: With applications in analysis and statistics. Interscience New York.
[18]
Patrick Kenny, Gilles Boulianne, and Pierre Dumouchel. 2005. Eigenvoice modeling with sparse training data. Speech and Audio Processing, IEEE Transactions on 13, 3 (2005), 345--354.
[19]
Sang Ho on Lee and Wei Chen. 2009. A comparative study of uncertainty propagation methods for black-box-type problems. Structural and Multidisciplinary Optimization 37, 3 (2009), 239--253.
[20]
Harsha V Madhyastha, Ethan Katz-Bassett, Thomas E Anderson, Arvind Krishnamurthy, and Arun Venkataramani. 2009. iPlane Nano: Path Prediction for Peer-to-Peer Applications. In NSDI, Vol. 9. 137--152.
[21]
Pascal Massart. 1990. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability (1990), 1269--1283.
[22]
Julian Mcauley and Jure Leskovec. 2014. Discovering Social Circles in Ego Networks. ACM Trans. Knowl. Discov. Data 8, 1, Article 4 (Feb. 2014), 28 pages.
[23]
Joshua San Miguel, Mario Badr, and Natalie Enright Jerger. 2014. Load value approximation. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 127--139.
[24]
Sasa Misailovic, Michael Carbin, Sara Achour, Zichao Qi, and Martin C Rinard. 2014. Chisel: Reliability-and accuracy-aware optimization of approximate computational kernels. In ACM SIGPLAN Notices, Vol. 49. ACM, 309--328.
[25]
Sasa Misailovic, Daniel M Roy, and Martin C Rinard. 2011. Probabilistically accurate program transformations. In International Static Analysis Symposium. Springer, 316--333.
[26]
Kevin P Murphy. 2012. Machine learning: a probabilistic perspective. MIT press.
[27]
Yousef Saad. 2003. Iterative methods for sparse linear systems. Siam.
[28]
Mehrzad Samadi, Davoud Anoushe Jamshidi, Janghaeng Lee, and Scott Mahlke. 2014. Paraprox: Pattern-based approximation for data parallel applications. In ACM SIGARCH Computer Architecture News, Vol. 42. ACM, 35--50.
[29]
Adrian Sampson, Jacob Nelson, Karin Strauss, and Luis Ceze. 2014. Approximate Storage in Solid-State Memories. ACM Trans. Comput. Syst. 32, 3, Article 9 (Sept. 2014), 23 pages.
[30]
Adrian Sampson, Pavel Panchekha, Todd Mytkowicz, Kathryn S McKinley, Dan Grossman, and Luis Ceze. 2014. Expressing and verifying probabilistic assertions. ACM SIGPLAN Notices 49, 6 (2014), 112--122.
[31]
George AF Seber and Alan J Lee. 2012. Linear regression analysis. Vol. 936. John Wiley & Sons.
[32]
Apache Spark. 2016. Apache Spark: Lightning-fast cluster computing. URL http://spark. apache. org (2016).
[33]
M. A. Turk and A. P. Pentland. 1991. Face recognition using eigenfaces. In Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 586--591.
[34]
Steven Vajda. 2014. Probabilistic programming. Academic Press.
[35]
Xiaofei Wang, Min Chen, Tarik Taleb, Adlen Ksentini, and Victor Leung. 2014. Cache in the air: exploiting content caching and delivery techniques for 5G systems. IEEE Communications Magazine 52, 2 (2014), 131--139.
[36]
Jyh-Jong Wei, Chuang-Jan Chang, Nai-Kuan Chou, and Gwo-Jen Jan. 2001. ECG data compression using truncated singular value decomposition. Information Technology in Biomedicine, IEEE Transactions on 5, 4 (2001), 290--299.
[37]
Tom White. 2012. Hadoop: The definitive guide. O'Reilly Media, Inc.
[38]
Sai Wu, Beng Chin Ooi, and Kian-Lee Tan. 2010. Continuous sampling for online aggregation over multiple queries. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 651--662.
[39]
Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: cluster computing with working sets. HotCloud 10 (2010), 10--16.
[40]
Andreas Züfle, Tobias Emrich, Klaus Arthur Schmid, Nikos Mamoulis, Arthur Zimek, and Matthias Renz. 2014. Representative clustering of uncertain data. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 243--252.

Cited By

View all
  • (2022)Taming System Dynamics on Resource Optimization for Data Processing Workflows: A Probabilistic ApproachIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.309140033:1(231-248)Online publication date: 1-Jan-2022
  • (2022)The Laplace Microarchitecture for Tracking Data UncertaintyIEEE Micro10.1109/MM.2022.316606742:4(78-86)Online publication date: 1-Jul-2022
  • (2021)Performance Analysis of Naïve Bayes Classifier Over Similarity Score-Based Techniques for Missing Link Prediction in Ego NetworksJournal of Information Technology Research10.4018/JITR.202101010714:1(110-122)Online publication date: 23-Mar-2021
  • Show More Cited By

Index Terms

  1. Uncertainty Propagation in Data Processing Systems

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
    October 2018
    546 pages
    ISBN:9781450360111
    DOI:10.1145/3267809
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 October 2018

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. DAG Data Processing
    2. Uncertainty Propagation

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SoCC '18
    Sponsor:
    SoCC '18: ACM Symposium on Cloud Computing
    October 11 - 13, 2018
    CA, Carlsbad, USA

    Acceptance Rates

    Overall Acceptance Rate 169 of 722 submissions, 23%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)49
    • Downloads (Last 6 weeks)6
    Reflects downloads up to 13 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Taming System Dynamics on Resource Optimization for Data Processing Workflows: A Probabilistic ApproachIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.309140033:1(231-248)Online publication date: 1-Jan-2022
    • (2022)The Laplace Microarchitecture for Tracking Data UncertaintyIEEE Micro10.1109/MM.2022.316606742:4(78-86)Online publication date: 1-Jul-2022
    • (2021)Performance Analysis of Naïve Bayes Classifier Over Similarity Score-Based Techniques for Missing Link Prediction in Ego NetworksJournal of Information Technology Research10.4018/JITR.202101010714:1(110-122)Online publication date: 23-Mar-2021
    • (2021)The Laplace Microarchitecture for Tracking Data Uncertainty and Its Implementation in a RISC-V ProcessorMICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3466752.3480131(1254-1269)Online publication date: 18-Oct-2021
    • (2021)A Method for Debugging Process Discovery Pipelines to Analyze the Consistency of Model PropertiesBusiness Process Management10.1007/978-3-030-85469-0_7(65-84)Online publication date: 28-Aug-2021
    • (2020)A Novel Method to Dynamically Fix Threshold for Node Neighbourhood Based Link Prediction TechniquesInternational Journal of Information Technology and Web Engineering10.4018/IJITWE.202001010215:1(17-34)Online publication date: Jan-2020
    • (2019)Incorporating Probabilistic Optimizations for Resource Provisioning of Data Processing WorkflowsProceedings of the 48th International Conference on Parallel Processing10.1145/3337821.3337847(1-10)Online publication date: 5-Aug-2019
    • (2019)Online Adaptive Approximate Stream Processing with Customized Error ControlIEEE Access10.1109/ACCESS.2019.2899825(1-1)Online publication date: 2019

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media