Research article. DOI: 10.1145/3308558.3313547

A Human-in-the-loop Attribute Design Framework for Classification

Published: 13 May 2019

Abstract

In this paper, we present a semi-automated, “human-in-the-loop” framework for attribute design that helps human analysts transform raw attributes into effective derived attributes for classification problems. The proposed framework is optimization-guided and fully agnostic to the underlying classification model. We present an algebra with arithmetic, relational, and logical operators for transforming raw attributes into derived attributes, and we solve two technical problems: (a) the top-k buckets design problem, which presents human analysts with k buckets, each containing promising raw attributes, so that they can focus on those buckets without examining all raw attributes; and (b) the top-l snippets generation problem, which iteratively provides human analysts with the top-l derived attributes involving a given attribute. For the former problem, we present an exact bottom-up algorithm with pruning, as well as random-walk-based heuristic algorithms that are intuitive and work well in practice. For the latter, we present a greedy heuristic algorithm that is scalable and effective. Rigorous evaluations on six real-world datasets show that our framework generates effective derived attributes compared with fully manual or fully automated methods.
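
As a rough illustration of the operator algebra and the top-l snippets idea described above, the following Python sketch combines one chosen raw attribute with every other raw attribute through a small set of operators and ranks the resulting derived attributes. It is not the authors' algorithm: the operator set, the use of empirical mutual information with the class label as the score, the toy data, and the names `OPERATORS`, `mutual_information`, and `top_l_snippets` are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical operator algebra: a few examples from each family
# named in the abstract (arithmetic, relational, logical).
OPERATORS = {
    "add": lambda a, b: a + b,                     # arithmetic
    "sub": lambda a, b: a - b,                     # arithmetic
    "gt": lambda a, b: int(a > b),                 # relational
    "and": lambda a, b: int(bool(a) and bool(b)),  # logical
}


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def top_l_snippets(data, labels, anchor, l):
    """Rank derived attributes op(anchor, other) and return the best l.

    The score is an assumed stand-in for the paper's objective.
    Scores are rounded so that ties are broken by name, not by float noise.
    """
    scored = []
    for other in data:
        if other == anchor:
            continue
        for name, op in OPERATORS.items():
            derived = [op(a, b) for a, b in zip(data[anchor], data[other])]
            score = round(mutual_information(derived, labels), 9)
            scored.append((score, f"{name}({anchor}, {other})"))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:l]


# Toy data (made up): the label is 1 exactly when income exceeds debt.
data = {
    "income": [5, 3, 8, 2, 6, 1],
    "debt": [4, 6, 2, 7, 1, 3],
}
labels = [1, 0, 1, 0, 1, 0]

for score, snippet in top_l_snippets(data, labels, anchor="income", l=2):
    print(f"{snippet}: {score:.3f} bits")
```

On this toy data the relational snippet `gt(income, debt)` reproduces the label exactly. The arithmetic snippet `sub(income, debt)` scores equally well only because every one of its values is unique, which shows that raw mutual information favors high-cardinality attributes; a practical scorer would likely discretize numeric results first.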



Published In

WWW '19: The World Wide Web Conference
May 2019
3620 pages
ISBN: 9781450366748
DOI: 10.1145/3308558
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IW3C2: International World Wide Web Conference Committee

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. attribute design
  2. crowdsourcing
  3. feature engineering
  4. human computation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '19: The Web Conference
May 13 - 17, 2019
San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%


Article Metrics

  • Downloads (Last 12 months): 23
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 12 Dec 2024

Cited By

  • (2024) Human-in-the-loop machine learning for healthcare: Current progress and future opportunities in electronic health records. Medicine Advances 2:3 (318-322). DOI: 10.1002/med4.70. Online publication date: 23-Aug-2024
  • (2023) HybridEval: A Human-AI Collaborative Approach for Evaluating Design Ideas at Scale. Proceedings of the ACM Web Conference 2023 (3837-3848). DOI: 10.1145/3543507.3583496. Online publication date: 30-Apr-2023
  • (2022) Efficient approximate top-k mutual information based feature selection. Journal of Intelligent Information Systems 61:1 (191-223). DOI: 10.1007/s10844-022-00750-4. Online publication date: 18-Oct-2022
  • (2021) Security and privacy in the Internet of Things: threats and challenges. Service Oriented Computing and Applications 15:4 (257-271). DOI: 10.1007/s11761-021-00327-z. Online publication date: 1-Dec-2021
