Abstract
One of the key challenges in large attributed graph clustering is how to select representative attributes. Previous studies introduce user-guided clustering methods that let a user select samples based on his or her knowledge. However, because that knowledge is limited, a single user may pick out only the samples that he or she is familiar with while ignoring the others, so the selected samples are often biased. We propose a framework that addresses this issue by allowing multiple individuals to select samples for a specific clustering. With the wider knowledge of multiple users, the selected samples can be more relevant to the target cluster. The challenges of this study are twofold. First, because user-selected samples are usually sparse and the graph can be large, it is non-trivial to effectively combine the different annotations given by multiple users. Second, it is difficult to design a scalable approach to cluster large graphs with millions of nodes. We propose CGMA (Clustering Graphs with Multiple Annotations) to address these challenges. CGMA combines the crowd’s consensus opinions in an unbiased way and performs effective clustering with low time complexity. We demonstrate the effectiveness and efficiency of the proposed approach on real-world graphs by comparing it with existing attributed graph clustering approaches.
J. Cao—This work is supported by NSFC grants: 71331008, 61105124 and 61303017.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Cao, J., Wang, S., Qiao, F., Wang, H., Wang, F., Yu, P.S. (2016). User-Guided Large Attributed Graph Clustering with Multiple Sparse Annotations. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9651. Springer, Cham. https://doi.org/10.1007/978-3-319-31753-3_11
DOI: https://doi.org/10.1007/978-3-319-31753-3_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31752-6
Online ISBN: 978-3-319-31753-3
eBook Packages: Computer Science, Computer Science (R0)