Abstract
The Web has been rapidly “deepened” with the prevalence of databases online. On this “deep Web,” numerous sources are structured, providing schema-rich data. Their schemas define the object domain and its query capabilities. This paper proposes clustering sources by their query schemas, which organizes sources with similar query capabilities and is therefore critical for enabling both source selection and query mediation. Abstractly, this problem is one of clustering categorical data (viewing each query schema as a transaction). Our approach hypothesizes that “homogeneous sources” are characterized by the same hidden generative model for their schemas. To find clusters governed by such statistical distributions, we propose a novel objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation shows that, on clustering Web query schemas, the model-differentiation function outperforms existing objective functions when used with the hierarchical agglomerative clustering algorithm.
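The idea of treating each query schema as a transaction and merging the clusters that are statistically least distinguishable can be illustrated with a small sketch. The abstract does not specify the test statistic or the exact form of the model-differentiation objective, so the code below makes its own assumptions: it uses the Pearson chi-square statistic for homogeneity as a stand-in, a greedy agglomerative loop, and hypothetical names (`chi_square`, `cluster_schemas`) and toy schemas. It should be read as one possible illustration, not as the authors' algorithm.

```python
from collections import Counter
from itertools import combinations


def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for homogeneity of two attribute-count
    vectors (assumed stand-in; the abstract does not name the test)."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand = total_a + total_b
    stat = 0.0
    for attr in set(counts_a) | set(counts_b):
        column = counts_a[attr] + counts_b[attr]
        for counts, row_total in ((counts_a, total_a), (counts_b, total_b)):
            expected = row_total * column / grand
            stat += (counts[attr] - expected) ** 2 / expected
    return stat


def cluster_schemas(schemas, k):
    """Greedy hierarchical agglomerative clustering over non-empty schemas.

    Each schema is a list of attribute names (a transaction). At every step
    the two clusters with the smallest chi-square statistic, i.e. the pair
    that looks most like one shared model, are merged until k remain.
    """
    clusters = [([i], Counter(s)) for i, s in enumerate(schemas)]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: chi_square(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(members) for members, _ in clusters)


# Hypothetical toy query schemas: two book sources and two airfare sources.
schemas = [
    ["title", "author", "isbn"],
    ["title", "author", "publisher"],
    ["from", "to", "departure_date"],
    ["from", "to", "passengers"],
]
print(cluster_schemas(schemas, 2))
```

On these toy schemas the two book sources end up in one cluster and the two airfare sources in the other, because schemas that share attributes yield a lower statistic than schemas with disjoint attribute sets.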
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
He, B., Tao, T., Chang, K.C.C. (2004). Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach. In: Lindner, W., Mesiti, M., Türker, C., Tzitzikas, Y., Vakali, A.I. (eds) Current Trends in Database Technology - EDBT 2004 Workshops. EDBT 2004. Lecture Notes in Computer Science, vol 3268. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30192-9_53
DOI: https://doi.org/10.1007/978-3-540-30192-9_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23305-3
Online ISBN: 978-3-540-30192-9
eBook Packages: Computer Science, Computer Science (R0)