Abstract
Web forums have become important information sources on the Web because of the rich content contributed by millions of Internet users every day. Data extraction from Web pages is a key step in data analysis, but it is cumbersome because it requires significant human intervention. Web forums have fairly regular page structures, which allows extraction rules to be generated automatically from the paths of the nodes that hold the data. In this paper, we introduce formal expressions for XPath patterns and pattern mapping rules, and devise machine learning methods that generate extraction rules for automatic data extraction from Web forums. Experimental results on real-life Web forums indicate that the approach is feasible and accurate for forum data.
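The idea of deriving an extraction rule from the paths of a few labeled nodes can be illustrated with a minimal sketch. It is not the method of the paper: the toy page, the function names (`indexed_paths`, `path_of_text`, `generalize`, `to_xpath`, `extract`), and the simple index-generalization step standing in for the learning method are all assumptions made for illustration. Two labeled author fields are turned into indexed paths, the positions on which they disagree are replaced by a wildcard, and the resulting XPath is applied to the whole page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed forum page used only for illustration.
PAGE = """
<html><body>
  <div class="post"><span class="author">alice</span><p>First message</p></div>
  <div class="post"><span class="author">bob</span><p>Second message</p></div>
  <div class="post"><span class="author">carol</span><p>Third message</p></div>
</body></html>
""".strip()


def indexed_paths(root):
    """Yield (steps, element); steps is a list of (tag, position) pairs,
    where position is 1-based among siblings that share the same tag."""
    def walk(elem, steps):
        yield steps, elem
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, steps + [(child.tag, counts[child.tag])])
    yield from walk(root, [(root.tag, 1)])


def path_of_text(root, text):
    """Return the indexed path of the first element whose text equals `text`."""
    for steps, elem in indexed_paths(root):
        if (elem.text or "").strip() == text:
            return steps
    raise ValueError(f"no element with text {text!r}")


def generalize(paths):
    """Merge paths with identical tag sequences: keep a position where all
    examples agree, otherwise replace it with a wildcard (None)."""
    tags = [tag for tag, _ in paths[0]]
    if any([tag for tag, _ in p] != tags for p in paths):
        raise ValueError("example paths differ in shape")
    pattern = []
    for column in zip(*paths):
        positions = {pos for _, pos in column}
        pattern.append((column[0][0], positions.pop() if len(positions) == 1 else None))
    return pattern


def to_xpath(pattern):
    """Render the pattern as an ElementTree-compatible path relative to the root."""
    parts = [tag if pos is None else f"{tag}[{pos}]" for tag, pos in pattern[1:]]
    return "./" + "/".join(parts)


def extract(root, pattern):
    """Apply the generalized pattern and return the stripped text of each match."""
    return [(e.text or "").strip() for e in root.findall(to_xpath(pattern))]


root = ET.fromstring(PAGE)
examples = [path_of_text(root, "alice"), path_of_text(root, "bob")]
rule = generalize(examples)
authors = extract(root, rule)
print(to_xpath(rule), authors)
```

In this sketch the two examples differ only in the position of the `div` step, so that index becomes a wildcard and the rule also matches the third, unlabeled post. Real forum pages are rarely well-formed XML, so an HTML-tolerant parser would be needed in practice.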
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, J., Zhang, C., Qian, W., Zhou, A. (2011). Automatic Extraction Rules Generation Based on XPath Pattern Learning. In: Chiu, D.K.W., et al. Web Information Systems Engineering – WISE 2010 Workshops. WISE 2010. Lecture Notes in Computer Science, vol 6724. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24396-7_6
DOI: https://doi.org/10.1007/978-3-642-24396-7_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24395-0
Online ISBN: 978-3-642-24396-7
eBook Packages: Computer Science, Computer Science (R0)