Automatic Extraction Rules Generation Based on XPath Pattern Learning

Jingwei Zhang²³,
Can Zhang²³,
Weining Qian²³ &
…
Aoying Zhou²³

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6724)

Included in the following conference series:

International Conference on Web Information Systems Engineering

1053 Accesses
3 Citations

Abstract

Web forums have become important information sources on the Web because of the rich content that millions of Internet users contribute every day. Data extraction from Web pages is a key step in data analysis, but it is cumbersome because it requires significant human intervention. Web forums have fairly regular structures, which allows extraction rules to be generated automatically from the paths of their page elements. In this paper, we introduce formal expressions for XPath patterns and pattern mapping rules, and propose machine learning methods that generate extraction rules for automatic data extraction from Web forums. Experimental results on real-life Web forums show that the approach is feasible and accurate for forum data.
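The idea of deriving an extraction rule from the regular paths of forum pages can be illustrated with a minimal sketch. It is not the authors' algorithm: the sample page, the helper names (`indexed_path`, `generalize`, `to_xpath`) and the rule of dropping a position index wherever labelled examples disagree are all assumptions made for illustration, using only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed forum page; real pages would need an HTML parser.
PAGE = """<html><body>
<div class="post"><span>alice</span><p>First post</p></div>
<div class="post"><span>bob</span><p>Second post</p></div>
<div class="post"><span>carol</span><p>Third post</p></div>
</body></html>"""


def indexed_path(root, target):
    """Return [(tag, position), ...] leading from root down to target."""
    def walk(node, steps):
        if node is target:
            return steps
        counts = {}
        for child in node:
            # Position counts siblings with the same tag, as XPath does.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, steps + [(child.tag, counts[child.tag])])
            if found is not None:
                return found
        return None
    return walk(root, [])


def generalize(paths):
    """Merge example paths: keep a position only where all examples agree."""
    if len({len(p) for p in paths}) != 1:
        raise ValueError("examples have different path lengths")
    pattern = []
    for steps in zip(*paths):
        tags = {tag for tag, _ in steps}
        if len(tags) != 1:
            raise ValueError("examples do not share a tag path")
        positions = {pos for _, pos in steps}
        shared = positions.pop() if len(positions) == 1 else None
        pattern.append((steps[0][0], shared))
    return pattern


def to_xpath(pattern):
    """Render the pattern in the XPath subset ElementTree understands."""
    parts = [tag if pos is None else f"{tag}[{pos}]" for tag, pos in pattern]
    return "./" + "/".join(parts)


root = ET.fromstring(PAGE)
posts = root.findall("./body/div")
# Two labelled examples: the author field of the first two posts.
examples = [posts[0].find("span"), posts[1].find("span")]
rule = to_xpath(generalize([indexed_path(root, e) for e in examples]))
authors = [e.text for e in root.findall(rule)]
print(rule, authors)
```

Two labelled author nodes are enough here for the generalized rule to match the third post as well, which is the kind of regularity the abstract relies on.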

Author information

Authors and Affiliations

Institute of Massive Computing, East China Normal University, Shanghai, 200062, China
Jingwei Zhang, Can Zhang, Weining Qian & Aoying Zhou

Editor information

Editors and Affiliations

Dickson Computer Systems, 7A Victory Avenue 4/F Homantin, Kowloon, Hong Kong, China
Dickson K. W. Chiu
Ecole Nationale Supérieure de Mécanique et d’Aérotechnique, Laboratoire d’Informatique Scientifique et Industrielle, Téléport 2 - avenue Clément Ader, 86961, Futuroscope Chasseneuil Cedex, France
Ladjel Bellatreche
Dept. of Computer Science and Engineering, Ritsumeikan University, Wakakusa 6-4-10, 525-0045, Kusatu, Shiga, Japan
Hideyasu Sasaki
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong, China
Ho-fung Leung
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Shing-Chi Cheung
School of Computer Science, Hangzhou Dianzi University, Xiasha Higher Education Zone, 310018, Hangzhou City, Zhejiang, China
Haiyang Hu
Department of Computer Science and Software Engineering, The University of Melbourne, 3010, Parkville, Victoria, Australia
Jie Shao

About this paper

Cite this paper

Zhang, J., Zhang, C., Qian, W., Zhou, A. (2011). Automatic Extraction Rules Generation Based on XPath Pattern Learning. In: Chiu, D.K.W., et al. Web Information Systems Engineering – WISE 2010 Workshops. WISE 2010. Lecture Notes in Computer Science, vol 6724. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24396-7_6

DOI: https://doi.org/10.1007/978-3-642-24396-7_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24395-0
Online ISBN: 978-3-642-24396-7
eBook Packages: Computer Science, Computer Science (R0)
