DOI: 10.1145/2063576.2063824
Research article

The quality of the XML web

Published: 24 October 2011

Abstract

We collect evidence to answer the following question: Is the quality of the XML documents found on the web sufficient to apply XML technology such as XQuery, XPath and XSLT? XML collections from the web have previously been studied statistically, but no detailed information about the quality of the XML documents on the web is available to date. We address this shortcoming in this study. We gathered 180K XML documents from the web. Their quality is surprisingly good: 85.4% are well-formed and 99.5% of all specified encodings are correct. Validity, however, needs serious attention: only 25% of all files contain a reference to a DTD or XSD, and just one third of those are actually valid. Errors are studied in detail. Automatic error repair seems promising. Our study is well documented and easily repeatable. This paves the way for a periodic quality assessment of the XML web.
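As a rough illustration of the kinds of checks the abstract names (well-formedness, correctness of the declared encoding, presence of a DTD or XSD reference), the following Python sketch inspects a single document using only the standard library. It is not the authors' tooling: the function names `declared_encoding_ok` and `assess` and the report keys are hypothetical, the DOCTYPE test is a plain byte search that assumes an ASCII-compatible encoding, and no validation against the referenced DTD or XSD is attempted.

```python
import re
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
# Encoding name inside an XML declaration at the very start of the document.
DECL_RE = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_encoding_ok(data: bytes):
    """Return None if no encoding is declared, else whether the bytes decode with it."""
    match = DECL_RE.match(data)
    if match is None:
        return None
    try:
        data.decode(match.group(1).decode("ascii"))
    except (LookupError, UnicodeDecodeError):
        # Unknown encoding name, or bytes that are invalid in the declared encoding.
        return False
    # Weak check: single-byte encodings such as ISO-8859-1 accept any byte sequence.
    return True


def assess(data: bytes) -> dict:
    """Collect simple quality indicators for one XML document (hypothetical helper)."""
    report = {
        "well_formed": False,
        "encoding_ok": declared_encoding_ok(data),
        "has_dtd_ref": False,
        "has_xsd_ref": False,
    }
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return report  # not well-formed; the remaining checks are skipped
    report["well_formed"] = True
    # Simplification: a byte search can also match text inside a comment or CDATA section.
    report["has_dtd_ref"] = re.search(rb"<!DOCTYPE\s", data) is not None
    # An XSD reference is conventionally given on the root element via xsi attributes.
    report["has_xsd_ref"] = any(
        root.get(f"{{{XSI_NS}}}{name}") is not None
        for name in ("schemaLocation", "noNamespaceSchemaLocation")
    )
    return report


if __name__ == "__main__":
    sample = b'<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE note SYSTEM "note.dtd"><note/>'
    print(assess(sample))
```

Checking validity itself would require a validating parser and access to the referenced schema, which the standard library does not provide; a third-party library would be needed for that step.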

    Published In

    CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
    October 2011
    2712 pages
    ISBN: 9781450307178
    DOI: 10.1145/2063576
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. XML
    2. XML web
    3. data quality
    4. schemas

    Qualifiers

    • Research-article

    Conference

    CIKM '11
    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Cited By

    • (2022) Querying XML documents using Prolog engines. Information Processing and Management: an International Journal, 56:5, pp. 1753-1770. DOI: 10.1016/j.ipm.2019.05.011. Online publication date: 20-Apr-2022
    • (2021) Inferring Deterministic Regular Expression with Unorder and Counting. Database Systems for Advanced Applications, pp. 235-252. DOI: 10.1007/978-3-030-73197-7_15. Online publication date: 6-Apr-2021
    • (2019) A Large-Scale Repository of Deterministic Regular Expression Patterns and Its Applications. Advances in Knowledge Discovery and Data Mining, pp. 249-261. DOI: 10.1007/978-3-030-16142-2_20. Online publication date: 20-Mar-2019
    • (2018) Practical Study of Deterministic Regular Expressions from Large-scale XML and Schema Data. Proceedings of the 22nd International Database Engineering & Applications Symposium, pp. 45-53. DOI: 10.1145/3216122.3216126. Online publication date: 18-Jun-2018
    • (2015) Monitoring of Client-Cloud Interaction. Correct Software in Web Applications and Web Services, pp. 177-228. DOI: 10.1007/978-3-319-17112-8_6. Online publication date: 2015
    • (2014) Data quality: The other face of Big Data. 2014 IEEE 30th International Conference on Data Engineering, pp. 1294-1297. DOI: 10.1109/ICDE.2014.6816764. Online publication date: Mar-2014
    • (2013) Learning queries for relational, semi-structured, and graph databases. Proceedings of the 2013 SIGMOD/PODS Ph.D. symposium, pp. 19-24. DOI: 10.1145/2483574.2483576. Online publication date: 23-Jun-2013
    • (2013) A Grammatical Inference Approach to Language-Based Anomaly Detection in XML. Proceedings of the 2013 International Conference on Availability, Reliability and Security, pp. 685-693. DOI: 10.1109/ARES.2013.90. Online publication date: 2-Sep-2013
    • (2013) The quality of the XML Web. Web Semantics: Science, Services and Agents on the World Wide Web, 19, pp. 59-68. DOI: 10.1016/j.websem.2012.12.001. Online publication date: Mar-2013
    • (n.d.) The Quality of the XML Web. SSRN Electronic Journal. DOI: 10.2139/ssrn.3199002
