Abstract
Deep Web crawling refers to the problem of traversing the collection of pages in a deep Web site, which are dynamically generated in response to a particular query that is submitted using a search form. To achieve this, crawlers need to be endowed with some features that go beyond merely following links, such as the ability to automatically discover search forms that are entry points to the deep Web, fill in such forms, and follow certain paths to reach the deep Web pages with relevant information. Current surveys that analyse the state of the art in deep Web crawling do not provide a framework for comparing the most up-to-date proposals across all the aspects involved in the deep Web crawling process. In this article, we propose a framework that analyses the main features of existing deep Web crawling techniques, including the most recent proposals, and provides an overall picture of deep Web crawling, including novel features that previous surveys have not analysed. Our main conclusion is that crawler evaluation is an immature research area because it lacks a standard set of performance measures as well as a benchmark or publicly available dataset with which to evaluate crawlers. In addition, we conclude that future work in this area should focus on devising crawlers that deal with ever-evolving Web technologies and on improving crawling efficiency and scalability, in order to create effective crawlers that can operate in real-world contexts.
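As a rough illustration of the first two capabilities mentioned above (discovering a search form and filling it in), the following minimal Python sketch may help. It is not taken from any of the surveyed proposals: the names `FormFinder`, `find_search_forms`, and `build_query_url` are hypothetical, and the heuristic that a search form is a GET form with at least one free-text field is an assumption made only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormFinder(HTMLParser):
    """Collects every <form> with its action, method, and free-text field names."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
        elif tag == "input" and self._current is not None:
            kind = (attrs.get("type") or "text").lower()
            # Only named free-text inputs are treated as query fields.
            if attrs.get("name") and kind in ("text", "search"):
                self._current["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def find_search_forms(html):
    """Assumed heuristic: a search form uses GET and has a free-text field."""
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return [f for f in finder.forms if f["method"] == "get" and f["fields"]]


def build_query_url(page_url, form, keyword):
    """Fills the first text field with `keyword` and returns the query URL.

    Simplification: assumes the form action carries no query string of its own.
    """
    target = urljoin(page_url, form["action"])
    return target + "?" + urlencode({form["fields"][0]: keyword})
```

A crawler built along these lines would fetch each URL returned by `build_query_url` and then follow links in the result pages; real proposals additionally handle POST forms, select boxes, JavaScript-driven interfaces, and the choice of query values.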
Acknowledgements
The authors would like to thank Dr. Rafael Corchuelo for his support and assistance throughout the entire research process that led to this article, and for his helpful and constructive comments, which greatly contributed to improving the article. They would also like to thank the anonymous reviewers of this and past submissions, whose comments have helped shape the current version. This work was supported by the European Commission (FEDER) and by the Spanish and Andalusian R&D&I programmes (grants TIN2016-75394-R and TIN2013-40848-R).
Cite this article
Hernández, I., Rivero, C.R. & Ruiz, D. Deep Web crawling: a survey. World Wide Web 22, 1577–1610 (2019). https://doi.org/10.1007/s11280-018-0602-1
DOI: https://doi.org/10.1007/s11280-018-0602-1