Abstract
Few studies have addressed the problem of extracting data from a limited deep web database. We apply formal concept analysis to this problem and propose a novel algorithm called EdaliwdbFCA. Before a query Y is sent, the algorithm analyzes the local formal context K_L, which consists of the latest extracted data, and predicts the size of the query result from the cardinality of the extent X of the formal concept (X,Y) derived from K_L. It can thus be determined in advance whether Y qualifies as a query. Candidate query concepts are generated dynamically from the lower cover of the current concept (X,Y), so the method avoids building concrete concept lattices during extraction. In addition, two pruning rules are adopted to reduce redundant queries. Experiments were performed on controlled data sets and on real applications. The results confirm that the theory underlying the algorithm is correct and that the algorithm can be applied effectively in the real world.
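The two formal concept analysis operations the abstract relies on, the extent of a query Y in the local context and the lower cover of a concept (X,Y), can be illustrated with a minimal sketch. This is not the EdaliwdbFCA algorithm itself: the toy context, the function names (`extent`, `intent`, `predicted_size`, `lower_cover`) and the record identifiers are assumptions made for illustration, and the pruning rules and the rule that maps a predicted size to a query decision are not given in the abstract and are not modelled here.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# Hypothetical local formal context: record id -> attribute values seen so far.
Context = Dict[str, Set[str]]
Concept = Tuple[FrozenSet[str], FrozenSet[str]]  # (extent X, intent Y)


def all_attributes(ctx: Context) -> FrozenSet[str]:
    """Every attribute value that occurs in the context."""
    return frozenset().union(*ctx.values())


def extent(ctx: Context, attrs: Iterable[str]) -> FrozenSet[str]:
    """Records that carry every attribute in attrs (derivation Y -> X)."""
    wanted = frozenset(attrs)
    return frozenset(g for g, have in ctx.items() if wanted <= have)


def intent(ctx: Context, objs: Iterable[str]) -> FrozenSet[str]:
    """Attributes shared by every record in objs (derivation X -> Y)."""
    objs = list(objs)
    if not objs:
        # By convention the empty object set has all attributes in common.
        return all_attributes(ctx)
    return frozenset.intersection(*(frozenset(ctx[g]) for g in objs))


def predicted_size(ctx: Context, query: Iterable[str]) -> int:
    """Locally observed match count for a query: the cardinality of its extent."""
    return len(extent(ctx, query))


def lower_cover(ctx: Context, attrs: Iterable[str]) -> List[Concept]:
    """Concepts immediately below the concept generated by attrs.

    Each attribute m outside the closed intent Y generates a candidate
    concept from Y | {m}; the lower neighbours are the candidates whose
    extents are maximal under set inclusion. No full lattice is built.
    """
    y = intent(ctx, extent(ctx, attrs))  # close the attribute set first
    candidates: Dict[FrozenSet[str], FrozenSet[str]] = {}
    for m in all_attributes(ctx) - y:
        x = extent(ctx, y | {m})
        candidates[x] = intent(ctx, x)
    return [
        (x, yy)
        for x, yy in candidates.items()
        if not any(x < other for other in candidates)
    ]


# Toy data, invented for this example only.
ctx: Context = {
    "r1": {"db", "fca"},
    "r2": {"db", "web"},
    "r3": {"db", "web", "fca"},
    "r4": {"web"},
}
size_db = predicted_size(ctx, {"db"})
cover_db = lower_cover(ctx, {"db"})
```

In this toy context the query {"db"} matches three local records, and its lower cover contains the two concepts with intents {"db", "fca"} and {"db", "web"}, which would be the next candidate queries under the scheme the abstract describes.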
Notes
The DBLP Computer Science Bibliography. http://www.informatik.uni-trier.de/~ley/db/index.html, November, 2011.
Acknowledgements
I thank Dr. Lin Gan at the School of Computing in Wuhan University for providing helpful suggestions. I also thank the anonymous referees and editor for their constructive comments on earlier versions of the paper.
Cite this article
Zhang, Z., Du, J. & Wang, L. Formal concept analysis approach for data extraction from a limited deep web database. J Intell Inf Syst 41, 211–234 (2013). https://doi.org/10.1007/s10844-013-0242-y
DOI: https://doi.org/10.1007/s10844-013-0242-y