[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Discovering the skyline of web databases

Published: 01 March 2016 Publication History

Abstract

Many web databases are "hidden" behind proprietary search interfaces that enforce the top-k output constraint, i.e., each query returns at most k of all matching tuples, preferentially selected and returned according to a proprietary ranking function. In this paper, we initiate research into the novel problem of skyline discovery over top-k hidden web databases. Since skyline tuples provide critical insights into the database and include the top-ranked tuple for every possible ranking function following the monotonic order of attribute values, skyline discovery from a hidden web database can enable a wide variety of innovative third-party applications over one or multiple web databases. Our research in the paper shows that the critical factor affecting the cost of skyline discovery is the type of search interface controls provided by the website. As such, we develop efficient algorithms for three most popular types, i.e., one-ended range, free range and point predicates, and then combine them to support web databases that feature a mixture of these types. Rigorous theoretical analysis and extensive real-world online and offline experiments demonstrate the effectiveness of our proposed techniques and their superiority over baseline solutions.

References

[1]
B. Arai, G. Das, D. Gunopulos, and N. Koudas. Anytime measures for top-k algorithms. In VLDB, 2007.
[2]
A. Asudeh, S. Thirumuruganathan, N. Zhang, and G. Das. Discovering the skyline of web databases. CoRR, abs/1512.02138, 2015.
[3]
A. Asudeh, G. Zhang, N. Hassan, C. Li, and G. Zaruba. Crowdsourcing pareto-optimal object finding by pairwise comparisons. CIKM, 2015.
[4]
W.-T. Balke, U. Güntzer, and J. X. Zheng. Efficient distributed skylining for web information systems. In EDBT, 2004.
[5]
S. Borzsony, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, 2001.
[6]
C. Buchta. On the average number of maxima in a set of vectors. Information Processing Letters, 33(2), 1989.
[7]
J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting. In ICDE, 2003.
[8]
A. Dasgupta, G. Das, and H. Mannila. A random walk approach to sampling hidden databases. In SIGMOD, 2007.
[9]
A. Dasgupta, N. Zhang, and G. Das. Leveraging count information in sampling hidden databases. In ICDE, 2009.
[10]
A. Dasgupta, N. Zhang, and G. Das. Turbo-charging hidden database samplers with overflowing queries and skew reduction. In EDBT, 2010.
[11]
E. Dellis and B. Seeger. Efficient computation of reverse skyline queries. In VLDB, 2007.
[12]
Z. Gong, G.-Z. Sun, J. Yuan, and Y. Zhong. Efficient top-k query algorithms using k-skyband partition. In Scalable Information Systems. Springer, 2009.
[13]
I. F. Ilyas, G. Beskales, and M. A. Soliman. A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys (CSUR), 40(4), 2008.
[14]
D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. In VLDB, 2002.
[15]
X. Lin, Y. Yuan, W. Wang, and H. Lu. Stabbing the sky: Efficient skyline computation over sliding windows. In ICDE, 2005.
[16]
X. Lin, Y. Yuan, Q. Zhang, and Y. Zhang. Selecting stars: The k most representative skyline operator. In ICDE, 2007.
[17]
T. Liu, F. Wang, and G. Agrawal. Stratified sampling for data mining on the deep web. Frontiers of Computer Science, 6(2):179--196, 2012.
[18]
E. Lo, K. Y. Yip, K.-I. Lin, and D. W. Cheung. Progressive skylining over web-accessible databases. Data & Knowledge Engineering, 2006.
[19]
J. Madhavan, D. Ko, Ł. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy. Google's deep web crawl. VLDB, 2008.
[20]
D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal and progressive algorithm for skyline queries. In SIGMOD, 2003.
[21]
J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic skylines on uncertain data. In VLDB, 2007.
[22]
S. Raghavan and H. Garcia-Molina. Crawling the hidden web. VLDB, 2000.
[23]
C. Sheng, N. Zhang, Y. Tao, and X. Jin. Optimal algorithms for crawling a hidden database in the web. VLDB, 2012.
[24]
K.-L. Tan, P.-K. Eng, B. C. Ooi, et al. Efficient progressive skyline computation. In VLDB, 2001.
[25]
F. Wang and G. Agrawal. Effective and efficient sampling methods for deep web aggregation queries. In EDBT, 2011.
[26]
M. L. Yiu and N. Mamoulis. Efficient processing of top-k dominating queries on multi-dimensional data. In VLDB, 2007.
[27]
N. Zhang, C. Li, N. Hassan, S. Rajasekaran, and G. Das. On skyline groups. TKDE, 26(4), 2014.

Cited By

View all
  • (2024)Data distribution tailoring revisited: cost-efficient integration of representative dataThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00849-w33:5(1283-1306)Online publication date: 1-Sep-2024
  • (2022)On Finding Rank Regret RepresentativesACM Transactions on Database Systems10.1145/353105447:3(1-37)Online publication date: 18-Aug-2022
  • (2021)Tailoring data source distributions for fairness-aware data integrationProceedings of the VLDB Endowment10.14778/3476249.347629914:11(2519-2532)Online publication date: 27-Oct-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 9, Issue 7
March 2016
96 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 March 2016
Published in PVLDB Volume 9, Issue 7

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Data distribution tailoring revisited: cost-efficient integration of representative dataThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00849-w33:5(1283-1306)Online publication date: 1-Sep-2024
  • (2022)On Finding Rank Regret RepresentativesACM Transactions on Database Systems10.1145/353105447:3(1-37)Online publication date: 18-Aug-2022
  • (2021)Tailoring data source distributions for fairness-aware data integrationProceedings of the VLDB Endowment10.14778/3476249.347629914:11(2519-2532)Online publication date: 27-Oct-2021
  • (2021)RETRACTED ARTICLE: Efficiently harvesting deep web interfaces based on adaptive learning using two-phase data crawler frameworkSoft Computing - A Fusion of Foundations, Methodologies and Applications10.1007/s00500-021-05816-z27:1(505-515)Online publication date: 6-May-2021
  • (2020)Finding skyline communities in multi-valued networksThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-020-00618-529:6(1407-1432)Online publication date: 8-Jun-2020
  • (2019)A unified optimization algorithm for solving "regret-minimizing representative" problemsProceedings of the VLDB Endowment10.14778/3368289.336829113:3(239-251)Online publication date: 1-Nov-2019
  • (2019)RRRProceedings of the 2019 International Conference on Management of Data10.1145/3299869.3300080(263-280)Online publication date: 25-Jun-2019
  • (2019)Designing Fair Ranking SchemesProceedings of the 2019 International Conference on Management of Data10.1145/3299869.3300079(1259-1276)Online publication date: 25-Jun-2019
  • (2019)Deep Web crawlingWorld Wide Web10.1007/s11280-018-0602-122:4(1577-1610)Online publication date: 1-Jul-2019
  • (2019)Ranking the big skyKnowledge and Information Systems10.1007/s10115-018-1256-060:1(415-446)Online publication date: 1-Jul-2019
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media