[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2463676.2465316acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

ODYS: an approach to building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS for higher-level functionality

Published: 22 June 2013 Publication History

Abstract

Recently, parallel search engines have been implemented based on scalable distributed file systems such as Google File System. However, we claim that building a massively-parallel search engine using a parallel DBMS can be an attractive alternative since it supports a higher-level (i.e., SQL-level) interface than that of a distributed file system for easy and less error-prone application development while providing scalability. Regarding higher-level functionality, we can draw a parallel with the traditional O/S file system vs. DBMS. In this paper, we propose a new approach of building a massively-parallel search engine using a DB-IR tightly-integrated parallel DBMS. To estimate the performance, we propose a hybrid (i.e., analytic and experimental) performance model for the parallel search engine. We argue that the model can accurately estimate the performance of a massively-parallel (e.g., 300-node) search engine using the experimental results obtained from a small-scale (e.g., 5-node) one. We show that the estimation error between the model and the actual experiment is less than 2.13% by observing that the bulk of the query processing time is spent at the slave (vs. at the master and network) and by estimating the time spent at the slave based on actual measurement. Using our model, we demonstrate a commercial-level scalability and performance of our architecture. Our proposed system ODYS is capable of handling 1 billion queries per day (81 queries/sec) for 30 billion Web pages by using only 43,472 nodes with an average query response time of 194 ms. By using twice as many (86,944) nodes, ODYS can provide an average query response time of 148 ms. These results show that building a massively-parallel search engine using a parallel DBMS is a viable approach with advantages of supporting the high-level (i.e., DBMS-level), SQL-like programming interface.

References

[1]
Abouzeid, A. et al., "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads," In Proc. Int'l Conf. on Very Large Data Bases (VLDB), pp. 922--933, Aug. 2009.
[2]
Azure, http://www.microsoft.com/windowsazure.
[3]
Barroso, L., Dean, J. and Holzle, U., "Web Search for a Planet: the Google Cluster Architecture," IEEE Micro, Vol. 23, No. 2, pp. 22--28, Mar. 2003.
[4]
Budiu, M., 2009, available at http://budiu.info/work/dryad-talk-berkeley09.pptx.
[5]
Cassandra, http://cassandra.apache.org.
[6]
Chang, F. et al., "Bigtable: A Distributed Storage System for Structured Data," In Proc. Symposium on Operating Systems Design and Implementation (OSDI), pp. 205--218, Nov. 2006.
[7]
Cooper, R., Introduction to Queuing Theory, North Holland, 2nd ed., 1981.
[8]
Dean, J., and Ghemawat, S., "MapReduce: Simplified Data Processing on Large Clusters," In Proc. Symposium on Operating Systems Design and Implementation (OSDI), pp. 137--150, Dec. 2004.
[9]
Dean, J., "Challenges in Building Large-Scale Information Retrieval Systems," In Proc. ACM Int'l Conf. on Web Search and Data Mining (WSDM) (an invited talk), p. 1, Feb. 2009 (presentation slides available at http://research.google.com/people/jeff/WSDM09-keynote.pdf).
[10]
Dean, J., "Designs, Lessons and Advice from Building Large Distributed Systems," a keynote at ACM SIGOPS Int'l Workshop on Large Scale Distributed Systems and Middleware (LADIS), Oct. 2009 (presentation slides available at http://www.odbms.org/download/deankeynote-ladis2009.pdf).
[11]
DeCandia, G. et al., "Dynamo: Amazon's Highly Available Key-Value Store," In Proc. ACM Symposium on Operating Systems Principles (SOSP), Oct. 2007.
[12]
Ghemawat, S., Gobioff, H., and Leung, S., "The Google File System," In Proc. ACM Symposium on Operating Systems Principles (SOSP), Oct. 2003.
[13]
Google, http://www.google.com/about/corporate/company/tech.html.
[14]
Hadoop, http://hadoop.apache.org.
[15]
HBase, http://hbase.apache.org.
[16]
HDFS, http://hadoop.apache.org/hdfs.
[17]
Javadi, B., Khorsandi, S., and Akbari, M., "Queuing Network Modeling of a Cluster-Based Parallel System," In Proc. Int'l Conf. on High Performance Computing and Grid in Asia Pacific Region, pp. 304--307, July 2004.
[18]
Kemper, B. and Mandjes, M., Approximations for the Mean Sojourn Time in a Parallel Queue, Technical Report PNA-E0901, Centrum Wiskunde & Informatica, Mar. 2009.
[19]
Kunder, M., http://www.worldwidewebsize.com.
[20]
Lentz, A., "MySQL Storage Engine Architecture," In MySQL Developer Articles, MySQL AB, May 2004 (available at http://ftp.nchu.edu.tw/MySQL/techresources/articles/storage-engine).
[21]
Lucene, http://lucene.apache.org.
[22]
Moreira, J. et al., "Scalability of the Nutch search engine," In Proc. Int'l Conf. on Supercomputing (ICS), pp. 3--12, June 2007.
[23]
Nielsenwire, "Nielsen Reports February 2010 U.S. Search Rankings," Nielsen Report, Mar. 15, 2010 (available at http://blog.nielsen.com/nielsenwire/ online mobile/nielsenreports-february-2010-u-s-search-rankings).
[24]
Ozsu, M. and Valduriez, P., "Distributed Reliability Protocols," In Book Principles of Distributed Database Systems, Prentice Hall, 2nd ed., pp. 379--400, 1999.
[25]
Richardson, M., Prakash, A., and Brill, E., "Beyond PageRank: machine learning for static ranking," In Proc. Int'l Conf. on World Wide Web (WWW), pp. 707--715, May 2006.
[26]
Shahhoseini, H. and Naderi, M., "Design Trade off on Shared Memory Clustered Massively Parallel Processing Systems," In Proc. Int'l Conf. on Computing and Information, Nov. 2000.
[27]
Stonebraker, M. et al., "MapReduce and Parallel DBMSs: Friends or Foes?," Communications of the ACM (CACM), pp. 64--71, Jan. 2010.
[28]
Whang, K. et al., An Inverted Index Storage Structure Using Subindexes and Large Objects for Tight Coupling of Information Retrieval with Database Management Systems, U.S. Patent No. 6,349,308, Feb. 19, 2002, Application No. 09/250,487, Feb. 15, 1999.
[29]
Whang, K. et al., "Odysseus: A High-Performance ORDBMS Tightly-Coupled with IR Features," In Proc. Int'l Conf. on Data Engineering (ICDE), pp. 1104--1105, Apr. 2005. This paper recieved the Best Demonstration Award.
[30]
Whang, K. et al., "DB-IR Integration Using Tight-Coupling in the Odysseus DBMS," submitted for publication, 2013.
[31]
Yang, C. et al., "Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database," In Proc. IEEE Int'l Conf. on Data Engineering (ICDE), pp. 657--668, Mar. 2010.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
June 2013
1322 pages
ISBN:9781450320375
DOI:10.1145/2463676
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 June 2013

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. db-ir tight integration
  2. massively-parallel search engine
  3. parallel dbms

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'13
Sponsor:

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)3
Reflects downloads up to 31 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2018)Two-dimensional indexing to provide one-integrated-memory view of distributed memory for a massively-parallel search engineWorld Wide Web10.1007/s11280-018-0647-1Online publication date: 13-Nov-2018
  • (2018)PARADISEWorld Wide Web10.1007/s11280-014-0312-219:3(299-322)Online publication date: 25-Dec-2018
  • (2015)DB-IR integration using tight-coupling in the Odysseus DBMSWorld Wide Web10.1007/s11280-013-0264-y18:3(491-520)Online publication date: 1-May-2015
  • (2013)Can we analyze big data inside a DBMS?Proceedings of the sixteenth international workshop on Data warehousing and OLAP10.1145/2513190.2513198(85-92)Online publication date: 28-Oct-2013

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media