[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2452376.2452416acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article

Pollux: towards scalable distributed real-time search on microblogs

Published: 18 March 2013 Publication History

Abstract

The last few years have witnessed a meteoric rise of microblogging platforms, such as Twitter and Tumblr. The sheer volume of the microblog data and its highly dynamic nature present unique technical challenges for the platforms that provide search services. In particular, the search service must provide real-time response to queries, and continuously update the results as new microblogs are posted. Conventional approaches either cannot keep up with the high update rate, or cannot scale well to handle the large volume of data.
We propose Pollux, a system that provides distributed real-time indexing and search service on microblogs. It adopts the distributed stream processing paradigm advocated by the recently developed platforms that are designed for real-time processing of large volume of data, such as Apache S4 and Twitter Storm. Although those open-source platforms have found successful applications in production environments, they lack some critical features required for real-time search. In particular: (1) they only implement partial fault tolerance, and do not provide lossless recovery in the event of a node failure, and (2) they do not have a facility for storing global data, which is necessary in efficiently ranking search results.
Addressing those problems, Pollux extends current platforms in two important ways. First, we propose a failover strategy that can ensure high system availability and no data/state loss in the event of a node failure. Second, Pollux adds a global storage facility that supports convenient, efficient, and reliable data storage for shared data. We describe how to apply Pollux to the task of real-time search. We implement Pollux based on Apache S4, and show through extensive experiments on a Twitter dataset that the proposed solutions are effective, and Pollux can achieve excellent scalability.

References

[1]
Apache bookkeeper. http://zookeeper.apache.org/bookkeeper/.
[2]
Apache zookeeper. http://zookeeper.apache.org/.
[3]
The engineering behind twitter's new search experience. http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html.
[4]
Memcached. http://memcached.org/.
[5]
New tweets per second record -- 25,088 tps -- set by screening of japanese movie "castle in the sky". http://techcrunch.com.
[6]
Streambase. http://streambase.com.
[7]
Tweets2011. http://trec.nist.gov/data/tweets/.
[8]
Twitter storm. https://github.com/nathanmarz/storm.
[9]
Twopcharts.com. http://www.twopcharts.com.
[10]
D. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2):120--139, 2003.
[11]
D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The design of the borealis stream processing engine. In CIDR, Asilomar, CA, January 2005.
[12]
L. Amini, H. Andrade, R. Bhagwan, F. Eskesen, R. King, P. Selo, Y. Park, and C. Venkatramani. Spc: A distributed, scalable platform for data mining. In Proceedings of the 4th international workshop on Data mining standards, services and platforms, pages 27--37. ACM, 2006.
[13]
D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer. Apache Hadoop goes realtime at Facebook. In SIGMOD, pages 1071--1080, 2011.
[14]
M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at twitter. In ICDE. IEEE, 2012.
[15]
C. Chen, F. Li, B. Ooi, and S. Wu. Ti: an efficient indexing mechanism for real-time search on tweets. In SIGMOD, pages 649--660, 2011.
[16]
B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. Pnuts: Yahoo!'s hosted data serving platform. PVLDB, 1(2):1277--1288, 2008.
[17]
G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: amazon's highly available key-value store. ACM SIGOPS Operating Systems Review, 41(6):205--220, 2007.
[18]
A. Dong, R. Zhang, P. Kolari, J. Bai, F. Diaz, Y. Chang, Z. Zheng, and H. Zha. Time is of the essence: improving recency ranking using twitter data. In WWW, pages 331--340, 2010.
[19]
Y. Duan, L. Jiang, T. Qin, M. Zhou, and H. Shum. An empirical study on learning to rank of tweets. In COLING, pages 295--303, 2010.
[20]
J. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-availability algorithms for distributed stream processing. In ICDE, pages 779--790, 2005.
[21]
J. Hwang, U. Cetintemel, and S. Zdonik. Fast and highly-available stream processing over wide area networks. In ICDE, pages 804--813, 2008.
[22]
J. Hwang, Y. Xing, U. Cetintemel, and S. Zdonik. A cooperative, self-configuring high-availability solution for stream processing. In ICDE, pages 176--185. IEEE, 2007.
[23]
B. J. Jansen, G. Campbell, and M. Gregg. Real time search user behavior. In ACM CHI, pages 3961--3966, 2010.
[24]
D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In STOC, pages 654--663. ACM, 1997.
[25]
H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, pages 591--600. ACM, 2010.
[26]
Y. Kwon, M. Balazinska, and A. Greenberg. Fault-tolerant stream processing using a distributed, replicated file system. PVLDB, 1(1):574--585, 2008.
[27]
A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35--40, 2010.
[28]
L. Lin, X. Yu, and N. Koudas. "Pollux: Towards scalable distributed real-time search on microblogs". Technical Report. http://www.yorku.ca/xhyu/ITEC-TR-201203.pdf.
[29]
D. Luckham. The power of events: an introduction to complex event processing in distributed enterprise systems. Addison-Wesley Longman Publishing Co., Inc., 2001.
[30]
L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: distributed stream computing platform. In ICDM Workshops, pages 170--177. IEEE, 2010.
[31]
D. Ongaro, S. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum. Fast crash recovery in ramcloud. In SOSP, pages 29--41. ACM, 2011.
[32]
J. Rao, E. Shekita, and S. Tata. Using paxos to build a scalable, consistent, and highly available datastore. PVLDB, 4(4):243--254, 2011.
[33]
Z. Sebepou and K. Magoutis. Scalable storage support for data stream processing. In Proceedings of IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1--6. IEEE, 2010.
[34]
M. Shah, J. Hellerstein, and E. Brewer. Highly available, fault-tolerant, parallel dataflows. In SIGMOD, pages 827--838. ACM, 2004.

Cited By

View all
  • (2021)Exploiting Intel optane persistent memory for full text searchProceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management10.1145/3459898.3463906(80-93)Online publication date: 22-Jun-2021
  • (2020)DSPBench: A Suite of Benchmark Applications for Distributed Data Stream Processing SystemsIEEE Access10.1109/ACCESS.2020.30439488(222900-222917)Online publication date: 2020
  • (2017)What people study when they study TumblrJournal of Documentation10.1108/JD-08-2016-010173:3(528-554)Online publication date: 8-May-2017
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
March 2013
793 pages
ISBN:9781450315975
DOI:10.1145/2452376
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 March 2013

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. data stream
  2. distributed processing
  3. fault tolerance
  4. microblog
  5. search

Qualifiers

  • Research-article

Funding Sources

Conference

EDBT/ICDT '13

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2021)Exploiting Intel optane persistent memory for full text searchProceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management10.1145/3459898.3463906(80-93)Online publication date: 22-Jun-2021
  • (2020)DSPBench: A Suite of Benchmark Applications for Distributed Data Stream Processing SystemsIEEE Access10.1109/ACCESS.2020.30439488(222900-222917)Online publication date: 2020
  • (2017)What people study when they study TumblrJournal of Documentation10.1108/JD-08-2016-010173:3(528-554)Online publication date: 8-May-2017
  • (2016)Systematic mapping for big data stream processing frameworks2016 Eleventh International Conference on Digital Information Management (ICDIM)10.1109/ICDIM.2016.7829760(31-36)Online publication date: Sep-2016
  • (2015)Scalable Distributed Processing of K Nearest Neighbor Queries over Moving ObjectsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2014.236404627:5(1383-1396)Online publication date: 1-May-2015
  • (2015)Finding top-k local users in geo-tagged social media data2015 IEEE 31st International Conference on Data Engineering10.1109/ICDE.2015.7113290(267-278)Online publication date: Apr-2015

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media