More Web Proxy on the site http://driver.im/

research-article

Pollux: towards scalable distributed real-time search on microblogs

Authors:

Nick KoudasAuthors Info & Claims

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

Pages 335 - 346

https://doi.org/10.1145/2452376.2452416

Published: 18 March 2013 Publication History

Abstract

The last few years have witnessed a meteoric rise of microblogging platforms, such as Twitter and Tumblr. The sheer volume of the microblog data and its highly dynamic nature present unique technical challenges for the platforms that provide search services. In particular, the search service must provide real-time response to queries, and continuously update the results as new microblogs are posted. Conventional approaches either cannot keep up with the high update rate, or cannot scale well to handle the large volume of data.

We propose Pollux, a system that provides distributed real-time indexing and search service on microblogs. It adopts the distributed stream processing paradigm advocated by the recently developed platforms that are designed for real-time processing of large volume of data, such as Apache S4 and Twitter Storm. Although those open-source platforms have found successful applications in production environments, they lack some critical features required for real-time search. In particular: (1) they only implement partial fault tolerance, and do not provide lossless recovery in the event of a node failure, and (2) they do not have a facility for storing global data, which is necessary in efficiently ranking search results.

Addressing those problems, Pollux extends current platforms in two important ways. First, we propose a failover strategy that can ensure high system availability and no data/state loss in the event of a node failure. Second, Pollux adds a global storage facility that supports convenient, efficient, and reliable data storage for shared data. We describe how to apply Pollux to the task of real-time search. We implement Pollux based on Apache S4, and show through extensive experiments on a Twitter dataset that the proposed solutions are effective, and Pollux can achieve excellent scalability.

References

[1]

Apache bookkeeper. http://zookeeper.apache.org/bookkeeper/.

[2]

Apache zookeeper. http://zookeeper.apache.org/.

[3]

The engineering behind twitter's new search experience. http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html.

[4]

Memcached. http://memcached.org/.

[5]

New tweets per second record -- 25,088 tps -- set by screening of japanese movie "castle in the sky". http://techcrunch.com.

[6]

Streambase. http://streambase.com.

[7]

Tweets2011. http://trec.nist.gov/data/tweets/.

[8]

Twitter storm. https://github.com/nathanmarz/storm.

[9]

Twopcharts.com. http://www.twopcharts.com.

[10]

D. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2):120--139, 2003.

Digital Library

[11]

D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The design of the borealis stream processing engine. In CIDR, Asilomar, CA, January 2005.

[12]

L. Amini, H. Andrade, R. Bhagwan, F. Eskesen, R. King, P. Selo, Y. Park, and C. Venkatramani. Spc: A distributed, scalable platform for data mining. In Proceedings of the 4th international workshop on Data mining standards, services and platforms, pages 27--37. ACM, 2006.

Digital Library

[13]

D. Borthakur, J. Gray, J. S. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer. Apache Hadoop goes realtime at Facebook. In SIGMOD, pages 1071--1080, 2011.

Digital Library

[14]

M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at twitter. In ICDE. IEEE, 2012.

Digital Library

[15]

C. Chen, F. Li, B. Ooi, and S. Wu. Ti: an efficient indexing mechanism for real-time search on tweets. In SIGMOD, pages 649--660, 2011.

Digital Library

[16]

B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. Pnuts: Yahoo!'s hosted data serving platform. PVLDB, 1(2):1277--1288, 2008.

Digital Library

[17]

G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: amazon's highly available key-value store. ACM SIGOPS Operating Systems Review, 41(6):205--220, 2007.

Digital Library

[18]

A. Dong, R. Zhang, P. Kolari, J. Bai, F. Diaz, Y. Chang, Z. Zheng, and H. Zha. Time is of the essence: improving recency ranking using twitter data. In WWW, pages 331--340, 2010.

Digital Library

[19]

Y. Duan, L. Jiang, T. Qin, M. Zhou, and H. Shum. An empirical study on learning to rank of tweets. In COLING, pages 295--303, 2010.

Digital Library

[20]

J. Hwang, M. Balazinska, A. Rasin, U. Cetintemel, M. Stonebraker, and S. Zdonik. High-availability algorithms for distributed stream processing. In ICDE, pages 779--790, 2005.

Digital Library

[21]

J. Hwang, U. Cetintemel, and S. Zdonik. Fast and highly-available stream processing over wide area networks. In ICDE, pages 804--813, 2008.

Digital Library

[22]

J. Hwang, Y. Xing, U. Cetintemel, and S. Zdonik. A cooperative, self-configuring high-availability solution for stream processing. In ICDE, pages 176--185. IEEE, 2007.

[23]

B. J. Jansen, G. Campbell, and M. Gregg. Real time search user behavior. In ACM CHI, pages 3961--3966, 2010.

Digital Library

[24]

D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In STOC, pages 654--663. ACM, 1997.

Digital Library

[25]

H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, pages 591--600. ACM, 2010.

Digital Library

[26]

Y. Kwon, M. Balazinska, and A. Greenberg. Fault-tolerant stream processing using a distributed, replicated file system. PVLDB, 1(1):574--585, 2008.

Digital Library

[27]

A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35--40, 2010.

Digital Library

[28]

L. Lin, X. Yu, and N. Koudas. "Pollux: Towards scalable distributed real-time search on microblogs". Technical Report. http://www.yorku.ca/xhyu/ITEC-TR-201203.pdf.

[29]

D. Luckham. The power of events: an introduction to complex event processing in distributed enterprise systems. Addison-Wesley Longman Publishing Co., Inc., 2001.

Digital Library

[30]

L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: distributed stream computing platform. In ICDM Workshops, pages 170--177. IEEE, 2010.

Digital Library

[31]

D. Ongaro, S. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum. Fast crash recovery in ramcloud. In SOSP, pages 29--41. ACM, 2011.

Digital Library

[32]

J. Rao, E. Shekita, and S. Tata. Using paxos to build a scalable, consistent, and highly available datastore. PVLDB, 4(4):243--254, 2011.

Digital Library

[33]

Z. Sebepou and K. Magoutis. Scalable storage support for data stream processing. In Proceedings of IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pages 1--6. IEEE, 2010.

Digital Library

[34]

M. Shah, J. Hellerstein, and E. Brewer. Highly available, fault-tolerant, parallel dataflows. In SIGMOD, pages 827--838. ACM, 2004.

Digital Library

Cited By

Akram SWang ZWrigstad T(2021)Exploiting Intel optane persistent memory for full text searchProceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management10.1145/3459898.3463906(80-93)Online publication date: 22-Jun-2021
https://dl.acm.org/doi/10.1145/3459898.3463906
Bordin MGriebler DMencagli GGeyer CFernandes L(2020)DSPBench: A Suite of Benchmark Applications for Distributed Data Stream Processing SystemsIEEE Access10.1109/ACCESS.2020.30439488(222900-222917)Online publication date: 2020
https://doi.org/10.1109/ACCESS.2020.3043948
Attu RTerras M(2017)What people study when they study TumblrJournal of Documentation10.1108/JD-08-2016-010173:3(528-554)Online publication date: 8-May-2017
https://doi.org/10.1108/JD-08-2016-0101
Show More Cited By

Index Terms

Pollux: towards scalable distributed real-time search on microblogs

Recommendations

Analyzing and predicting viral tweets
WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web

Twitter and other microblogging services have become indispensable sources of information in today's web. Understanding the main factors that make certain pieces of information spread quickly in these platforms can be decisive for the analysis of ...
Analysis of Microblog Rumors and Correction Texts for Disaster Situations
IIWAS '13: Proceedings of International Conference on Information Integration and Web-based Applications & Services

Microblogging systems such as Twitter have become popular. They are especially useful and helpful for users in disaster situations. Microblogs have facilitated the spread of information of all kinds, even rumors. Rumors block adequate information ...
Hashtag retrieval in a microblogging environment
SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

Microblog services let users broadcast brief textual messages to people who "follow" their activity. Often these posts contain terms called hashtags, markers of a post's meaning, audience, etc. This poster treats the following problem: given a user's ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

March 2013

793 pages

ISBN:9781450315975

DOI:10.1145/2452376

General Chair:
Giovanna Guerrini
Università di Genova, Italy
,
Program Chair:
Norman W. Paton
University of Manchester, UK

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 March 2013

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Program for New Century Excellent Talents in University
Natural Science Foundation of Shandong Province of China Grant
Natural Sciences and Engineering Research Council of Canada
National Natural Science Foundation of China
Independent Innovation Foundation of Shandong University
SAICT Experts Program

Conference

EDBT/ICDT '13

EDBT/ICDT '13: Joint 2013 EDBT/ICDT Conferences

March 18 - 22, 2013

Genoa, Italy

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

6
Total Citations
View Citations
360
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 13 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Akram SWang ZWrigstad T(2021)Exploiting Intel optane persistent memory for full text searchProceedings of the 2021 ACM SIGPLAN International Symposium on Memory Management10.1145/3459898.3463906(80-93)Online publication date: 22-Jun-2021
https://dl.acm.org/doi/10.1145/3459898.3463906
Bordin MGriebler DMencagli GGeyer CFernandes L(2020)DSPBench: A Suite of Benchmark Applications for Distributed Data Stream Processing SystemsIEEE Access10.1109/ACCESS.2020.30439488(222900-222917)Online publication date: 2020
https://doi.org/10.1109/ACCESS.2020.3043948
Attu RTerras M(2017)What people study when they study TumblrJournal of Documentation10.1108/JD-08-2016-010173:3(528-554)Online publication date: 8-May-2017
https://doi.org/10.1108/JD-08-2016-0101
Alayyoub MYazici AKarakaya Z(2016)Systematic mapping for big data stream processing frameworks2016 Eleventh International Conference on Digital Information Management (ICDIM)10.1109/ICDIM.2016.7829760(31-36)Online publication date: Sep-2016
https://doi.org/10.1109/ICDIM.2016.7829760
Yu ZLiu YYu XPu K(2015)Scalable Distributed Processing of K Nearest Neighbor Queries over Moving ObjectsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2014.236404627:5(1383-1396)Online publication date: 1-May-2015
https://doi.org/10.1109/TKDE.2014.2364046
Jiang JLu HYang BCui B(2015)Finding top-k local users in geo-tagged social media data2015 IEEE 31st International Conference on Data Engineering10.1109/ICDE.2015.7113290(267-278)Online publication date: Apr-2015
https://doi.org/10.1109/ICDE.2015.7113290

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents