Mining Uncertain Sequence Data on Hadoop Platform

Zi-Yun Sun¹¹,
Ming-Che Tsai¹¹ &
Hsiao-Ping Tsai¹¹

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 8643))

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

2323 Accesses
2 Citations

Abstract

Sequence pattern mining is the mining of special and representative features hidden in sequence data. Recently, it has been attracting a lot of attention, especially in the fields of bioinformatics and spatio-temporal trajectory mining. Observing that many sequence data are born with uncertainties and huge sequence data are increasingly generated and accumulated, this paper aims to discover the hidden features from a large amount of uncertain sequence data. Specifically, Probabilistic Suffix Tree (PST) is an implementation of Variable-length Markov Chain (VMM) that has been widely applied in sequence data mining. However, the conventional PST construction algorithm is not for the mining of uncertain data and cannot bear the computing of huge data. Thus, to mine a large amount of sequence data with uncertainties, this paper proposes the uPST\(_{MR}^+\) algorithm on the Hadoop platform to fully utilize the computing power and storage capacity of cloud computing. The proposed uPST\(_{MR}^+\) algorithm constructs a PST in a progressive, multi-layered, and iterative manner so as to avoid excessive learning patterns and balance the overhead of distributed computing. In addition, to prevent the drag on overall performance owing to multiple scanning of the entire sequence data, we trade space for time by using a NodeArray data structure to store the intermediate statistical results to reduce disk I/O. To verify the performance of uPST\(_{MR}^{+}\), we conduct several experiments. The experimental results show that uPST\(_{MR}^{+}\) outperforms the naive approach significantly and show good scalability and stability. Also, although using NodeArray costs a little extra memory, the execution time is significantly lowered.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: GBP 19.95; Price includes VAT (United Kingdom)

eBook: GBP 71.50; Price includes VAT (United Kingdom)

Softcover Book: GBP 89.99; Price includes VAT (United Kingdom)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Minits-AllOcc: An Efficient Algorithm for Mining Timed Sequential Patterns

Distributed Sequential Pattern Mining in Large Scale Uncertain Databases

Mining Uncertain Sequential Patterns in Iterative MapReduce

References

Roux, C., Bernard, R.T.F.: Home range size, spatial distribution and habitat use of elephants in two enclosed game reserves in the eastern cape province, south Africa. Afr. J. Ecol. 47(2), 146–153 (2007)
Article Google Scholar
Jin, J.-F., Liu, B.-F., Yu, X., Lu, C.-H.: Wintering and migration of black-faced spoonbill in Xinghua Bay, Fujian Province. Chin. J. Zool. 44(1), 47–53 (2009)
Google Scholar
Wang, Y., Lim, E.-P., Hwang, S.-Y.: Efficient mining of group patterns from user movement data. DKE 57(3), 240–282 (2006)
Article Google Scholar
Li, Y., Han, J., Yang, J.: Clustering moving objects. In: ACM SIGKDD, pp. 617–622 (2004)
Google Scholar
Aggarwal, C.C., Yu, P.S.: A survey of uncertain data algorithms and applications. IEEE TKDE 21(5), 609–623 (2008)
Google Scholar
Gezici, S.: A survey on wireless position estimation. Wireless Pers. Commun. 44(3), 263–282 (2007)
Article Google Scholar
Gu, Y., Lo, A., Niemegeers, I.: A survey of indoor positioning systems for wireless personal networks. IEEE Commun. Surv. Tutorials 11(1), 13–32 (2009)
Article Google Scholar
Wang, J., Luo, Y., Zhao, Y., Le, J.: A survey on privacy preserving data mining. In: 1st International Workshop on Database Technology and Applications (2009)
Google Scholar
Chen, R., Fung, B.C.M., Mohammed, N., Desai, B.C., Wang, K.: Privacy-preserving trajectory data publishing by local suppression. Inf. Sci. Spec. Issue Data Min. Inf. Secur. 231, 83–91 (2013)
MATH Google Scholar
Kuijpers, B., Othman, W.: Trajectory databases: data models, uncertainty and complete query languages. J. Comput. Syst. Sci. 76(7), 538–560 (2009)
Article MathSciNet Google Scholar
Yang, J., Hu, M.: TrajPattern: mining sequential patterns from imprecise trajectories of mobile objects. In: Ioannidis, Y., Scholl, M.H., Schmidt, J.W., Matthes, F., Hatzopoulos, M., Böhm, K., Kemper, A., Grust, T., Böhm, C. (eds.) EDBT 2006. LNCS, vol. 3896, pp. 664–681. Springer, Heidelberg (2006)
Chapter Google Scholar
Leung, C.K.-S., Brajczuk, D.A.: Efficient algorithms for mining constrained frequent patterns from uncertain data. In: ACM SIGKDD Workshop on knowledge, Discovery from Uncertain Data (2009)
Google Scholar
Ron, D., Singer, Y., Tishby, N.: Learning probabilistic automata with variable memory length. In: 7th Annual Conference on Computational Learning Theory, pp. 35–46 (1994)
Google Scholar
Bejerano, G., Yona, G.: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43 (2001)
Article Google Scholar
Sun, P., Chawla, S., Arunasalam, B.: Mining for outliers in sequential databases. In: SDM (2006)
Google Scholar
Pampapathi, R.M.: Annotated suffix trees for text modelling and classification. Dissertation, University of London (2008)
Google Scholar
Dubnov, S., Assayag, G., Lartillot, O., Bejerano, G.: Using machine-learning methods for musical style modeling. J. Comput. 36(10), 73–80 (2003)
Article Google Scholar
Ghoting, A., Makarychev, K.: I/O efficient algorithms for serial and parallel suffix tree construction. ACM TODS 35(4), 1–37 (2010)
Article Google Scholar
Gao, F., Zaki, M.J.: Psist: a scalable approach to indexing protein structures using suffix trees. J. Parallel Distrib. Comput. 68(1), 54–63 (2008)
Article MATH Google Scholar
Pellicer, S., Chen, G., Chan, K.C.C., Pan, Y.: Distributed sequence alignment applications for the public computing architecture. IEEE T-NB 7(1), 35–43 (2008)
Google Scholar
Tsai, H.-P., Yang, D.-N., Chen, M.-S.: Mining group movement patterns for tracking moving objects efficiently. IEEE TKDE 23(2), 266–281 (2011)
Google Scholar

Download references

Acknowledgment

The work was supported in part by the National Science Council of Taiwan, R.O.C., under Contracts NSC102-2220-E-005-007.

Author information

Authors and Affiliations

Department of Electrical Engineering, National Chung Hsing University, Taichung, Taiwan, Republic of China
Zi-Yun Sun, Ming-Che Tsai & Hsiao-Ping Tsai

Authors

Zi-Yun Sun
View author publications
You can also search for this author in PubMed Google Scholar
Ming-Che Tsai
View author publications
You can also search for this author in PubMed Google Scholar
Hsiao-Ping Tsai
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hsiao-Ping Tsai .

Editor information

Editors and Affiliations

National Chiao Tung University, Hsinchu, Taiwan
Wen-Chih Peng
Google Research, Mountain View, California, USA
Haixun Wang
University of Melbourne, Melbourne, Victoria, Australia
James Bailey
National Cheng Kung University, Tainan, Taiwan
Vincent S. Tseng
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu Bao Ho
Nanjing University, Nanjing, China
Zhi-Hua Zhou
National Chengchi University, Taipei, Taiwan
Arbee L.P. Chen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Sun, ZY., Tsai, MC., Tsai, HP. (2014). Mining Uncertain Sequence Data on Hadoop Platform. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science(), vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_20

Download citation

DOI: https://doi.org/10.1007/978-3-319-13186-3_20
Published: 26 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13185-6
Online ISBN: 978-3-319-13186-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics