Abstract
Sequence pattern mining is the mining of special and representative features hidden in sequence data. Recently, it has been attracting a lot of attention, especially in the fields of bioinformatics and spatio-temporal trajectory mining. Observing that many sequence data are born with uncertainties and huge sequence data are increasingly generated and accumulated, this paper aims to discover the hidden features from a large amount of uncertain sequence data. Specifically, Probabilistic Suffix Tree (PST) is an implementation of Variable-length Markov Chain (VMM) that has been widely applied in sequence data mining. However, the conventional PST construction algorithm is not for the mining of uncertain data and cannot bear the computing of huge data. Thus, to mine a large amount of sequence data with uncertainties, this paper proposes the uPST\(_{MR}^+\) algorithm on the Hadoop platform to fully utilize the computing power and storage capacity of cloud computing. The proposed uPST\(_{MR}^+\) algorithm constructs a PST in a progressive, multi-layered, and iterative manner so as to avoid excessive learning patterns and balance the overhead of distributed computing. In addition, to prevent the drag on overall performance owing to multiple scanning of the entire sequence data, we trade space for time by using a NodeArray data structure to store the intermediate statistical results to reduce disk I/O. To verify the performance of uPST\(_{MR}^{+}\), we conduct several experiments. The experimental results show that uPST\(_{MR}^{+}\) outperforms the naive approach significantly and show good scalability and stability. Also, although using NodeArray costs a little extra memory, the execution time is significantly lowered.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Roux, C., Bernard, R.T.F.: Home range size, spatial distribution and habitat use of elephants in two enclosed game reserves in the eastern cape province, south Africa. Afr. J. Ecol. 47(2), 146–153 (2007)
Jin, J.-F., Liu, B.-F., Yu, X., Lu, C.-H.: Wintering and migration of black-faced spoonbill in Xinghua Bay, Fujian Province. Chin. J. Zool. 44(1), 47–53 (2009)
Wang, Y., Lim, E.-P., Hwang, S.-Y.: Efficient mining of group patterns from user movement data. DKE 57(3), 240–282 (2006)
Li, Y., Han, J., Yang, J.: Clustering moving objects. In: ACM SIGKDD, pp. 617–622 (2004)
Aggarwal, C.C., Yu, P.S.: A survey of uncertain data algorithms and applications. IEEE TKDE 21(5), 609–623 (2008)
Gezici, S.: A survey on wireless position estimation. Wireless Pers. Commun. 44(3), 263–282 (2007)
Gu, Y., Lo, A., Niemegeers, I.: A survey of indoor positioning systems for wireless personal networks. IEEE Commun. Surv. Tutorials 11(1), 13–32 (2009)
Wang, J., Luo, Y., Zhao, Y., Le, J.: A survey on privacy preserving data mining. In: 1st International Workshop on Database Technology and Applications (2009)
Chen, R., Fung, B.C.M., Mohammed, N., Desai, B.C., Wang, K.: Privacy-preserving trajectory data publishing by local suppression. Inf. Sci. Spec. Issue Data Min. Inf. Secur. 231, 83–91 (2013)
Kuijpers, B., Othman, W.: Trajectory databases: data models, uncertainty and complete query languages. J. Comput. Syst. Sci. 76(7), 538–560 (2009)
Yang, J., Hu, M.: TrajPattern: mining sequential patterns from imprecise trajectories of mobile objects. In: Ioannidis, Y., Scholl, M.H., Schmidt, J.W., Matthes, F., Hatzopoulos, M., Böhm, K., Kemper, A., Grust, T., Böhm, C. (eds.) EDBT 2006. LNCS, vol. 3896, pp. 664–681. Springer, Heidelberg (2006)
Leung, C.K.-S., Brajczuk, D.A.: Efficient algorithms for mining constrained frequent patterns from uncertain data. In: ACM SIGKDD Workshop on knowledge, Discovery from Uncertain Data (2009)
Ron, D., Singer, Y., Tishby, N.: Learning probabilistic automata with variable memory length. In: 7th Annual Conference on Computational Learning Theory, pp. 35–46 (1994)
Bejerano, G., Yona, G.: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43 (2001)
Sun, P., Chawla, S., Arunasalam, B.: Mining for outliers in sequential databases. In: SDM (2006)
Pampapathi, R.M.: Annotated suffix trees for text modelling and classification. Dissertation, University of London (2008)
Dubnov, S., Assayag, G., Lartillot, O., Bejerano, G.: Using machine-learning methods for musical style modeling. J. Comput. 36(10), 73–80 (2003)
Ghoting, A., Makarychev, K.: I/O efficient algorithms for serial and parallel suffix tree construction. ACM TODS 35(4), 1–37 (2010)
Gao, F., Zaki, M.J.: Psist: a scalable approach to indexing protein structures using suffix trees. J. Parallel Distrib. Comput. 68(1), 54–63 (2008)
Pellicer, S., Chen, G., Chan, K.C.C., Pan, Y.: Distributed sequence alignment applications for the public computing architecture. IEEE T-NB 7(1), 35–43 (2008)
Tsai, H.-P., Yang, D.-N., Chen, M.-S.: Mining group movement patterns for tracking moving objects efficiently. IEEE TKDE 23(2), 266–281 (2011)
Acknowledgment
The work was supported in part by the National Science Council of Taiwan, R.O.C., under Contracts NSC102-2220-E-005-007.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Sun, ZY., Tsai, MC., Tsai, HP. (2014). Mining Uncertain Sequence Data on Hadoop Platform. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science(), vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_20
Download citation
DOI: https://doi.org/10.1007/978-3-319-13186-3_20
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13185-6
Online ISBN: 978-3-319-13186-3
eBook Packages: Computer ScienceComputer Science (R0)