[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/775047.775070acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

Pattern discovery in sequences under a Markov assumption

Published: 23 July 2002 Publication History

Abstract

In this paper we investigate the general problem of discovering recurrent patterns that are embedded in categorical sequences. An important real-world problem of this nature is motif discovery in DNA sequences. We investigate the fundamental aspects of this data mining problem that can make discovery "easy" or "hard." We present a general framework for characterizing learning in this context by deriving the Bayes error rate for this problem under a Markov assumption. The Bayes error framework demonstrates why certain patterns are much harder to discover than others. It also explains the role of different parameters such as pattern length and pattern frequency in sequential discovery. We demonstrate how the Bayes error can be used to calibrate existing discovery algorithms, providing a lower bound on achievable performance. We discuss a number of fundamental issues that characterize sequential pattern discovery in this context, present a variety of empirical results to complement and verify the theoretical analysis, and apply our methodology to real-world motif-discovery problems in computational biology.

References

[1]
Bailey, T. and Elkan C. (1995) Unsupervised learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning Journal, 21, pp. 51--83.
[2]
Baldi, P., Chauvin, Y., McClure, M. and Hunkapiller, T. (1994) Hidden Markov Models of Biological Primary Sequence Information. Proceedings of the National Academy of Science, 91, pp. 1059--1063.
[3]
Buhler, J., and Tompa, M. (2001), Proceedings of the Fifth Annual International Conference on Computational Molecular Biology (RECOMB01), Montreal, Canada, 2001, pp. 69--76
[4]
Chow, C. K. (1962) A recognition method using neighbor dependence, IRE Trans. Elect. Comput., vol. EC-11, pp. 683--690.
[5]
Chu, J. T. (1974) Error Bounds for a Contextual Recognition Procedure. IEEE Trans. Elect. Comput., Vol. C-20, No. 10.
[6]
Chudova, D. and Smyth P. (2002) Pattern discovery in sequences under a Markov assumption, Technical Report ICS-TR-02-08, University of California, Irvine.
[7]
Duda, R. O., and Hart, P. E. (1973) Pattern Recognition and Scene Analysis, New York, NY: John Wiley.
[8]
Eddy, S.R. (1995) Multiple Alignment Using Hidden Markov Models. Intelligent Systems for Molecular Biology, 3, pp. 114--120.
[9]
Van Helden, J.V., Abdre, B, and Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of Yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281, pp. 827--842
[10]
Lee, E.T. (1974) Bounds and approximations for error probabilities in character recognition. Proceedings of the International Conference on Cybernetics and Society, pp. 324--329.
[11]
Lawrence, C.E., Altshul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 208--214
[12]
Liu X, Brutlag D L, Liu J S. (2001) BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac Symp Biocomput. 127--138.
[13]
Liu, J. S., Neuwald, A.F., and Lawrence, C.E. (1995) Bayesian models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies. Journal of the American Statistical Association, 90, pp. 1156--1170.
[14]
McLachlan, G. J. (1992) Discriminant Analysis and Statistical Pattern Recognition, New York, NY: John Wiley and Sons.
[15]
Pevzner, P. A. (2000) Computational Molecular Biology: an Algorithmic Approach. Cambridge, MA: The MIT Press.
[16]
Pevzner, P. A., and Sze, S.-H. (2000) Combinatorial approaches to finding subtle signals in DNA sequences. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB'2000), pp. 269--278.
[17]
Regnier, M. and Szpankowski, W. (1998) On the approximate pattern occurrences in a text. In Compression and Complexity of Sequences 1997, pp. 253--264.
[18]
Ripley, B. D. (1996) Pattern Recognition and Neural Networks, Cambridge, UK: Cambridge University Press.
[19]
Robison, K., McGuire, A.M., Church, G.M. (1998) A comprehensive library of DNA-binding Site Matrices for 55 Proteins Applied to the Complete Escherichia coli K-12 Genome. Journal of Molecular Biology, 284:241--254.
[20]
Stormo, G.D., Hartzell, G.W. Proc. Natl. Acad. Sci., 86:1183--1187, 1989
[21]
Sze, S.-H., Gelfand, M.S., Pevzner, P. A., (2002) Finding weak motifs in DNA sequences. Pacific Symposium on Biocomputing 2002, pp. 235--246
[22]
Vapnik, V. (1998) Statistical Learning Theory, New York, NY: John Wiley.

Cited By

View all
  • (2024)The influence of observation sequence features on the performance of the Bayesian hidden Markov model: A Monte Carlo simulation studyPLOS ONE10.1371/journal.pone.031444419:12(e0314444)Online publication date: 11-Dec-2024
  • (2022)Predicting Mood from Digital Footprints Using Frequent Sequential Context Patterns FeaturesInternational Journal of Human–Computer Interaction10.1080/10447318.2022.207332139:10(2061-2075)Online publication date: 14-Jun-2022
  • (2018)Sequential fraud detection for prepaid cards using hidden Markov model divergenceExpert Systems with Applications10.1016/j.eswa.2017.08.04391(235-251)Online publication date: Jan-2018
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN:158113567X
DOI:10.1145/775047
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 July 2002

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

KDD02
Sponsor:

Acceptance Rates

KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)13
  • Downloads (Last 6 weeks)2
Reflects downloads up to 19 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)The influence of observation sequence features on the performance of the Bayesian hidden Markov model: A Monte Carlo simulation studyPLOS ONE10.1371/journal.pone.031444419:12(e0314444)Online publication date: 11-Dec-2024
  • (2022)Predicting Mood from Digital Footprints Using Frequent Sequential Context Patterns FeaturesInternational Journal of Human–Computer Interaction10.1080/10447318.2022.207332139:10(2061-2075)Online publication date: 14-Jun-2022
  • (2018)Sequential fraud detection for prepaid cards using hidden Markov model divergenceExpert Systems with Applications10.1016/j.eswa.2017.08.04391(235-251)Online publication date: Jan-2018
  • (2013)Linguistic summaries for periodicity detection based on mathematical morphology2013 IEEE Symposium on Foundations of Computational Intelligence (FOCI)10.1109/FOCI.2013.6602462(106-113)Online publication date: Apr-2013
  • (2012)Mining Frequent Trajectory Patterns for Activity Monitoring Using Radio Frequency Tag ArraysIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2011.30723:11(2138-2149)Online publication date: 1-Nov-2012
  • (2012)Pattern Discovery of User Interface Sequencing by Rehabilitation Clients with Cognitive ImpairmentsProceedings of the 2012 45th Hawaii International Conference on System Sciences10.1109/HICSS.2012.467(3001-3010)Online publication date: 4-Jan-2012
  • (2011)Natural event summarizationProceedings of the 20th ACM international conference on Information and knowledge management10.1145/2063576.2063688(765-774)Online publication date: 24-Oct-2011
  • (2011)Alternative Approach to Tree-Structured Web Log Representation and MiningProceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology - Volume 0110.1109/WI-IAT.2011.156(235-242)Online publication date: 22-Aug-2011
  • (2011)Rule generation for categorical time series with Markov assumptionsStatistics and Computing10.1007/s11222-009-9141-z21:1(1-16)Online publication date: 1-Jan-2011
  • (2009)Constructing comprehensive summaries of large event sequencesACM Transactions on Knowledge Discovery from Data10.1145/1631162.16311693:4(1-31)Online publication date: 4-Dec-2009
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media