[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Optimal string clustering based on a Laplace-like mixture and EM algorithm on a set of strings

Published: 01 December 2019 Publication History

Abstract

In this study, we address the problem of clustering string data in an unsupervised manner by developing a theory of a mixture model and an EM algorithm for strings based on probability theory on a topological monoid of strings developed in our previous studies. We begin with introducing a parametric probability distribution on a set of strings, which has location and dispersion parameters of a string and positive real number. We develop an iteration algorithm for estimating the parameters of the mixture model of the distributions introduced and demonstrate that our algorithm converges to the EM algorithm, which cannot be explicitly written for this mixture model, with probability one and strongly consistently estimates its parameters as the numbers of observed strings and iterations increase. We finally derive a procedure for unsupervised string clustering that is asymptotically optimal in the sense that the posterior probability of making correct classifications is maximized.

References

[1]
M.A. Aizerman, E.M. Braverman, L.I. Rozoner, Theoretical foundations of the potential function method in pattern recognition learning, Autom. Remote Control 25 (1964) 821–837.
[2]
H. Akaike, Information theory and an extension of the maximum likelihood principle, in: B.N. Petrov, F. Csaki (Eds.), 2nd International Symposium on Information Theory, Akademiai Kiado, Budapest, 1973, pp. 267–281.
[3]
T. Akutsu, H. Bannai, S. Miyano, S. Ott, On the complexity of deriving position specific score matrices from positive and negative sequences, Discrete Appl. Math. 155 (2007) 676–685.
[4]
L. Bergroth, H. Hakonen, T. Raita, A survey of longest common subsequence algorithms, in: String Processing and Information Retrieval (SPIRe 2000): 7th International Symposium, IEEE, 2000, pp. 39–48.
[5]
B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: D. Houssler (Ed.), Proc. 5th Annu. Workshop Comput. Learn. Theory, 1992, pp. 144–152.
[6]
B.M. Brown, Statistical uses of the spatial median, J. R. Stat. Soc., Ser. B, Stat. Methodol. 45 (1983) 25–30.
[7]
G. Claeskens, N.L. Hjort, Model Selection and Model Averaging, Cambridge University Press, Cambridge, UK, 2008.
[8]
C. Cortes, V.N. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[9]
F.J. Damerau, A technique for computer detection and correction of spelling errors, Commun. ACM 7 (3) (1964) 171–176.
[10]
C. de la Higuera, F. Casacuberta, Topology of strings: median string is NP-complete, Theor. Comput. Sci. 230 (1) (2000) 39–48.
[11]
A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc., Ser. B, Stat. Methodol. 39 (1) (1977) 1–38.
[12]
D.L. Donoho, M. Gasko, Breakdown properties of location estimates based on halfspace depth and projected outlyingness, Ann. Stat. 20 (4) (1992) 1803–1827.
[13]
H. Drucker, C.J.C. Burges, L. Kaufman, A. Smola, V. Vapnik, Support vector regression machines, in: M.C. Mozer, M.I. Jordan, T. Petsche (Eds.), Adv. Neural Inf. Process. Syst. 9, MIT Press, Cambridge, MA, 1997, pp. 155–161.
[14]
E. Fix, J.L. Hodges Jr., Discriminatory Analysis, Nonparametric Discrimination, Consistency Properties, Technical Report 4 U.S. Air Force School of Aviation Medicine, Randolph Field, Texas, 1951.
[15]
E.W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifications, Biometrics 21 (1965) 768–769.
[16]
R.R. Gutell, J.J. Cannone, D. Konings, D. Gautheret, Predicting U-turns in ribosomal RNA with comparative sequence analysis, J. Mol. Biol. 300 (4) (2000) 791–803.
[17]
R.W. Hamming, Error detecting and error correcting codes, Bell Syst. Tech. J. 29 (2) (1950) 147–160.
[18]
A. Hardy, On the number of clusters, Comput. Stat. Data Anal. 23 (1) (1996) 83–96.
[19]
D. Haussler, Convolution Kernels on Discrete Structures, Technical Report UCSC-CRL-99-10 Department of Computer Science, University of California, Santa Cruz, Santa Cruz, CA, 1999.
[20]
I.L. Hofacker, P.F. Stadler, RNA secondary structures, in: T. Lengauer (Ed.), Bioinformatics - From Genomes to Therapies, vol. 1, Wiley-VCH, Weinheim, Germany, 2007, pp. 439–489.
[21]
A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. 31 (8) (2010) 651–666.
[22]
M.A. Jaro, Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida, J. Am. Stat. Assoc. 84 (406) (1989) 414–420.
[23]
X. Jiang, K. Abegglen, H. Bunke, J. Csirik, Dynamic computation of generalised median strings, Pattern Anal. Appl. 6 (3) (2003) 185–193.
[24]
X. Jiang, J. Wentker, M. Ferrer, Generalized median string computation by means of string embedding in vector spaces, Pattern Recognit. Lett. 33 (7) (2012) 842–852.
[25]
N.L. Johnson, A.W. Kemp, S. Kotz, Univariate Discrete Distributions, third edition, Wiley, Hoboken, NJ, 2005.
[26]
N.L. Johnson, S. Kotz, N. Balakrishnan, Continuous Univariate Distributions, vol. 1, second edition, Wiley, Hoboken, NJ, 1994.
[27]
N.L. Johnson, S. Kotz, N. Balakrishnan, Continuous Univariate Distributions, vol. 2, second edition, Wiley, Hoboken, NJ, 1995.
[28]
A.M. Kagan, C.R. Rao, Y.V. Linnik, Characterization Problems in Mathematical Statistics, Wiley, Hoboken, NJ, 1973.
[29]
I. Kalvari, J. Argasinska, N. Quinones-Olvera, E.P. Nawrocki, E. Rivas, S.R. Eddy, A. Bateman, R.D. Finn, A.I. Petrov, Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families, Nucleic Acids Res. 46 (D1) (2017) D335–D342.
[30]
T. Kohonen, Median strings, Pattern Recognit. Lett. 3 (5) (1985) 309–313.
[31]
S. Kotz, T. Kozubowski, K. Podgorski, The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Exonomics, Engineering, and Finance, Birkhäuser, Boston, 2001.
[32]
H. Koyano, M. Hayashida, T. Akutsu, Maximum margin classifier working in a set of strings, Proc. R. Soc. A 472 (2187) (2016).
[33]
H. Koyano, H. Kishino, Quantifying biodiversity and asymptotics for a sequence of random strings, Phys. Rev. E 81 (6) (2010).
[34]
H. Koyano, T. Tsubouchi, H. Kishino, T. Akutsu, Archaeal β diversity patterns under the seafloor along geochemical gradients, J. Geophys. Res., Biogeosci. 119 (9) (2014) 1770–1788.
[35]
Koyano, H.; Yano, K. : Evolutionary model of a population of DNA sequences through the interaction with an environment and its application to speciation analysis. arXiv:1706.01182 [q-bio.PE].
[36]
P.-S. Laplace, Mémoire sur la probabilité des causes par les événements, Mémoires de l'Academie Royale des Sciences Presentés par Divers Savants 6 (1774) 621–656.
[37]
C. Leslie, E. Eskin, J. Weston, W.S. Noble, Mismatch string kernels for SVM protein classification, in: S. Becker, S. Thrun, K. Obermayer (Eds.), Adv. Neural Inf. Process. Syst. 15, MIT Press, Cambridge, MA, 2003, pp. 1417–1424.
[38]
C. Leslie, R. Kuang, Fast string kernels using inexact matching for protein sequences, J. Mach. Learn. Res. 5 (2004) 1435–1455.
[39]
C.S. Leslie, E. Eskin, W.S. Noble, The spectrum kernel: a string kernel for SVM protein classification, in: R.B. Altman, A.K. Dunker, L. Hunter, T.E. Klein, K. Lauderdale (Eds.), Proc. Pacific Symp. Biocomput., vol. 7, 2002, pp. 566–575.
[40]
V.I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Dokl. Akad. Nauk SSSR 163 (4) (1965) 845–848.
[41]
H. Li, T. Jiang, A class of edit kernels for SVMs to predict translation initiation sites in eukaryotic mRNAs, J. Comput. Biol. 12 (6) (2005) 702–718.
[42]
S. Lloyd, Least square quantization in PCM, IEEE Trans. Inf. Theory 28 (2) (1982) 129–137. Bell Telephone Laboratories Paper in 1957.
[43]
H. Lodhi, J. Shawe-Taylor, N. Cristianini, C. Watkins, Text classication using string kernel, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Adv. Neural Inf. Process. Syst. 13, MIT Press, Cambridge, MA, 2001.
[44]
J.B. MacQueen, Some methods of classification and analysis of multivariate observations, in: L.M. Le Cam, J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, CA, 1967, pp. 281–297.
[45]
F.H.C. Marriott, Practical problems in a method of cluster analysis, Biometrics 21 (1971) 501–514.
[46]
C.D. Martínez-Hinarejos, A. Juan, F. Casacuberta, Use of median string for classification, in: Proceedings of the 15th International Conference on Pattern Recognition, 2000, vol. 2, IEEE, 2000, pp. 903–906.
[47]
C.D. Martínez-Hinarejos, A. Juan, F. Casacuberta, Median strings for k-nearest neighbour classification, Pattern Recognit. Lett. 24 (1) (2003) 173–181.
[48]
S. Maurer-Stroh, M. Debulpaep, N. Kuemmerer, M.L. de la Paz, I.C. Martins, J. Reumers, K.L. Morris, A. Copland, L. Serpell, L. Serrano, J.W.H. Schymkowitz, F. Rousseau, Exploring the sequence determinants of amyloid structure using position-specific scoring matrices, Nat. Methods 7 (2010) 237–242.
[49]
G. McLachlan, T. Krishnan, The EM Algorithm and Extensions, Wiley, Hoboken, NJ, 1997.
[50]
G. McLachlan, D. Peel, Finite Mixture Models, Wiley, Hoboken, NJ, 2004.
[51]
G. Navarro, A guided tour to approximate string matching, ACM Comput. Surv. 33 (1) (2001) 31–88.
[52]
F. Nicolas, E. Rivals, Complexities of the centre and median string problems, in: R. Baeza-Yates, E. Chávez, M. Crochemore (Eds.), Combinatorial Pattern Matching, Springer, Berlin, 2003, pp. 315–327.
[53]
F. Nicolas, E. Rivals, Hardness results for the center and median string problems under the weighted and unweighted edit distances, J. Discret. Algorithms 3 (2) (2005) 390–415.
[54]
H. Oja, Descriptive statistics for multivariate distributions, Stat. Probab. Lett. 1 (6) (1983) 327–332.
[55]
H. Oja, A. Niinimaa, Asymptotic properties of the generalized median in the case of multivariate normality, J. R. Stat. Soc., Ser. B, Stat. Methodol. 47 (1985) 372–377.
[56]
C. Olivares-Rodríguez, J. Oncina, A stochastic approach to median string computation, in: N. da, Vitoria Lobo, T. Kasparis, J.T. Roli, F. Kwok, M. Georgiopoulos, G.C. Anagnostopoulos, M. Loog (Eds.), Structural, Syntactic, and Statistical Pattern Recognition, Springer, Berlin, 2008, pp. 431–440.
[57]
G. Paaß, E. Leopold, M. Larson, J. Kindermann, S. Eickeler, SVM classification using sequences of phonemes and syllables, in: T. Elomaa, H. Mannila, H. Toivonen (Eds.), Proc. 6th Eur. Conf. Principles Data Min. Knowl. Discov., Springer, 2002, pp. 373–384.
[58]
K. Pearson, Contributions to the mathematical theory of evolution, Philos. Trans. R. Soc. Lond. A 185 (1894) 71–110.
[59]
M.D. Perlman, On the strong consistency of approximate maximum likelihood estimators, in: L.M. Le Cam, J. Neyman, E.L. Scott (Eds.), Proc. 6th Berkeley Symp. Math. Stat. Prob., vol. 1, University of California Press, Berkeley, CA, 1972, pp. 263–281.
[60]
C.E. Rasmussen, The infinite Gaussian mixture model, in: S.A. Solla, T.K. Leen, K.-R. Müller (Eds.), Adv. Neural Inf. Process. Syst., vol. 12, MIT Press, Cambridge, MA, 2000, pp. 554–560.
[61]
J. Rissanen, Modeling by shortest data description, Automatica 14 (5) (1978) 465–658.
[62]
H. Saigo, J.-P. Vert, N. Ueda, T. Akutsu, Protein homology detection using string alignment kernels, Bioinformatics 20 (11) (2004) 1682–1689.
[63]
A.A. Schäffer, Y.I. Wolf, C.P. Ponting, E.V. Koonin, L. Aravind, S.F. Altschul, IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices, Bioinformatics 15 (12) (1999) 1000–1011.
[64]
G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (2) (1978) 461–464.
[65]
H. Steinhaus, Sur la division des corps matériels en parties, Bull. Acad. Pol. Sci. 4 (12) (1956) 801–804.
[66]
R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic, J. Roy. Statist. Soc. Ser. B 63 (2) (2001) 411–423.
[67]
W.S. Torgerson, Multidimensional scaling: I. Theory and method, Psychometrika 17 (4) (1952) 401–419.
[68]
J.W. Tukey, Mathematics and the picturing of data, in: Proceedings of the International Congress of Mathematicians, Vancouver, 1974, vol. 2, 1975, pp. 523–531.
[69]
V.N. Vapnik, Statistical Learning Theory, Wiley, Hoboken, NJ, 1998.
[70]
J.-P. Vert, Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings, in: R.B. Altman, A.K. Dunker, L. Hunter, T.E. Klein, K. Lauderdale (Eds.), Proc. Pacific Symp. Biocomput., vol. 7, 2002, pp. 649–660.
[71]
S.V.N. Vishwanathan, A.J. Smola, Fast kernels for string and tree matching, in: K. Tsuda, B. Schölkopf, J.-P. Vert (Eds.), Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004, pp. 113–130.
[72]
A. Wald, Note on the consistency of the maximum likelihood estimate, Ann. Math. Stat. 29 (1949) 595–601.
[73]
C.S. Wallace, D.M. Boulton, An information measure for classification, Comput. J. 11 (2) (1968) 185–194.
[74]
S. Washietl, I.L. Hofacker, M. Lukasser, A. Hüttenhofer, P.F. Stadler, Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome, Nat. Biotechnol. 23 (11) (2005) 1383–1390.
[75]
M.S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman and Hall, London, 1995.
[76]
C. Watkins, Dynamic Alignment Kernels, Technical Report CSD-TR-98-11 Computer Science Department, University of London, Royal Holloway, 1999.
[77]
W.E. Winkler, String comparator metrics and enhanced decision rules in the Fellegi–Sunter model of record linkage, in: Proceedings of the Section on Survey Research Methods, American Statistical Association, 1990, pp. 354–359.
[78]
J.H. Wolfe, Pattern clustering by multivariate mixture analysis, Multivar. Behav. Res. 5 (3) (1970) 329–350.
[79]
A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, K.-R. Müller, Engineering support vector machine kernels that recognize translation initiation sites, Bioinformatics 16 (9) (2000) 799–807.

Index Terms

  1. Optimal string clustering based on a Laplace-like mixture and EM algorithm on a set of strings
          Index terms have been assigned to the content through auto-classification.

          Recommendations

          Comments

          Please enable JavaScript to view thecomments powered by Disqus.

          Information & Contributors

          Information

          Published In

          cover image Journal of Computer and System Sciences
          Journal of Computer and System Sciences  Volume 106, Issue C
          Dec 2019
          145 pages

          Publisher

          Academic Press, Inc.

          United States

          Publication History

          Published: 01 December 2019

          Author Tags

          1. Unsupervised string clustering
          2. Topological monoid of strings
          3. Probability theory
          4. Statistical asymptotics
          5. Convergence of a sequence of algorithms

          Qualifiers

          • Research-article

          Contributors

          Other Metrics

          Bibliometrics & Citations

          Bibliometrics

          Article Metrics

          • 0
            Total Citations
          • 0
            Total Downloads
          • Downloads (Last 12 months)0
          • Downloads (Last 6 weeks)0
          Reflects downloads up to 06 Jan 2025

          Other Metrics

          Citations

          View Options

          View options

          Media

          Figures

          Other

          Tables

          Share

          Share

          Share this Publication link

          Share on social media