More Web Proxy on the site http://driver.im/

research-article

Optimal string clustering based on a Laplace-like mixture and EM algorithm on a set of strings

Authors:

Hitoshi Koyano,

Morihiro Hayashida,

Tatsuya AkutsuAuthors Info & Claims

Volume 106, Issue C

Pages 94 - 128

https://doi.org/10.1016/j.jcss.2019.07.003

Published: 01 December 2019 Publication History

Abstract

In this study, we address the problem of clustering string data in an unsupervised manner by developing a theory of a mixture model and an EM algorithm for strings based on probability theory on a topological monoid of strings developed in our previous studies. We begin with introducing a parametric probability distribution on a set of strings, which has location and dispersion parameters of a string and positive real number. We develop an iteration algorithm for estimating the parameters of the mixture model of the distributions introduced and demonstrate that our algorithm converges to the EM algorithm, which cannot be explicitly written for this mixture model, with probability one and strongly consistently estimates its parameters as the numbers of observed strings and iterations increase. We finally derive a procedure for unsupervised string clustering that is asymptotically optimal in the sense that the posterior probability of making correct classifications is maximized.

References

[1]

M.A. Aizerman, E.M. Braverman, L.I. Rozoner, Theoretical foundations of the potential function method in pattern recognition learning, Autom. Remote Control 25 (1964) 821–837.

[2]

H. Akaike, Information theory and an extension of the maximum likelihood principle, in: B.N. Petrov, F. Csaki (Eds.), 2nd International Symposium on Information Theory, Akademiai Kiado, Budapest, 1973, pp. 267–281.

[3]

T. Akutsu, H. Bannai, S. Miyano, S. Ott, On the complexity of deriving position specific score matrices from positive and negative sequences, Discrete Appl. Math. 155 (2007) 676–685.

[4]

L. Bergroth, H. Hakonen, T. Raita, A survey of longest common subsequence algorithms, in: String Processing and Information Retrieval (SPIRe 2000): 7th International Symposium, IEEE, 2000, pp. 39–48.

[5]

B.E. Boser, I.M. Guyon, V.N. Vapnik, A training algorithm for optimal margin classifiers, in: D. Houssler (Ed.), Proc. 5th Annu. Workshop Comput. Learn. Theory, 1992, pp. 144–152.

[6]

B.M. Brown, Statistical uses of the spatial median, J. R. Stat. Soc., Ser. B, Stat. Methodol. 45 (1983) 25–30.

[7]

G. Claeskens, N.L. Hjort, Model Selection and Model Averaging, Cambridge University Press, Cambridge, UK, 2008.

[8]

C. Cortes, V.N. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.

[9]

F.J. Damerau, A technique for computer detection and correction of spelling errors, Commun. ACM 7 (3) (1964) 171–176.

Digital Library

[10]

C. de la Higuera, F. Casacuberta, Topology of strings: median string is NP-complete, Theor. Comput. Sci. 230 (1) (2000) 39–48.

[11]

A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc., Ser. B, Stat. Methodol. 39 (1) (1977) 1–38.

[12]

D.L. Donoho, M. Gasko, Breakdown properties of location estimates based on halfspace depth and projected outlyingness, Ann. Stat. 20 (4) (1992) 1803–1827.

[13]

H. Drucker, C.J.C. Burges, L. Kaufman, A. Smola, V. Vapnik, Support vector regression machines, in: M.C. Mozer, M.I. Jordan, T. Petsche (Eds.), Adv. Neural Inf. Process. Syst. 9, MIT Press, Cambridge, MA, 1997, pp. 155–161.

[14]

E. Fix, J.L. Hodges Jr., Discriminatory Analysis, Nonparametric Discrimination, Consistency Properties, Technical Report 4 U.S. Air Force School of Aviation Medicine, Randolph Field, Texas, 1951.

[15]

E.W. Forgy, Cluster analysis of multivariate data: efficiency versus interpretability of classifications, Biometrics 21 (1965) 768–769.

[16]

R.R. Gutell, J.J. Cannone, D. Konings, D. Gautheret, Predicting U-turns in ribosomal RNA with comparative sequence analysis, J. Mol. Biol. 300 (4) (2000) 791–803.

[17]

R.W. Hamming, Error detecting and error correcting codes, Bell Syst. Tech. J. 29 (2) (1950) 147–160.

[18]

A. Hardy, On the number of clusters, Comput. Stat. Data Anal. 23 (1) (1996) 83–96.

[19]

D. Haussler, Convolution Kernels on Discrete Structures, Technical Report UCSC-CRL-99-10 Department of Computer Science, University of California, Santa Cruz, Santa Cruz, CA, 1999.

[20]

I.L. Hofacker, P.F. Stadler, RNA secondary structures, in: T. Lengauer (Ed.), Bioinformatics - From Genomes to Therapies, vol. 1, Wiley-VCH, Weinheim, Germany, 2007, pp. 439–489.

[21]

A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. 31 (8) (2010) 651–666.

Digital Library

[22]

M.A. Jaro, Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida, J. Am. Stat. Assoc. 84 (406) (1989) 414–420.

[23]

X. Jiang, K. Abegglen, H. Bunke, J. Csirik, Dynamic computation of generalised median strings, Pattern Anal. Appl. 6 (3) (2003) 185–193.

[24]

X. Jiang, J. Wentker, M. Ferrer, Generalized median string computation by means of string embedding in vector spaces, Pattern Recognit. Lett. 33 (7) (2012) 842–852.

[25]

N.L. Johnson, A.W. Kemp, S. Kotz, Univariate Discrete Distributions, third edition, Wiley, Hoboken, NJ, 2005.

[26]

N.L. Johnson, S. Kotz, N. Balakrishnan, Continuous Univariate Distributions, vol. 1, second edition, Wiley, Hoboken, NJ, 1994.

[27]

N.L. Johnson, S. Kotz, N. Balakrishnan, Continuous Univariate Distributions, vol. 2, second edition, Wiley, Hoboken, NJ, 1995.

[28]

A.M. Kagan, C.R. Rao, Y.V. Linnik, Characterization Problems in Mathematical Statistics, Wiley, Hoboken, NJ, 1973.

[29]

I. Kalvari, J. Argasinska, N. Quinones-Olvera, E.P. Nawrocki, E. Rivas, S.R. Eddy, A. Bateman, R.D. Finn, A.I. Petrov, Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families, Nucleic Acids Res. 46 (D1) (2017) D335–D342.

[30]

T. Kohonen, Median strings, Pattern Recognit. Lett. 3 (5) (1985) 309–313.

[31]

S. Kotz, T. Kozubowski, K. Podgorski, The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Exonomics, Engineering, and Finance, Birkhäuser, Boston, 2001.

[32]

H. Koyano, M. Hayashida, T. Akutsu, Maximum margin classifier working in a set of strings, Proc. R. Soc. A 472 (2187) (2016).

[33]

H. Koyano, H. Kishino, Quantifying biodiversity and asymptotics for a sequence of random strings, Phys. Rev. E 81 (6) (2010).

[34]

H. Koyano, T. Tsubouchi, H. Kishino, T. Akutsu, Archaeal β diversity patterns under the seafloor along geochemical gradients, J. Geophys. Res., Biogeosci. 119 (9) (2014) 1770–1788.

[35]

Koyano, H.; Yano, K. : Evolutionary model of a population of DNA sequences through the interaction with an environment and its application to speciation analysis. arXiv:1706.01182 [q-bio.PE].

[36]

P.-S. Laplace, Mémoire sur la probabilité des causes par les événements, Mémoires de l'Academie Royale des Sciences Presentés par Divers Savants 6 (1774) 621–656.

[37]

C. Leslie, E. Eskin, J. Weston, W.S. Noble, Mismatch string kernels for SVM protein classification, in: S. Becker, S. Thrun, K. Obermayer (Eds.), Adv. Neural Inf. Process. Syst. 15, MIT Press, Cambridge, MA, 2003, pp. 1417–1424.

[38]

C. Leslie, R. Kuang, Fast string kernels using inexact matching for protein sequences, J. Mach. Learn. Res. 5 (2004) 1435–1455.

[39]

C.S. Leslie, E. Eskin, W.S. Noble, The spectrum kernel: a string kernel for SVM protein classification, in: R.B. Altman, A.K. Dunker, L. Hunter, T.E. Klein, K. Lauderdale (Eds.), Proc. Pacific Symp. Biocomput., vol. 7, 2002, pp. 566–575.

[40]

V.I. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Dokl. Akad. Nauk SSSR 163 (4) (1965) 845–848.

[41]

H. Li, T. Jiang, A class of edit kernels for SVMs to predict translation initiation sites in eukaryotic mRNAs, J. Comput. Biol. 12 (6) (2005) 702–718.

[42]

S. Lloyd, Least square quantization in PCM, IEEE Trans. Inf. Theory 28 (2) (1982) 129–137. Bell Telephone Laboratories Paper in 1957.

[43]

H. Lodhi, J. Shawe-Taylor, N. Cristianini, C. Watkins, Text classication using string kernel, in: T.K. Leen, T.G. Dietterich, V. Tresp (Eds.), Adv. Neural Inf. Process. Syst. 13, MIT Press, Cambridge, MA, 2001.

[44]

J.B. MacQueen, Some methods of classification and analysis of multivariate observations, in: L.M. Le Cam, J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, CA, 1967, pp. 281–297.

[45]

F.H.C. Marriott, Practical problems in a method of cluster analysis, Biometrics 21 (1971) 501–514.

[46]

C.D. Martínez-Hinarejos, A. Juan, F. Casacuberta, Use of median string for classification, in: Proceedings of the 15th International Conference on Pattern Recognition, 2000, vol. 2, IEEE, 2000, pp. 903–906.

[47]

C.D. Martínez-Hinarejos, A. Juan, F. Casacuberta, Median strings for k-nearest neighbour classification, Pattern Recognit. Lett. 24 (1) (2003) 173–181.

[48]

S. Maurer-Stroh, M. Debulpaep, N. Kuemmerer, M.L. de la Paz, I.C. Martins, J. Reumers, K.L. Morris, A. Copland, L. Serpell, L. Serrano, J.W.H. Schymkowitz, F. Rousseau, Exploring the sequence determinants of amyloid structure using position-specific scoring matrices, Nat. Methods 7 (2010) 237–242.

[49]

G. McLachlan, T. Krishnan, The EM Algorithm and Extensions, Wiley, Hoboken, NJ, 1997.

[50]

G. McLachlan, D. Peel, Finite Mixture Models, Wiley, Hoboken, NJ, 2004.

[51]

G. Navarro, A guided tour to approximate string matching, ACM Comput. Surv. 33 (1) (2001) 31–88.

Digital Library

[52]

F. Nicolas, E. Rivals, Complexities of the centre and median string problems, in: R. Baeza-Yates, E. Chávez, M. Crochemore (Eds.), Combinatorial Pattern Matching, Springer, Berlin, 2003, pp. 315–327.

[53]

F. Nicolas, E. Rivals, Hardness results for the center and median string problems under the weighted and unweighted edit distances, J. Discret. Algorithms 3 (2) (2005) 390–415.

[54]

H. Oja, Descriptive statistics for multivariate distributions, Stat. Probab. Lett. 1 (6) (1983) 327–332.

[55]

H. Oja, A. Niinimaa, Asymptotic properties of the generalized median in the case of multivariate normality, J. R. Stat. Soc., Ser. B, Stat. Methodol. 47 (1985) 372–377.

[56]

C. Olivares-Rodríguez, J. Oncina, A stochastic approach to median string computation, in: N. da, Vitoria Lobo, T. Kasparis, J.T. Roli, F. Kwok, M. Georgiopoulos, G.C. Anagnostopoulos, M. Loog (Eds.), Structural, Syntactic, and Statistical Pattern Recognition, Springer, Berlin, 2008, pp. 431–440.

[57]

G. Paaß, E. Leopold, M. Larson, J. Kindermann, S. Eickeler, SVM classification using sequences of phonemes and syllables, in: T. Elomaa, H. Mannila, H. Toivonen (Eds.), Proc. 6th Eur. Conf. Principles Data Min. Knowl. Discov., Springer, 2002, pp. 373–384.

[58]

K. Pearson, Contributions to the mathematical theory of evolution, Philos. Trans. R. Soc. Lond. A 185 (1894) 71–110.

[59]

M.D. Perlman, On the strong consistency of approximate maximum likelihood estimators, in: L.M. Le Cam, J. Neyman, E.L. Scott (Eds.), Proc. 6th Berkeley Symp. Math. Stat. Prob., vol. 1, University of California Press, Berkeley, CA, 1972, pp. 263–281.

[60]

C.E. Rasmussen, The infinite Gaussian mixture model, in: S.A. Solla, T.K. Leen, K.-R. Müller (Eds.), Adv. Neural Inf. Process. Syst., vol. 12, MIT Press, Cambridge, MA, 2000, pp. 554–560.

[61]

J. Rissanen, Modeling by shortest data description, Automatica 14 (5) (1978) 465–658.

Digital Library

[62]

H. Saigo, J.-P. Vert, N. Ueda, T. Akutsu, Protein homology detection using string alignment kernels, Bioinformatics 20 (11) (2004) 1682–1689.

[63]

A.A. Schäffer, Y.I. Wolf, C.P. Ponting, E.V. Koonin, L. Aravind, S.F. Altschul, IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices, Bioinformatics 15 (12) (1999) 1000–1011.

[64]

G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (2) (1978) 461–464.

[65]

H. Steinhaus, Sur la division des corps matériels en parties, Bull. Acad. Pol. Sci. 4 (12) (1956) 801–804.

[66]

R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic, J. Roy. Statist. Soc. Ser. B 63 (2) (2001) 411–423.

[67]

W.S. Torgerson, Multidimensional scaling: I. Theory and method, Psychometrika 17 (4) (1952) 401–419.

[68]

J.W. Tukey, Mathematics and the picturing of data, in: Proceedings of the International Congress of Mathematicians, Vancouver, 1974, vol. 2, 1975, pp. 523–531.

[69]

V.N. Vapnik, Statistical Learning Theory, Wiley, Hoboken, NJ, 1998.

[70]

J.-P. Vert, Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings, in: R.B. Altman, A.K. Dunker, L. Hunter, T.E. Klein, K. Lauderdale (Eds.), Proc. Pacific Symp. Biocomput., vol. 7, 2002, pp. 649–660.

[71]

S.V.N. Vishwanathan, A.J. Smola, Fast kernels for string and tree matching, in: K. Tsuda, B. Schölkopf, J.-P. Vert (Eds.), Kernel Methods in Computational Biology, MIT Press, Cambridge, MA, 2004, pp. 113–130.

[72]

A. Wald, Note on the consistency of the maximum likelihood estimate, Ann. Math. Stat. 29 (1949) 595–601.

[73]

C.S. Wallace, D.M. Boulton, An information measure for classification, Comput. J. 11 (2) (1968) 185–194.

[74]

S. Washietl, I.L. Hofacker, M. Lukasser, A. Hüttenhofer, P.F. Stadler, Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome, Nat. Biotechnol. 23 (11) (2005) 1383–1390.

[75]

M.S. Waterman, Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman and Hall, London, 1995.

[76]

C. Watkins, Dynamic Alignment Kernels, Technical Report CSD-TR-98-11 Computer Science Department, University of London, Royal Holloway, 1999.

[77]

W.E. Winkler, String comparator metrics and enhanced decision rules in the Fellegi–Sunter model of record linkage, in: Proceedings of the Section on Survey Research Methods, American Statistical Association, 1990, pp. 354–359.

[78]

J.H. Wolfe, Pattern clustering by multivariate mixture analysis, Multivar. Behav. Res. 5 (3) (1970) 329–350.

[79]

A. Zien, G. Rätsch, S. Mika, B. Schölkopf, T. Lengauer, K.-R. Müller, Engineering support vector machine kernels that recognize translation initiation sites, Bioinformatics 16 (9) (2000) 799–807.

Index Terms

Optimal string clustering based on a Laplace-like mixture and EM algorithm on a set of strings

Index terms have been assigned to the content through auto-classification.

Recommendations

A robust EM clustering algorithm for Gaussian mixture models

Clustering is a useful tool for finding structure in a data set. The mixture likelihood approach to clustering is a popular clustering method, in which the EM algorithm is the most used method. However, the EM algorithm for Gaussian mixture models is ...
A Flexible EM-Like Clustering Algorithm for Noisy Data
Though very popular, it is well known that the Expectation-Maximisation (EM) algorithm for the Gaussian mixture model performs poorly for non-Gaussian distributions or in the presence of outliers or noise. In this paper, we propose a Flexible EM-like ...
Resequencing a Set of Strings Based on a Target String

Given a set S={S1,S2, ,Sl} of l strings, a text T, and a natural number k, find a string M, which is a concatenation of k strings (not necessarily distinct, i.e., a string in S may occur more than once in M) from S, whose longest common subsequence with ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Journal of Computer and System Sciences

Journal of Computer and System Sciences Volume 106, Issue C

Dec 2019

145 pages

ISSN:0022-0000

Issue’s Table of Contents

Elsevier Inc.

Publisher

Academic Press, Inc.

United States

Publication History

Published: 01 December 2019

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 06 Jan 2025

Other Metrics

View Author Metrics

Citations

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents