US20050273274A1 - Method for identifying sub-sequences of interest in a sequence - Google Patents
Method for identifying sub-sequences of interest in a sequence Download PDFInfo
- Publication number
- US20050273274A1 US20050273274A1 US10/858,744 US85874404A US2005273274A1 US 20050273274 A1 US20050273274 A1 US 20050273274A1 US 85874404 A US85874404 A US 85874404A US 2005273274 A1 US2005273274 A1 US 2005273274A1
- Authority
- US
- United States
- Prior art keywords
- sequence
- heuristic
- data series
- recited
- grammar
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Links
- 238000000034 method Methods 0.000 title claims abstract description 74
- 238000007906 compression Methods 0.000 claims abstract description 32
- 230000006835 compression Effects 0.000 claims abstract description 32
- 238000004422 calculation algorithm Methods 0.000 claims description 27
- 229920000642 polymer Polymers 0.000 claims description 14
- 108020004414 DNA Proteins 0.000 claims description 8
- 102000053602 DNA Human genes 0.000 claims description 8
- 108091081062 Repeated sequence (DNA) Proteins 0.000 claims description 7
- 229920002477 rna polymer Polymers 0.000 claims description 7
- 125000003275 alpha amino acid group Chemical group 0.000 claims description 4
- 238000012545 processing Methods 0.000 claims description 4
- 108090000765 processed proteins & peptides Proteins 0.000 claims description 2
- 238000004364 calculation method Methods 0.000 abstract description 2
- 238000005192 partition Methods 0.000 description 11
- 108091028043 Nucleic acid sequence Proteins 0.000 description 6
- 230000003252 repetitive effect Effects 0.000 description 6
- 206010022000 influenza Diseases 0.000 description 4
- 239000002773 nucleotide Substances 0.000 description 4
- 125000003729 nucleotide group Chemical group 0.000 description 4
- 230000006870 function Effects 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 239000000654 additive Substances 0.000 description 1
- 230000000996 additive effect Effects 0.000 description 1
- 150000001413 amino acids Chemical class 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000000295 complement effect Effects 0.000 description 1
- 238000013144 data compression Methods 0.000 description 1
- 238000000354 decomposition reaction Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000002068 genetic effect Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000012163 sequencing technique Methods 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Definitions
- the invention relates generally to algorithmic information theory, and more specifically, to the identification of sequences of interest in a given data series.
- sequences of interest may be identified within a data series. It may be advantageous to identify such sequences of interest in order to extract meaningful information from the identified sequences or to allow easier manipulation or analysis of the data series. For example, identification of repetitive sequences in a data series may allow easier or more effective compression of the data.
- a method for identifying a sequence of interest includes analyzing a data series based on a grammar comprising at least an initial grammar. A statistical heuristic is calculated for each sub-sequence of the analyzed data series. A selected statistical heuristic is compared with one or more reference conditions. The grammar and the data series are updated with a symbol representing a sequence corresponding to the selected statistical heuristic based upon a non-termination result of the comparison. Alternatively, the sequence is identified as a sequence of interest based upon a termination result of the comparison.
- Code stored on tangible, machine-readable media may afford functionality of the type defined by these methods and is provided for by the present technique.
- a method for processing a data series.
- the method comprises the step of specifying a data series for analysis.
- One or more routines configured to analyze the data series based on minimum description length principles are executed.
- the analyzed data series comprising at least one sequence of interest is obtained.
- a method for identifying a biological sequence of interest.
- the method comprises analyzing a biological polymer sequence based on a grammar comprising at least an initial grammar.
- a minimum description length heuristic for each sub-sequence of the analyzed biological polymer sequence may be calculated.
- a selected minimum description length heuristic may be compared with one or more reference conditions.
- the grammar and the biological polymer sequence may be updated with a symbol representing a sub-sequence corresponding to the selected minimum description length heuristic based upon a non-termination result of the comparison.
- the sub-sequence may be identified as a biological sequence of interest based upon a termination result of the comparison.
- Code stored on tangible, machine-readable media may afford functionality of the type defined by these methods and is provided for by the present technique.
- FIG. 1 is a flowchart depicting steps for identifying a sequence of interest from a data series, in accordance with one aspect of the present technique
- FIG. 2 illustrates the identification of a sequence of interest from a data series, in accordance with one aspect of the present technique
- FIG. 3 depicts a table representing symbols and corresponding frequencies of occurrence, in accordance with one aspect of the present technique
- FIG. 4 depicts a table representing a grammar model indices for encoding a grammar, in accordance with one aspect of the present technique.
- FIG. 5 depicts identified repetitive sequences in a genome sequence, in accordance with one aspect of the present technique.
- genomic sequencing and analysis it may be desirable to identify repetitive sequences, either to assist in compression and manipulation or to facilitate analysis. In particular, it may be desirable to identify such sequences in a computationally efficient manner.
- the techniques discussed herein address some or all of these issues.
- a flow chart 10 depicts steps for identifying a sequence of interest, according to one aspect of the present technique.
- a given data series 12 may be provided, within which may be one or more sequences of interest to be identified.
- the data series 12 may be constructed from a grammar 14 .
- a grammar 14 may comprise terminals, i.e., uncombined symbols, and variables, i.e., combinations of terminals or terminal and other variables.
- the grammar may include the numerals 0-9 as terminals and combinations of the terminals as variables.
- an alphanumeric data series 12 may include numerals 1-9, the alphabetic letters, punctuation marks, and so forth as terminals and combinations of the terminals as variables.
- the grammar generally defines the characters or symbols, either alone or in combination, which may be found within the data series.
- the grammar 14 may include symbols representing nucleotides and amino acids, respectively.
- the grammar 14 and the data series 12 may be updated in the process of trying to identify sequences of interest.
- the grammar 14 includes at least an initial grammar 16 encompassing the characters or symbols of which the data series 12 is initially comprised, i.e., the initial “alphabet.”
- the grammar may also be known as a codebook or model in the art.
- the method for identifying a sequence of interest begins at step 18 , where the analysis of the data series 12 based on the grammar 14 is performed.
- analysis of the data series 12 involves partitioning the data series 12 into symbols or phrases that contribute most to the compression of the data series 12 .
- the analysis of the data series 12 may be based upon algorithmic minimum sufficient statistics, such as, but not limited to, Kolmogorov Complexity.
- Kolmogorov Complexity For simplicity, the example of Kolmogorov Complexity will be discussed herein, though one of ordinary skill in the art will appreciate that other algorithmic minimum sufficient statistics may be equally applicable to the present discussion and techniques. With regard to Kolmogorov Complexity the descriptive complexity contained in a data series 12 is measured. However, one of the drawbacks associated with Kolmogorov Complexity is that it is not computable.
- the first part of the code may be a description of a smallest set S containing the data series 12
- the second part an enumeration of the data series 12 within the finite set S.
- the data series 12 is a typical, that is, random, member of the finite set S then its index or enumeration within the set may be incompressible.
- a formal representation of these notions may generally, be represented via a Kolmogorov Structure Function, which defines a smallest set S that may be described in at most k bits containing the data sequence x of length n.
- K represents the Kolmogorov Complexity of the data series 12 .
- Equation (1) may be referred to as a Kolmogorov Minimum Sufficient Statistic.
- the Kolmogorov Complexity is not computable.
- the ability to apply this parameter to the analysis of a data series 12 is based upon the use of estimators, such as, for example, the Minimum Description Length (MDL) coding technique.
- the estimation of the Kolmogorov Complexity using the MDL coding technique typically comprises encoding the data series 12 as a hypothesis or model that identifies a presumptive distribution from which the data series 12 originated, appended with the data series 12 that is coded in an optimal way.
- MDL coding approaches the Kolmogorov Complexity or actual bound on the minimum length required for representing a data series 12 .
- the two-part description consists of a model that describes regularity associated with the data series 12 and a data portion describing the random elements of the data series 12 .
- the sum of the lengths of these two parts is equal to the Kolmogorov Complexity of the data series 12 .
- many two-part code descriptions of a data series 12 may exist, with the shortest being termed the Algorithmic Minimum Sufficient Statistic.
- the combination that minimizes a two-part descriptive cost that is, provides the shortest or most compressed description, may be termed as an MDL description.
- K ⁇ (x) represents the Kolmogorov Complexity
- S represents a finite set of which x is a typical, that is, equally likely, element.
- the symbol + is employed to indicate that both ⁇ + and > + are true, where the symbol ⁇ + is utilized to denote an inequality with an additive constant.
- the minimum possible sum of a descriptive cost for set S and the logarithm of the cardinality of the set corresponds to an MDL two-part description for data series (string) x.
- SCR Symbol Compression Ratio
- OSCR Optimal Symbol Compression Ratio
- the OSCR algorithm recursively forms a minimum sufficient algorithmic statistic consisting of a grammar defining a set to which a sequence of interest is a typical element.
- the OSCR algorithm allows the identification of one or more repeated sequences within the data series 12 which may be of interest due to their degree of repetition.
- H s The entropy of the distribution of symbols, H s , defines the average per symbol compression bound in bits per symbol for a prefix free code.
- H s - ⁇ i ⁇ p i ⁇ log 2 ⁇ ( p i ) .
- the entropy may be used in the optimization of the partition (the number of symbols, their length, and distribution) of the data series 12 such that the compression bound plus the grammar size is minimized according to the MDL criteria.
- the compression bound is defined by the product, R*H s , where R represents the total number of repetitions.
- ⁇ circumflex over (R) ⁇ may be computed from a known partition p and the length and number of repetitions of the candidate symbol i.
- d i r i (log 2 ( ⁇ circumflex over (R) ⁇ ) ⁇ log 2 ( r i ))+ l i (9)
- d i is the descriptive cost of symbol i
- l i is the length of symbol i
- r i is the number of repetitions of symbol i in data series 12 .
- Equation (9) represents a metric that may be employed to estimate the descriptive cost of any possible symbol in the data series 12 .
- a measure of an MDL based heuristic for a particular symbol may be represented by the descriptive length of the data series 12 divided by the length of the data series 12 covered by this symbol.
- ⁇ i represents the SCR
- d i is the descriptive cost of symbol i
- L i is the length of data series 12 consumed by symbol i
- l i is the length of symbol i
- r i is the number of repetitions of symbol i in the data series 12
- ⁇ circumflex over (R) ⁇ is a constant for a given partition.
- an algorithm referred to herein as the Optimal Symbol Compression Ratio (OSCR) algorithm, may be based on the heuristic presented in equation (10) and may be utilized in processing the data series 12 .
- This algorithm recursively forms a partition of the data series 12 (string x) into symbols that have the best SCR among possible symbols contained in x.
- the concept is to form a grammar dictionary that provides near optimal compression by adding one symbol at a time based upon the SCR of the symbol.
- one method for identifying a sequence of interest is summarized in the flow chart of FIG. 1 .
- the depicted method may employ the OSCR algorithm summarized above or other similar recursive techniques.
- the present technique is generally directed to finding a finite set of patterns from which a data series 12 is typical in the Kolmogorov Minimum Sufficient Statistic or MDL sense.
- the following paragraphs summarize exemplary steps that may be performed to analyze a data sequence in accordance with the present technique.
- a data series 12 representing the input string x may be analyzed at step 18 based on the current grammar.
- the grammar 14 typically includes at least an initial grammar 16 , where the initial grammar 16 may include an alphabet of terminals of size N symbols .
- the N symbols may, for example, represent unique bytes encountered in the input data series 12 .
- a lexicographic ordering of this alphabet may be constructed, if desired, at step 18 .
- an array A index representing the input data series 12 as an array of index values may also be constructed as part of the analysis of step 18 .
- a tree of non-overlapping sub-strings contained in input string x, that is, the data series 12 , that occur more often than (or equal to) a threshold value is formed as a product of the analysis step 18 , where each sub-string or a potential codeword for the grammar 14 may be represented as a node of the tree.
- the threshold value may represent twice the frequency of occurrence of the sub-string in the data series 12 , though other thresholds are also possible.
- the frequency of occurrence of each of the sub-strings may be noted during step 18 . This may be accomplished by recursively searching the data series 12 for repetitions of the sub-strings and noting or storing the frequency of occurrence of each sub-string.
- a MDL based statistical heuristic such as a SCR
- a statistical heuristic such as the sub-string having the lowest SCR may be selected and stored, along with the corresponding sub-string length and the number of repetitions, as parameters, ⁇ select , l select and r select respectively.
- the lowest SCR may be tested in view of the desired termination conditions.
- the termination condition may be represented as follows: ( ⁇ best ⁇ 1) ⁇ ( r select ⁇ l select ⁇ select >G min ) (11) where G min is a configurable minimum value to ensure compression at each step, if desired.
- G min is a configurable minimum value to ensure compression at each step, if desired.
- the value of G min may be varied from zero to higher threshold values that may reduce the number of algorithm iterations.
- steps 26 and 28 may be performed, allowing additional analysis and testing to occur until the termination criteria are met. For example, at step 26 , all occurrences of the selected sub-string, that is, the sub-string having the lowest SCR, may be replaced within the data series 12 with a corresponding symbol. In this manner, the data series 12 may be simplified or compressed in view of the identified sub-string.
- the termination criteria such as the threshold of equation (11)
- the grammar 14 may be updated to include the symbol, which will be present in the data series 12 in the next iteration of the process upon completion of step 26 .
- the recursive operation of the OSCR algorithm progressively adds symbols to the grammar based on the contribution of the symbols to the compression of the available candidates.
- the above steps of analyzing, calculating a statistical heuristic, comparing the lowest SCR, and updating the grammar 14 and the data series 12 may be recursively iterated through until a termination result is indicated at decision block 24 .
- the current sub-string having the lowest SCR may be identified as a sequence of interest at step 30 .
- the sequence identified in this manner may be of interest because the repetition of the sequence suggests importance, such as may occur in biological polymers such as DNA, RNA, or amino acid sequences.
- the data series 12 may be encoded using techniques, such as, Huffman and arithmetic coding.
- new indices relating to grammar sub-strings or grammar rules may be added and an array of symbols representing the data series 12 recomputed as variables are added to the grammar 14 through the recursions of the OSCR algorithm.
- the SCR heuristic defined by equation (10), may be modified to accommodate the encoding of the model.
- the number of indices in the model, representative of the number of grammar terminals and grammar variables, relate directly to the model cost of a new codeword.
- FIG. 2 An example employing the OSCR algorithm described above to identify a sequence of interest is provided by FIG. 2 .
- the OSCR algorithm may be applied to identify a sequence of interest that may facilitate compression and estimation of the sophistication, where sophistication, as known to one of ordinary skill in the art, is a measure of meaningful information in a data series, of the given sample data series 34 :
- a listing of an initial symbol alphabet and their corresponding frequencies of occurrence in the sample data series 34 are listed in table 42 .
- a first column 44 of table 42 lists the initial symbol alphabet, or grammar, which includes seven terminals ⁇ a, _[space], r, o, s, e, i ⁇ .
- a second column 46 of table 42 represents the frequency of occurrence of each symbol.
- An integer sequence representation of the data series 34 may be depicted as: a r o s e i s a r o S e i s a r o s e 1 2 3 4 5 6 2 7 5 2 1 2 3 4 5 6 2 7 5 2 1 2 3 4 5 6 2 7 5 2 1 2 3 4 5 6 Expanding the tree with sub-strings beginning with the terminal a illustrates that there are three occurrences of each of the sub-strings:
- the initial tree statistics and SCR calculation may be computed by following the steps of the OSCR algorithm outlined above. For example, in the instance where the symbol i represents the sub-string “a_”, the length, l, of symbol i, is 2. Moreover, the number of repetitions, r, of symbol i in the data series 34 would be 3. Additionally, the total number of repetitions, R, is computed to be 26. In a similar fashion, in the instance where “a_rose” is the symbol, the length, l, would be 6, the number of repetitions, r, would be 3, and the total number of repetitions, R, would be 11. Similarly, for the sub-string “a_rose_”, the values of l, r and R would be 7, 2 and 14 respectively.
- the value of the SCR, ⁇ is computed for each symbol.
- “a_” ⁇ would be 1.023, as computed using equation (10).
- ⁇ would be 0.500
- ⁇ would be 0.7143.
- the model may be updated by the addition of this codeword.
- the instances of the sub-string “a_rose” in the sample data series 34 are replaced with the symbol S 1 .
- an integer sequence representation of the data series 34 may be depicted as: S 1 — i s — S 1 — i s — S 1 8 2 7 5 2 8 2 7 5 2 8 Iterating through the algorithm a second time provides a second rule:
- the values in a first column of the table 48 represent the respective indices 50 corresponding to a symbol (terminal or variable).
- a second column of the table 48 represents a symbol array 52 .
- the symbol array is the ASCII code for the corresponding ternimal.
- the symbol array 52 represents the grammar rule, constructing the phrase from indices corresponding to other terminals and variables. The separator symbol is added artificially and given an index of zero.
- a third column of the table 48 represents a decoded phrase 54 corresponding to each symbol listed in the second column of the table 48 .
- ADCII symbols may be sent literally in seven bytes.
- the two codewords may be encoded as a sequence of eleven numbers with two additional separators.
- Huffman code lengths may be sent for each variable and terminal in lexicographic order, with a single bit assigned for phrase 8 and its complement for phrase 9.
- the data series 34 may be illustrated as follows: Terminals Codeword 1 Codeword 2 HuffCodelengths 97 13 114 111 112 101 105 1 2 3 4 5 6 0 2 7 5 2 8 0 0 0 0 0 0 0 1 1 and may be encoded using arithmetic coding and appended with the data portion of the code.
- the resulting model size, including a configuration byte, is 22.25 bytes.
- the entire encoded cost for the originally 26-byte sample data series 34 is 22.6 bytes.
- an OSCR algorithm may be employed to identify sequences of interest in a given biological (or other) data series such as DNA sequences, RNA sequences, amino acid sequences, and so forth.
- a DNA sequence is the data series of interest
- the analysis of the DNA sequence amounts to an analysis of the sequence of nucleotides forming the DNA.
- the nucleotide sequence is formed from a four-symbol alphabet, that is, four nucleotides, represented by the symbols ⁇ A, T, C, G ⁇ that form the genetic code.
- DNA sequences are very random in nature, and discerning structure may lead to attractive discoveries or important sequences within the genome.
- the biological polymer sequence is represented by the H Influenza genomic sequence 58 .
- the H Influenza genomic sequence 58 represents the input biological polymer sequence x.
- the grammar may include an initial grammar, where the initial grammar may include an alphabet of terminals of size N symbols .
- the N symbols may, for example, represent unique bytes encountered in the biological sequence 58 , such as ⁇ A, T, C, G ⁇ . Additionally, a lexicographic ordering of this alphabet may be constructed.
- an array A index representing the input biological sequence 56 as an array of index values may also be constructed.
- repetitive sequences of DNA in the H influenza genome may be identified, such as repeated sequence 60 of FIG. 5 .
- the OSCR algorithm identifies the repeated sequences 60 that are present within the DNA sequences HI0221, HI0221.1 and H10222, which form part of the H influenza genome. While the repeated sequences 60 are known to be of interest, the ease with which they are identified by application of the OSCR algorithm, suggests that application of such an algorithm may greatly facilitate identification of new and important repeated sequences in a computationally efficient manner.
- demonstrations, and process steps may be implemented by suitable code on a processor-based system, such as a general-purpose or special-purpose computer. It should also be noted that different implementations of the present technique may perform some or all of the steps described herein in different orders or substantially concurrently, that is, in parallel. Furthermore, the functions may be implemented in a variety of programming languages, such as C++ or JAVA.
- Such code may be stored or adapted for storage on one or more tangible, machine readable media, such as on memory chips, local or remote hard disks, optical disks (that is, CD's or DVD's), or other media, which may be accessed by a processor-based system to execute the stored code.
- the tangible media may comprise paper or another suitable medium upon which the instructions are printed.
- the instructions can be electronically captured via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
- the method described herein of identifying sequences of interest, given an input data sequence based on MDL principles enables the implementation of a universal compression algorithm capable of universal complexity as well as sophistication estimation.
- the present techniques may be used to identify repetitive sequences, which may be of interest in a variety of fields, such as cryptography or biological research. In particular, any field in which pattern recognition is a component may benefit from the techniques described herein.
Landscapes
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Biophysics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Health & Medical Sciences (AREA)
- Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The present technique provides for the analysis of a data series to identify sequences of interest within the series. The analysis may be used to iteratively update a grammar used to analyze the data series or updated versions of the data series. Furthermore, the technique provides for the calculation of a minimum description length heuristic, such as a symbol compression ratio, for each sub-sequence of the analyzed data sequence. The technique may then compare a selected heuristic value against one or more reference conditions to determine if additional iteration is to be performed. The grammar and the data sequence may be updated between iterations to include a symbol representing a string corresponding to the selected heuristic value based upon a non-termination result of the comparison. Alternatively, the string corresponding to the selected heuristic value may be identified as a sequence of interest based upon a termination result of the comparison.
Description
- The invention relates generally to algorithmic information theory, and more specifically, to the identification of sequences of interest in a given data series.
- In various applications, such as information theory, data compression, and intrusion detection, it may be desirable to identify sequences of interest within a data series. It may be advantageous to identify such sequences of interest in order to extract meaningful information from the identified sequences or to allow easier manipulation or analysis of the data series. For example, identification of repetitive sequences in a data series may allow easier or more effective compression of the data.
- Similarly, in the field of genetics, biologically interesting phrases or sequences in a genome, such as the human genome, may have higher redundancy than non-meaningful phrases, as nature tends to repeat or emphasize important sequences more frequently than unimportant sequences. However, for the genomes, which are known or are being sequenced, the purposes of different parts of the genomes are currently unknown. Hence, the identification of meaningful or interesting sequences within a genome may pose a challenge.
- Furthermore, it is increasingly difficult to identify meaningful sequences of interest employing traditional techniques. In particular, the vast amount of data, such as genome data is difficult to analyze using traditional techniques in a computationally efficient manner. In addition, existing computational techniques to determine meaningful information may be inadequate for the identification of sequences of interest. For example, existing techniques may fail to identify DNA sequences in a genome that are known to be of interest, such as sequences experimentally demonstrated to be of interest. Hence, it may be desirable to develop techniques that efficiently and accurately recognize sequences of interest within a data series.
- Briefly, in accordance with one embodiment of the present technique a method for identifying a sequence of interest is presented. The method includes analyzing a data series based on a grammar comprising at least an initial grammar. A statistical heuristic is calculated for each sub-sequence of the analyzed data series. A selected statistical heuristic is compared with one or more reference conditions. The grammar and the data series are updated with a symbol representing a sequence corresponding to the selected statistical heuristic based upon a non-termination result of the comparison. Alternatively, the sequence is identified as a sequence of interest based upon a termination result of the comparison. Code stored on tangible, machine-readable media may afford functionality of the type defined by these methods and is provided for by the present technique.
- In accordance with another embodiment of the present technique, a method is provided for processing a data series. The method comprises the step of specifying a data series for analysis. One or more routines configured to analyze the data series based on minimum description length principles are executed. The analyzed data series comprising at least one sequence of interest is obtained.
- In accordance with a further embodiment of the present technique, a method is provided for identifying a biological sequence of interest. The method comprises analyzing a biological polymer sequence based on a grammar comprising at least an initial grammar. A minimum description length heuristic for each sub-sequence of the analyzed biological polymer sequence may be calculated. A selected minimum description length heuristic may be compared with one or more reference conditions. The grammar and the biological polymer sequence may be updated with a symbol representing a sub-sequence corresponding to the selected minimum description length heuristic based upon a non-termination result of the comparison. Alternatively, the sub-sequence may be identified as a biological sequence of interest based upon a termination result of the comparison. Code stored on tangible, machine-readable media may afford functionality of the type defined by these methods and is provided for by the present technique.
- These and other features, aspects, and advantages of the present invention will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:
-
FIG. 1 is a flowchart depicting steps for identifying a sequence of interest from a data series, in accordance with one aspect of the present technique; -
FIG. 2 illustrates the identification of a sequence of interest from a data series, in accordance with one aspect of the present technique; -
FIG. 3 depicts a table representing symbols and corresponding frequencies of occurrence, in accordance with one aspect of the present technique; -
FIG. 4 depicts a table representing a grammar model indices for encoding a grammar, in accordance with one aspect of the present technique; and -
FIG. 5 depicts identified repetitive sequences in a genome sequence, in accordance with one aspect of the present technique. - In many fields, such as genomic sequencing and analysis, it may be desirable to identify repetitive sequences, either to assist in compression and manipulation or to facilitate analysis. In particular, it may be desirable to identify such sequences in a computationally efficient manner. The techniques discussed herein address some or all of these issues.
- Turning now to the drawings, and referring to
FIG. 1 , aflow chart 10 depicts steps for identifying a sequence of interest, according to one aspect of the present technique. As suggested by theflow chart 10, a givendata series 12 may be provided, within which may be one or more sequences of interest to be identified. Thedata series 12 may be constructed from agrammar 14. As will be appreciated by those of ordinary skill in the art, agrammar 14 may comprise terminals, i.e., uncombined symbols, and variables, i.e., combinations of terminals or terminal and other variables. For example, for anumeric data series 12, the grammar may include the numerals 0-9 as terminals and combinations of the terminals as variables. Similarly, analphanumeric data series 12 may include numerals 1-9, the alphabetic letters, punctuation marks, and so forth as terminals and combinations of the terminals as variables. In other words, the grammar generally defines the characters or symbols, either alone or in combination, which may be found within the data series. In the context of biological sequences, such as deoxyribonucleic acid (DNA) sequences and ribonucleic acid (RNA) sequences or peptide sequences, thegrammar 14 may include symbols representing nucleotides and amino acids, respectively. As will be discussed herein, thegrammar 14 and thedata series 12 may be updated in the process of trying to identify sequences of interest. Initially, however, thegrammar 14 includes at least aninitial grammar 16 encompassing the characters or symbols of which thedata series 12 is initially comprised, i.e., the initial “alphabet.” Furthermore, as will be appreciated by one of ordinary skill in the art, the grammar may also be known as a codebook or model in the art. - Once the
data series 12 andgrammar 14 are established, the method for identifying a sequence of interest begins atstep 18, where the analysis of thedata series 12 based on thegrammar 14 is performed. In accordance with one embodiment of the present technique, analysis of thedata series 12 involves partitioning thedata series 12 into symbols or phrases that contribute most to the compression of thedata series 12. Furthermore, the analysis of thedata series 12 may be based upon algorithmic minimum sufficient statistics, such as, but not limited to, Kolmogorov Complexity. - For simplicity, the example of Kolmogorov Complexity will be discussed herein, though one of ordinary skill in the art will appreciate that other algorithmic minimum sufficient statistics may be equally applicable to the present discussion and techniques. With regard to Kolmogorov Complexity the descriptive complexity contained in a
data series 12 is measured. However, one of the drawbacks associated with Kolmogorov Complexity is that it is not computable. - This drawback may be addressed by dividing the
data series 12 into a two-part code or description in which the first part represents regularity in thedata series 12 and the second part represents the random part of thedata series 12. For example, the first part of the code may be a description of a smallest set S containing thedata series 12, and the second part an enumeration of thedata series 12 within the finite set S. In this example, if thedata series 12 is a typical, that is, random, member of the finite set S then its index or enumeration within the set may be incompressible. However, if thedata series 12 is not a typical member of set S, that is, is not random, then this regularity may be employed to form a smaller set of which thedata series 12 is an element. A formal representation of these notions may generally, be represented via a Kolmogorov Structure Function, which defines a smallest set S that may be described in at most k bits containing the data sequence x of length n. The Kolmogorov Structure Function may be defined as follows:
where K represents the Kolmogorov Complexity of thedata series 12. Equation (1) may be referred to as a Kolmogorov Minimum Sufficient Statistic. - As noted above, in general, the Kolmogorov Complexity is not computable. Hence, the ability to apply this parameter to the analysis of a
data series 12 is based upon the use of estimators, such as, for example, the Minimum Description Length (MDL) coding technique. The estimation of the Kolmogorov Complexity using the MDL coding technique, typically comprises encoding thedata series 12 as a hypothesis or model that identifies a presumptive distribution from which thedata series 12 originated, appended with thedata series 12 that is coded in an optimal way. In other words, the length of an MDL message is determined as follows:
L(M)=L(H)+L(D) (2)
where L(M) is a message length, L(H) is a length of a specification of a hypothesis regarding thedata series 12, and L(D) is a length of thedata series 12, encoded in an optimal manner given hypothesis H. Generally, MDL coding approaches the Kolmogorov Complexity or actual bound on the minimum length required for representing adata series 12. - As discussed above, the two-part description consists of a model that describes regularity associated with the
data series 12 and a data portion describing the random elements of thedata series 12. The sum of the lengths of these two parts is equal to the Kolmogorov Complexity of thedata series 12. In general, many two-part code descriptions of adata series 12 may exist, with the shortest being termed the Algorithmic Minimum Sufficient Statistic. Among the possible two-part descriptions of thedata series 12, the combination that minimizes a two-part descriptive cost, that is, provides the shortest or most compressed description, may be termed as an MDL description. - By means of example, an MDL decomposition of a data series x, where the data series may be a binary string, may be represented by:
K φ(x)=+K(S)+log2(|S|) (3)
where Kφ(x) represents the Kolmogorov Complexity, S represents a finite set of which x is a typical, that is, equally likely, element. The symbol =+ is employed to indicate that both <+ and >+ are true, where the symbol <+ is utilized to denote an inequality with an additive constant. The minimum possible sum of a descriptive cost for set S and the logarithm of the cardinality of the set corresponds to an MDL two-part description for data series (string) x. - Keeping in mind the preceding discussion of complexity estimation, one of ordinary skill in the art will appreciate that is may be desirable to partition the
data series 12 ofFIG. 1 into symbols that provide a near optimal compression among possible partitions. In order to achieve this a Symbol Compression Ratio (SCR) heuristic may be employed to evaluate the contribution of an individual symbol to overall compression ratio between different partitions. Furthermore, the SCR heuristic may be recursively implemented as an Optimal Symbol Compression Ratio (OSCR) algorithm to estimate the complexity of thedata series 12. This compression technique estimates the Kolmogorov Complexity via the MDL technique described above by recursively modeling thedata series 12 as a concatenation of a finite set of symbols. In this manner, the OSCR algorithm recursively forms a minimum sufficient algorithmic statistic consisting of a grammar defining a set to which a sequence of interest is a typical element. In other words, the OSCR algorithm allows the identification of one or more repeated sequences within thedata series 12 which may be of interest due to their degree of repetition. - The effect of a partition based on MDL, as described above, may be studied by examining the entropy of the distribution of symbols. The entropy of the distribution of symbols, Hs, defines the average per symbol compression bound in bits per symbol for a prefix free code. For a distribution p of I symbols, the entropy may be defined as:
- The entropy may be used in the optimization of the partition (the number of symbols, their length, and distribution) of the
data series 12 such that the compression bound plus the grammar size is minimized according to the MDL criteria. In particular, the compression bound is defined by the product, R*Hs, where R represents the total number of repetitions. The size of the grammar, that is the model descriptive cost, M, (also known as descriptive length), may be estimated as the sum of the lengths of unique symbols:
where li is the length of symbol i. Furthermore, an estimate of the total descriptive length Dp may be computed as:
D p =M+R.H s (6)
where R is the total number of repetitions, and Hs is the entropy. - Typically, in seeking to partition the
data series 12 to minimize the total string descriptive cost Dp, the factors considered are the length that the presence of each symbol adds to the total descriptive length and the amount of coverage of the total string length L that it provides. Thus, the descriptive length of thedata series 12 under partition p may be defined as:
where R is the number of repetitions, li is the length of symbol i, and ri is the number of repetitions of symbol i indata series 12. A per symbol descriptive cost may be estimated by employing the following equations:
where {circumflex over (R)} is a constant for a given partition. - In a recursive implementation, such as may be employed in one implementation of the present technique, {circumflex over (R)} may be computed from a known partition p and the length and number of repetitions of the candidate symbol i. Thus,
d i =r i(log2({circumflex over (R)})−log2(r i))+l i (9)
where di is the descriptive cost of symbol i, li is the length of symbol i, and ri is the number of repetitions of symbol i indata series 12. Equation (9) represents a metric that may be employed to estimate the descriptive cost of any possible symbol in thedata series 12. - A measure of an MDL based heuristic for a particular symbol, may be represented by the descriptive length of the
data series 12 divided by the length of thedata series 12 covered by this symbol. For example, an MDL based heuristic, such as the Symbol Compression Ratio (SCR) may be defined by the following equation:
where λi represents the SCR, di is the descriptive cost of symbol i, Li is the length ofdata series 12 consumed by symbol i, li is the length of symbol i, ri is the number of repetitions of symbol i in thedata series 12, and {circumflex over (R)} is a constant for a given partition. - As discussed above, an algorithm, referred to herein as the Optimal Symbol Compression Ratio (OSCR) algorithm, may be based on the heuristic presented in equation (10) and may be utilized in processing the
data series 12. This algorithm recursively forms a partition of the data series 12 (string x) into symbols that have the best SCR among possible symbols contained in x. The concept is to form a grammar dictionary that provides near optimal compression by adding one symbol at a time based upon the SCR of the symbol. - Referring again to
FIG. 1 , one method for identifying a sequence of interest is summarized in the flow chart ofFIG. 1 . The depicted method may employ the OSCR algorithm summarized above or other similar recursive techniques. As will be appreciated by those of ordinary skill in the art, the present technique is generally directed to finding a finite set of patterns from which adata series 12 is typical in the Kolmogorov Minimum Sufficient Statistic or MDL sense. The following paragraphs summarize exemplary steps that may be performed to analyze a data sequence in accordance with the present technique. - As depicted in
FIG. 1 , adata series 12 representing the input string x may be analyzed atstep 18 based on the current grammar. Thegrammar 14 typically includes at least aninitial grammar 16, where theinitial grammar 16 may include an alphabet of terminals of size Nsymbols. The Nsymbols may, for example, represent unique bytes encountered in theinput data series 12. Additionally, a lexicographic ordering of this alphabet may be constructed, if desired, atstep 18. Furthermore, an array Aindex representing theinput data series 12 as an array of index values may also be constructed as part of the analysis ofstep 18. - A tree of non-overlapping sub-strings contained in input string x, that is, the
data series 12, that occur more often than (or equal to) a threshold value is formed as a product of theanalysis step 18, where each sub-string or a potential codeword for thegrammar 14 may be represented as a node of the tree. In one implementation, the threshold value may represent twice the frequency of occurrence of the sub-string in thedata series 12, though other thresholds are also possible. In addition, the frequency of occurrence of each of the sub-strings, may be noted duringstep 18. This may be accomplished by recursively searching thedata series 12 for repetitions of the sub-strings and noting or storing the frequency of occurrence of each sub-string. - At
step 20, a MDL based statistical heuristic, such as a SCR, for all the sub-sequences may be computed using equation (10), as described above. Furthermore, in the depicted example, a statistical heuristic such as the sub-string having the lowest SCR may be selected and stored, along with the corresponding sub-string length and the number of repetitions, as parameters, λselect, lselect and rselect respectively. - At
step 22, the lowest SCR, as determined above, may be tested in view of the desired termination conditions. For instance, the termination condition may be represented as follows:
(λbest<1)∩(r select ·l select·λselect >G min) (11)
where Gmin is a configurable minimum value to ensure compression at each step, if desired. For example, the value of Gmin may be varied from zero to higher threshold values that may reduce the number of algorithm iterations. - Subsequently, at
decision block 24, checks are performed to determine whether the termination criteria, such as the threshold of equation (11), have been satisfied. If the termination criteria have not been satisfied, steps 26 and 28 may be performed, allowing additional analysis and testing to occur until the termination criteria are met. For example, atstep 26, all occurrences of the selected sub-string, that is, the sub-string having the lowest SCR, may be replaced within thedata series 12 with a corresponding symbol. In this manner, thedata series 12 may be simplified or compressed in view of the identified sub-string. - Similarly, at
step 28, thegrammar 14 may be updated to include the symbol, which will be present in thedata series 12 in the next iteration of the process upon completion ofstep 26. In this manner, the recursive operation of the OSCR algorithm progressively adds symbols to the grammar based on the contribution of the symbols to the compression of the available candidates. The above steps of analyzing, calculating a statistical heuristic, comparing the lowest SCR, and updating thegrammar 14 and thedata series 12 may be recursively iterated through until a termination result is indicated atdecision block 24. When a termination result is obtained atdecision block 24, the current sub-string having the lowest SCR may be identified as a sequence of interest atstep 30. As noted above, the sequence identified in this manner may be of interest because the repetition of the sequence suggests importance, such as may occur in biological polymers such as DNA, RNA, or amino acid sequences. - As will be appreciated by those of ordinary skill in the art, the
data series 12 may be encoded using techniques, such as, Huffman and arithmetic coding. In such implementations, new indices relating to grammar sub-strings or grammar rules may be added and an array of symbols representing thedata series 12 recomputed as variables are added to thegrammar 14 through the recursions of the OSCR algorithm. In addition, the SCR heuristic, defined by equation (10), may be modified to accommodate the encoding of the model. The number of indices in the model, representative of the number of grammar terminals and grammar variables, relate directly to the model cost of a new codeword. Furthermore, as described herein, at each iteration of the algorithm symbols may be combined to form new codewords, which in turn may be represented by new symbols, which may in turn be incorporated into subsequent processing and analysis. If every index is considered to be equally likely, a model cost of Hm=log2 (max Index) bits for each symbol may be assigned. Furthermore, the cost of sending the Huffman code length for each symbol may be accounted for by sending one additional symbol. The SCR heuristic, accounting for the model cost Hm in bytes may be redefined as follows: - An example employing the OSCR algorithm described above to identify a sequence of interest is provided by
FIG. 2 . As depicted inFIG. 2 , the OSCR algorithm may be applied to identify a sequence of interest that may facilitate compression and estimation of the sophistication, where sophistication, as known to one of ordinary skill in the art, is a measure of meaningful information in a data series, of the given sample data series 34: -
- a_rose_is_a_rose_is_a_rose.
In this example,reference numerals
- a_rose_is_a_rose_is_a_rose.
- Referring now to
FIG. 3 , a listing of an initial symbol alphabet and their corresponding frequencies of occurrence in thesample data series 34 are listed in table 42. Afirst column 44 of table 42 lists the initial symbol alphabet, or grammar, which includes seven terminals {a, _[space], r, o, s, e, i}. Furthermore, asecond column 46 of table 42 represents the frequency of occurrence of each symbol. - An integer sequence representation of the
data series 34, with terminal symbols replaced with a corresponding index value, may be depicted as:a r o s e i s a r o S e i s a r o s e 1 2 3 4 5 6 2 7 5 2 1 2 3 4 5 6 2 7 5 2 1 2 3 4 5 6
Expanding the tree with sub-strings beginning with the terminal a illustrates that there are three occurrences of each of the sub-strings: -
- a, a_, a_r, a_ro, a_ros, a_rose.
However, only two occurrences of longer sub-strings, such as “a_rose_” 40, are present. For simplicity, this example is limited to the consideration of the sub-strings “a_” 36, “a_rose” 38 and “a_rose_” 40.
- a, a_, a_r, a_ro, a_ros, a_rose.
- The initial tree statistics and SCR calculation may be computed by following the steps of the OSCR algorithm outlined above. For example, in the instance where the symbol i represents the sub-string “a_”, the length, l, of symbol i, is 2. Moreover, the number of repetitions, r, of symbol i in the
data series 34 would be 3. Additionally, the total number of repetitions, R, is computed to be 26. In a similar fashion, in the instance where “a_rose” is the symbol, the length, l, would be 6, the number of repetitions, r, would be 3, and the total number of repetitions, R, would be 11. Similarly, for the sub-string “a_rose_”, the values of l, r and R would be 7, 2 and 14 respectively. - In addition to the computation of the parameters outlined above, the value of the SCR, λ, is computed for each symbol. For the symbol, “a_” λ would be 1.023, as computed using equation (10). Similarly, for the sub-string “a_rose”, λ would be 0.500 and for the sub-string “a_rose_”, λ would be 0.7143.
- As may be observed, the value of A decreases as the length of the sub-string increases until the number of repetitions drops from 3 to 2. Since there exist only two repetitions of the phrase “a_rose_”, the best, that is, the lowest, A along this branch is λbest=½, and the search may be terminated along this branch. Because the selection is based on the SCR, λ, the selection process does not necessarily select the most frequently repeated or the longest symbol, that is, sub-string. Instead the length of the sub-string and the amount of repetition are both factors, but not determinative factors, in the selection process. For instance, in this example, the selected codeword is:
-
- S1→a_rose
- After this selection, the model may be updated by the addition of this codeword. In addition, the instances of the sub-string “a_rose” in the
sample data series 34 are replaced with the symbol S1. Furthermore, replacing the sub-string “a_rose” with a corresponding symbol index of 8, an integer sequence representation of thedata series 34, may be depicted as:S1 — i s — S1 — i s — S1 8 2 7 5 2 8 2 7 5 2 8
Iterating through the algorithm a second time provides a second rule: -
- S2→is_S1
- which may, by updating the grammar, the
sample data series 34, and the respective array and index, be represented as the array sequence:S1 S2 S2 8 9 9 - The resulting grammar may be summarized as follows:
-
- S1→a_rose, S2→is_S1, S→S1 S2 S2.
Additionally, the model may be summarized as: - S1→a_rose f(S1)=1
- S2→is_S1 f(S2)=2
where f(S) represents the frequency of occurrence of each phrase.
- S1→a_rose, S2→is_S1, S→S1 S2 S2.
- Referring to
FIG. 4 , the grammar model indices for encoding of this grammar are summarized. The values in a first column of the table 48 represent therespective indices 50 corresponding to a symbol (terminal or variable). A second column of the table 48 represents asymbol array 52. For grammar terminals (index values 1 through 7) the symbol array is the ASCII code for the corresponding ternimal. Furthermore, for grammar variables (index values 8 and 9), thesymbol array 52 represents the grammar rule, constructing the phrase from indices corresponding to other terminals and variables. The separator symbol is added artificially and given an index of zero. Moreover, a third column of the table 48 represents a decodedphrase 54 corresponding to each symbol listed in the second column of the table 48. - Since there are only seven unique bytes in this grammar (alphabet), these ADCII symbols may be sent literally in seven bytes. The two codewords may be encoded as a sequence of eleven numbers with two additional separators. Huffman code lengths may be sent for each variable and terminal in lexicographic order, with a single bit assigned for phrase 8 and its complement for phrase 9.
- The
data series 34 may be illustrated as follows:Terminals Codeword 1 Codeword 2HuffCodelengths 97 13 114 111 112 101 105 1 2 3 4 5 6 0 2 7 5 2 8 0 0 0 0 0 0 0 0 1 1
and may be encoded using arithmetic coding and appended with the data portion of the code. The resulting model size, including a configuration byte, is 22.25 bytes. Thus, the entire encoded cost for the originally 26-bytesample data series 34 is 22.6 bytes. Hence, it may be inferred that, due to the high model cost compared to the data cost, there is significant pattern content in thesample data series 34. - As may be appreciated, the present techniques may be used to identify repetitive sequences within lengthy or large amounts of data. For example, an OSCR algorithm may be employed to identify sequences of interest in a given biological (or other) data series such as DNA sequences, RNA sequences, amino acid sequences, and so forth. For instance, in a case where a DNA sequence is the data series of interest, the analysis of the DNA sequence amounts to an analysis of the sequence of nucleotides forming the DNA. The nucleotide sequence is formed from a four-symbol alphabet, that is, four nucleotides, represented by the symbols {A, T, C, G} that form the genetic code. Generally, DNA sequences are very random in nature, and discerning structure may lead to attractive discoveries or important sequences within the genome.
- As discussed above, a recursive process, such as may be implemented via an OSCR algorithm described herein, may be performed on the biological data series. Turning now to
FIG. 5 , a biological polymer sequence ofinterest 56 identified by the application of the present techniques is illustrated. InFIG. 5 , the biological polymer sequence is represented by the H Influenzagenomic sequence 58. During the analysis step, the H Influenzagenomic sequence 58 represents the input biological polymer sequence x. The grammar may include an initial grammar, where the initial grammar may include an alphabet of terminals of size Nsymbols. The Nsymbols may, for example, represent unique bytes encountered in thebiological sequence 58, such as {A, T, C, G}. Additionally, a lexicographic ordering of this alphabet may be constructed. Furthermore, an array Aindex representing the inputbiological sequence 56 as an array of index values may also be constructed. - Based on a recursive analysis, as set forth above, repetitive sequences of DNA in the H influenza genome may be identified, such as repeated
sequence 60 ofFIG. 5 . In particular, the OSCR algorithm identifies the repeatedsequences 60 that are present within the DNA sequences HI0221, HI0221.1 and H10222, which form part of the H influenza genome. While the repeatedsequences 60 are known to be of interest, the ease with which they are identified by application of the OSCR algorithm, suggests that application of such an algorithm may greatly facilitate identification of new and important repeated sequences in a computationally efficient manner. - As will be appreciated by those of ordinary skill in the art, the foregoing example, demonstrations, and process steps may be implemented by suitable code on a processor-based system, such as a general-purpose or special-purpose computer. It should also be noted that different implementations of the present technique may perform some or all of the steps described herein in different orders or substantially concurrently, that is, in parallel. Furthermore, the functions may be implemented in a variety of programming languages, such as C++ or JAVA. Such code, as will be appreciated by those of ordinary skill in the art, may be stored or adapted for storage on one or more tangible, machine readable media, such as on memory chips, local or remote hard disks, optical disks (that is, CD's or DVD's), or other media, which may be accessed by a processor-based system to execute the stored code. Note that the tangible media may comprise paper or another suitable medium upon which the instructions are printed. For instance, the instructions can be electronically captured via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
- The method described herein of identifying sequences of interest, given an input data sequence based on MDL principles enables the implementation of a universal compression algorithm capable of universal complexity as well as sophistication estimation. As described herein, the present techniques may be used to identify repetitive sequences, which may be of interest in a variety of fields, such as cryptography or biological research. In particular, any field in which pattern recognition is a component may benefit from the techniques described herein.
- While only certain features of the invention have been illustrated and described herein, many modifications and changes will occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.
Claims (26)
1. A method for identifying a sequence of interest, comprising the steps of:
analyzing a data series based on a grammar comprising at least an initial grammar;
calculating a statistical heuristic for each sub-sequence of the analyzed data series;
comparing a selected statistical heuristic with one or more reference conditions;
updating the grammar and the data series with a symbol representing a sequence corresponding to the selected statistical heuristic based upon a non-termination result of the comparison; and
identifying the sequence as a sequence of interest based upon a termination result of the comparison.
2. The method, as recited in claim 1 , further comprising the step of iterating the steps of analyzing, calculating, comparing, and updating until a termination result of the comparison occurs.
3. The method, as recited in claim 1 , wherein the data series comprises one of a deoxyribonucleic acid sequence, a ribonucleic acid sequence, and an amino acid sequence.
4. The method, as recited in claim 1 , wherein calculating the statistical heuristic for each sub-sequence comprises computing the statistical heuristic using a minimum sufficient statistics algorithm.
5. The method, as recited in claim 1 , wherein calculating the statistical heuristic for each sub-sequence comprises computing a symbol compression ratio using an optimal symbol compression ratio algorithm.
6. The method, as recited in claim 1 , wherein the sequence of interest comprises a repeated sequence within the data series.
7. The method, as recited in claim 1 , wherein the statistical heuristic comprises a minimum description length heuristic and wherein the selected statistical heuristic comprises the minimum description length heuristic having the lowest value.
8. The method, as recited in claim 1 , wherein the statistical heuristic comprises a symbol compression ratio and wherein the selected statistical heuristic comprises the symbol compression ratio having the lowest value.
9. A tangible, machine-readable media, comprising:
code adapted to analyze a data series based on a grammar comprising at least an initial grammar;
code adapted to calculate a statistical heuristic for each sub-sequence of the analyzed data series;
code adapted to compare a selected statistical heuristic with one or more reference conditions;
code adapted to update the grammar and the data series with a symbol representing a sequence corresponding to the selected statistical heuristic based upon a non-termination result of the comparison; and
code adapted to identify the sequence as a sequence of interest based upon a termination result of the comparison.
10. The tangible medium, as recited in claim 9 , comprising code adapted to iterate the steps of analyzing, calculating, comparing, and updating until a termination result of the comparison occurs.
11. The tangible medium, as recited in claim 9 , wherein the code adapted to calculate the statistical heuristic for each sub-sequence computes the statistical heuristic using a minimum sufficient statistics algorithm.
12. The tangible medium, as recited in claim 9 , wherein the code adapted to calculate the statistical heuristic for each sub-sequence computes a symbol compression ratio using an optimal symbol compression ratio algorithm.
13. The tangible medium, as recited in claim 9 , wherein the statistical heuristic comprises a minimum description length heuristic and wherein the selected statistical heuristic comprises the minimum description length heuristic having the lowest value.
14. The tangible medium, as recited in claim 9 , wherein the statistical heuristic comprises a symbol compression ratio and wherein the selected statistical heuristic comprises the symbol compression ratio having the lowest value.
15. A method for processing a data series, the method comprising:
specifying a data series for analysis;
executing one or more routines configured to analyze the data series based on minimum description length principles; and
obtaining the analyzed data series comprising at least one sequence of interest.
16. The method as recited in claim 15 , wherein specifying a data series comprises specifying one of a deoxyribonucleic acid sequence, a ribonucleic acid sequence, and an amino acid sequence.
17. The method as recited in claim 15 , wherein the one or more executed routines implement one or more sufficient statistics algorithms.
18. The method as recited in claim 15 , wherein the one or more executed routines implement one or more optimal symbol compression algorithms.
19. The method as recited in claim 15 , wherein the at least one sequence of interest comprises a repeated sequence within the data series.
20. A method for identifying a biological sequence of interest, the method comprising:
analyzing a biological polymer sequence based on a grammar comprising at least an initial grammar;
calculating a minimum description length heuristic for each sub-sequence of the analyzed biological polymer sequence;
comparing a selected minimum description length heuristic with one or more reference conditions;
updating the grammar and the biological polymer sequence with a symbol representing a sub-sequence corresponding to the selected minimum description length heuristic based upon a non-termination result of the comparison; and
identifying the sub-sequence as a biological sequence of interest based upon a termination result of the comparison.
21. The method, as recited in claim 20 , wherein the biological polymer comprises one of a deoxyribonucleic acid, a ribonucleic acid, and a peptide.
22. The method, as recited in claim 20 , further comprising iterating the steps of analyzing, calculating, comparing, and updating until a termination result of the comparison occurs.
23. The method, as recited in claim 20 , wherein the minimum description length heuristic comprises a symbol compression ratio and the selected minimum description length heuristic comprises the symbol compression ratio having the lowest value.
24. A tangible, machine-readable media, comprising:
code adapted to analyze a biological polymer sequence based on a grammar comprising at least an initial grammar;
code adapted to calculate a minimum description length heuristic for each sub-sequence of the analyzed biological polymer sequence;
code adapted to compare a selected minimum description length heuristic with one or more reference conditions;
code adapted to update the grammar and the biological polymer sequence with a symbol representing a sub-sequence corresponding to the selected minimum description length heuristic based upon a non-termination result of the comparison; and
code adapted to identify the sub-sequence as a biological sequence of interest based upon a termination result of the comparison.
25. The tangible medium, as recited in claim 24 , comprising:
code adapted to iterate the steps of analyzing, calculating, comparing, and updating until a termination result of the comparison occurs.
26. The tangible medium, as recited in claim 24 , wherein the minimum description length heuristic comprises a symbol compression ratio and the selected minimum description length heuristic comprises the symbol compression ratio having the lowest value.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/858,744 US20050273274A1 (en) | 2004-06-02 | 2004-06-02 | Method for identifying sub-sequences of interest in a sequence |
PCT/US2005/017552 WO2005122022A2 (en) | 2004-06-02 | 2005-05-19 | Method for identifying sub-sequences of interest in a sequence |
US12/043,609 US8046173B2 (en) | 2004-06-02 | 2008-03-06 | Method for identifying sub-sequences of interest in a sequence |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/858,744 US20050273274A1 (en) | 2004-06-02 | 2004-06-02 | Method for identifying sub-sequences of interest in a sequence |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/043,609 Division US8046173B2 (en) | 2004-06-02 | 2008-03-06 | Method for identifying sub-sequences of interest in a sequence |
Publications (1)
Publication Number | Publication Date |
---|---|
US20050273274A1 true US20050273274A1 (en) | 2005-12-08 |
Family
ID=35385462
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US10/858,744 Abandoned US20050273274A1 (en) | 2004-06-02 | 2004-06-02 | Method for identifying sub-sequences of interest in a sequence |
US12/043,609 Active 2026-07-15 US8046173B2 (en) | 2004-06-02 | 2008-03-06 | Method for identifying sub-sequences of interest in a sequence |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/043,609 Active 2026-07-15 US8046173B2 (en) | 2004-06-02 | 2008-03-06 | Method for identifying sub-sequences of interest in a sequence |
Country Status (2)
Country | Link |
---|---|
US (2) | US20050273274A1 (en) |
WO (1) | WO2005122022A2 (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090007267A1 (en) * | 2007-06-29 | 2009-01-01 | Walter Hoffmann | Method and system for tracking authorship of content in data |
US20090055425A1 (en) * | 2007-08-24 | 2009-02-26 | General Electric Company | Sequence identification and analysis |
US20100107253A1 (en) * | 2008-10-29 | 2010-04-29 | Eiland Edward E | Mdl compress system and method for signature inference and masquerade intrusion detection |
US20100107254A1 (en) * | 2008-10-29 | 2010-04-29 | Eiland Edward E | Network intrusion detection using mdl compress for deep packet inspection |
US20100312755A1 (en) * | 2006-10-07 | 2010-12-09 | Eric Hildebrandt | Method and apparatus for compressing and decompressing digital data by electronic means using a context grammar |
US20110067106A1 (en) * | 2009-09-15 | 2011-03-17 | Scott Charles Evans | Network intrusion detection visualization |
US20110066409A1 (en) * | 2009-09-15 | 2011-03-17 | Lockheed Martin Corporation | Network attack visualization and response through intelligent icons |
US9106689B2 (en) | 2011-05-06 | 2015-08-11 | Lockheed Martin Corporation | Intrusion detection using MDL clustering |
US20170101676A1 (en) * | 2015-10-07 | 2017-04-13 | Illumina, Inc. | Off-target capture reduction in sequencing techniques |
US9805312B1 (en) * | 2013-12-13 | 2017-10-31 | Google Inc. | Using an integerized representation for large-scale machine learning data |
US10917109B1 (en) * | 2020-03-06 | 2021-02-09 | Centre National De La Recherche Scientifique | Methods for storing digital data as, and for transforming digital data into, synthetic DNA |
US11882201B2 (en) * | 2022-03-30 | 2024-01-23 | Itron, Inc. | Data compression techniques for efficient network management |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9996543B2 (en) | 2016-01-06 | 2018-06-12 | International Business Machines Corporation | Compression and optimization of a specified schema that performs analytics on data within data systems |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5136686A (en) * | 1990-03-28 | 1992-08-04 | Koza John R | Non-linear genetic algorithms for solving problems by finding a fit composition of functions |
US6373971B1 (en) * | 1997-06-12 | 2002-04-16 | International Business Machines Corporation | Method and apparatus for pattern discovery in protein sequences |
US6473757B1 (en) * | 2000-03-28 | 2002-10-29 | Lucent Technologies Inc. | System and method for constraint based sequential pattern mining |
US20040249574A1 (en) * | 2001-03-30 | 2004-12-09 | Naftali Tishby | Markovian domain fingerprinting in statistical segmentation of protein sequences |
US7043371B2 (en) * | 2002-08-16 | 2006-05-09 | American Museum Of Natural History | Method for search based character optimization |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5455577A (en) * | 1993-03-12 | 1995-10-03 | Microsoft Corporation | Method and system for data compression |
AU1514795A (en) * | 1993-12-30 | 1995-08-01 | Connectix Corporation | Lossless data compression system and method |
US5703581A (en) * | 1996-06-14 | 1997-12-30 | Lucent Technologies Inc. | Method and apparatus for data compression and decompression |
US5977890A (en) * | 1997-06-12 | 1999-11-02 | International Business Machines Corporation | Method and apparatus for data compression utilizing efficient pattern discovery |
-
2004
- 2004-06-02 US US10/858,744 patent/US20050273274A1/en not_active Abandoned
-
2005
- 2005-05-19 WO PCT/US2005/017552 patent/WO2005122022A2/en active Application Filing
-
2008
- 2008-03-06 US US12/043,609 patent/US8046173B2/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5136686A (en) * | 1990-03-28 | 1992-08-04 | Koza John R | Non-linear genetic algorithms for solving problems by finding a fit composition of functions |
US6373971B1 (en) * | 1997-06-12 | 2002-04-16 | International Business Machines Corporation | Method and apparatus for pattern discovery in protein sequences |
US6473757B1 (en) * | 2000-03-28 | 2002-10-29 | Lucent Technologies Inc. | System and method for constraint based sequential pattern mining |
US20040249574A1 (en) * | 2001-03-30 | 2004-12-09 | Naftali Tishby | Markovian domain fingerprinting in statistical segmentation of protein sequences |
US7043371B2 (en) * | 2002-08-16 | 2006-05-09 | American Museum Of Natural History | Method for search based character optimization |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100312755A1 (en) * | 2006-10-07 | 2010-12-09 | Eric Hildebrandt | Method and apparatus for compressing and decompressing digital data by electronic means using a context grammar |
US7849399B2 (en) * | 2007-06-29 | 2010-12-07 | Walter Hoffmann | Method and system for tracking authorship of content in data |
US20090007267A1 (en) * | 2007-06-29 | 2009-01-01 | Walter Hoffmann | Method and system for tracking authorship of content in data |
US20090055425A1 (en) * | 2007-08-24 | 2009-02-26 | General Electric Company | Sequence identification and analysis |
US7809765B2 (en) | 2007-08-24 | 2010-10-05 | General Electric Company | Sequence identification and analysis |
US8312542B2 (en) | 2008-10-29 | 2012-11-13 | Lockheed Martin Corporation | Network intrusion detection using MDL compress for deep packet inspection |
US20100107253A1 (en) * | 2008-10-29 | 2010-04-29 | Eiland Edward E | Mdl compress system and method for signature inference and masquerade intrusion detection |
US20100107254A1 (en) * | 2008-10-29 | 2010-04-29 | Eiland Edward E | Network intrusion detection using mdl compress for deep packet inspection |
US20100107255A1 (en) * | 2008-10-29 | 2010-04-29 | Eiland Edward E | Intrusion Detection Using MDL Compression |
US8375446B2 (en) * | 2008-10-29 | 2013-02-12 | Lockheed Martin Corporation | Intrusion detection using MDL compression |
US8327443B2 (en) * | 2008-10-29 | 2012-12-04 | Lockheed Martin Corporation | MDL compress system and method for signature inference and masquerade intrusion detection |
US8245301B2 (en) | 2009-09-15 | 2012-08-14 | Lockheed Martin Corporation | Network intrusion detection visualization |
US8245302B2 (en) | 2009-09-15 | 2012-08-14 | Lockheed Martin Corporation | Network attack visualization and response through intelligent icons |
US20110066409A1 (en) * | 2009-09-15 | 2011-03-17 | Lockheed Martin Corporation | Network attack visualization and response through intelligent icons |
US20110067106A1 (en) * | 2009-09-15 | 2011-03-17 | Scott Charles Evans | Network intrusion detection visualization |
US9106689B2 (en) | 2011-05-06 | 2015-08-11 | Lockheed Martin Corporation | Intrusion detection using MDL clustering |
US9805312B1 (en) * | 2013-12-13 | 2017-10-31 | Google Inc. | Using an integerized representation for large-scale machine learning data |
US20170101676A1 (en) * | 2015-10-07 | 2017-04-13 | Illumina, Inc. | Off-target capture reduction in sequencing techniques |
US10577643B2 (en) * | 2015-10-07 | 2020-03-03 | Illumina, Inc. | Off-target capture reduction in sequencing techniques |
US11624084B2 (en) | 2015-10-07 | 2023-04-11 | Illumina, Inc. | Off-target capture reduction in sequencing techniques |
US10917109B1 (en) * | 2020-03-06 | 2021-02-09 | Centre National De La Recherche Scientifique | Methods for storing digital data as, and for transforming digital data into, synthetic DNA |
US11882201B2 (en) * | 2022-03-30 | 2024-01-23 | Itron, Inc. | Data compression techniques for efficient network management |
Also Published As
Publication number | Publication date |
---|---|
WO2005122022A3 (en) | 2006-03-30 |
US20080270439A1 (en) | 2008-10-30 |
US8046173B2 (en) | 2011-10-25 |
WO2005122022A2 (en) | 2005-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8046173B2 (en) | Method for identifying sub-sequences of interest in a sequence | |
US7809765B2 (en) | Sequence identification and analysis | |
Li et al. | The similarity metric | |
Siepel et al. | Combining phylogenetic and hidden Markov models in biosequence analysis | |
Delgrange et al. | STAR: an algorithm to search for tandem approximate repeats | |
Ebeling et al. | Partition-based entropies of deterministic and stochastic maps | |
US20240312567A1 (en) | Efficient payload extraction from polynucleotide sequence reads | |
Comin et al. | Fast entropic profiler: An information theoretic approach for the discovery of patterns in genomes | |
Lifshits et al. | Speeding up HMM decoding and training by exploiting sequence repetitions | |
US20040199931A1 (en) | Method and system for pattern matching | |
US20230053844A1 (en) | Improved Quality Value Compression Framework in Aligned Sequencing Data Based on Novel Contexts | |
US11507476B2 (en) | Noise estimation method, non-transitory computer-readable storage medium, and noise estimation apparatus | |
Ryabko et al. | Information-theoretic method for classification of texts | |
Yin et al. | PREMIER—PRobabilistic error-correction using Markov inference in errored reads | |
Mozes et al. | Speeding up HMM decoding and training by exploiting sequence repetitions | |
Jagota et al. | Comparing a hidden Markov model and a stochastic context-free grammar | |
Mignosi et al. | Forbidden factors and fragment assembly | |
Elishco et al. | The capacity of some Pólya string models | |
CN113704465B (en) | Text clustering method and device, electronic equipment and storage medium | |
Shmilovici et al. | Using a VOM model for reconstructing potential coding regions in EST sequences | |
Böer | Multiple alignment using hidden Markov models | |
Khetan et al. | An efficient approach towards compressed parameterized word matching using wavelet tree | |
Psomopoulos et al. | A finite state automata based technique for protein classification rules induction | |
García et al. | Copula based model correction for bivariate Bernoulli financial series | |
Leung et al. | An efficient motif discovery algorithm with unknown motif length and number of binding sites |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: GENERAL ELECTRIC COMPANY, NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:EVANS, SCOTT C.;BUSH, STEPHEN F.;TORRES, ANDREW S.;REEL/FRAME:015432/0972 Effective date: 20040602 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |