CN110569974B

CN110569974B - DNA storage layered representation and interweaving coding method capable of containing artificial base

Info

Publication number: CN110569974B
Application number: CN201810573636.3A
Authority: CN
Inventors: 陈为刚; 韩昌彩
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2018-06-06
Filing date: 2018-06-06
Publication date: 2021-08-24
Anticipated expiration: 2038-06-06
Also published as: CN110569974A

Abstract

The invention discloses a DNA storage layered representation and interweaving coding method which can contain artificial bases, belonging to the field of DNA storage reliable coding, wherein binary data is scrambled and divided into a plurality of subblocks after error correction, oligonucleotide molecules are composed from different coding subblocks in a layered mode, and when molecules are lost or bases have errors, the errors are dispersed to a plurality of error correction codes for error correction; the correspondence between layering and block codes is realized by adopting a bipartite graph with a large girth, so that a plurality of oligonucleotide molecules can form a whole, and the error correction capability and the DNA storage reliability are improved. Further, the artificial base pairs are included, the storage capacity of a single base is increased, each molecule can be decomposed into more layers, and the error correction performance and the storage efficiency are improved. The invention has the advantages that the artificial base can be contained, and the single-base storage density is improved; the storage reliability is ensured through the interleaving of layering and bipartite graph; a highly reliable and high-density DNA storage encoding method is provided.

Description

DNA storage layered representation and interweaving coding method capable of containing artificial base

Technical Field

The invention relates to the field of reliable data storage by using oligonucleotide molecules, in particular to a DNA storage hierarchical representation and interweaving coding method which can contain artificial bases.

Background

With the development of various information acquisition means, the existing information storage methods have been unable to meet the development speed of massive digital information generated by human beings. Currently, the main information storage media include: magnetic disks, optical disks, nonvolatile semiconductor memory, and the like. The storage modes mainly include independent storage, network storage, distributed storage, cloud storage and the like based on a computer. However, storage methods and technologies based on these conventional storage media have many challenges. First, it is predicted that the current primary storage approaches are difficult to meet the rapidly growing data storage needs. Second, the mainstream data storage methods currently include: cloud storage and data centers also face huge costs in terms of electricity, equipment, site, manpower, and the like. Third, these conventional storage media can be used for a lifetime generally less than several decades, and cannot satisfy the long-term storage and data backup requirements of important data, such as: important historical files, literature, etc. In addition, in the conventional storage methods based on magnetism, light, electricity, etc., there is also a great risk to the physical security and disaster recovery of the storage device, for example: if water inflow accidents happen to data stored in a common hard disk or a hard disk array, the data can be lost. On the other hand, the importance of data is increasing, and especially, data-based applications are developing, and data becomes an important asset. According to the relevant reports, the loss of data caused by 911 events has continued to have many companies shut down due to the loss of data assets.

Deoxyribonucleic acid (DNA) storage is a novel storage mode of storing digital information in a base sequence of DNA and taking artificially synthesized DNA molecules as storage media, has the advantages of high density, small volume and long storage time, and becomes a potential mass data storage method in the future. DNA storage can be divided into in vivo and in vitro storage, using four different bases of oligonucleotides, namely: adenine, thymine, cytosine and guanine, abbreviated as A, T, C and G respectively, represent data, and mass data storage is realized by DNA or oligonucleotide molecules containing different bases, which is a new data storage mode. With the continuous increase of information data generated by human activities and the continuous significance of the data, DNA storage becomes a potential mass data long-term storage method in the future.

The DNA storage adopts different bases { A, T, G, C } of a large number of oligonucleotide molecules to store data, and the current mainstream DNA synthesis method and sequencing method are based on simultaneous processing of a large number of oligonucleotide molecules and have the characteristic of great flux. However, in DNA storage, various types of errors are caused by processes such as oligonucleotide sequence synthesis, Polymerase Chain Reaction (PCR), and sequencing. Therefore, how to design a proper error correction coding technology to ensure the reliability of data storage in oligonucleotide molecules becomes an important issue.

The application of the error correction coding technology in DNA storage is expected to solve the problems of storage cost and reliability, and is a very potential technology. On one hand, the error rate of the current oligonucleotide molecule synthesis is high, and one synthesis error can occur in hundreds of bases generally; if the reliability of the synthesized base is further improved by a biochemical method, the synthesis yield is reduced, and the cost is greatly increased. Therefore, it is also an important method to reduce the synthesis cost of DNA storage systems by using efficient error correction coding techniques.

However, the design of error correction coding to ensure the reliability of DNA storage has its special features. First, another feature of oligonucleotide synthesis in DNA storage is that the currently mature technology facilitates the synthesis of oligonucleotide molecules of less than several hundred lengths, especially using high throughput chip synthesis techniques. The current assembly technology has higher cost for assembling long sequences, and short oligonucleotide molecules are mostly adopted for DNA storage. Oligonucleotide molecules are limited in length and use of separate channel coding creates greater redundancy for reliability. Secondly, there is a bias of different molecules during DNA synthesis, amplification and sequencing, which results in the loss of some molecules and affects storage reliability. At present, DNA synthesis and sequencing technologies may cause more base errors, and error correction coding technology must be adopted to realize the reliability of data storage. The present researchers have proposed various methods for storing and encoding DNA based on error correction coding. Proposed storage coding schemes include the use of repetition codes, reed-solomon (RS) codes, fountain codes, and the like. Researchers at the university of columbia in 2017, for example, proposed a fountain coding method to correct the loss of oligonucleotide molecules; the scheme proposed by microsoft and washington university adopts very long RS codes to correct various errors and deletions caused by complete deletion of oligonucleotide molecules; goldman et al, the institute of bioinformatics, Europe, propose coding methods that use repetition coding to ensure DNA storage reliability. Other error correction methods using RS codes have been proposed by some researchers. However, none of these error correction coding methods utilizes the correlation between the error characteristics of a plurality of bits corresponding to each base symbol, and thus the matching between the error correction coding method and the error characteristics in DNA storage cannot be sufficiently achieved.

Further, ensuring high efficiency of DNA storage is also an important aspect of DNA storage considerations. The traditional DNA storage technology utilizes A, T, C and G four bases, and the number of binary bits which can be represented by each base is 2 bits; in order to improve the storage density of DNA, additional artificial bases can be added, and the data storage density of each base can be improved by utilizing { A, T, G, C } and artificial bases for data storage, and the biological safety problem of DNA storage can be solved. Researchers like F Romesberg, the Scripps research institute of California 2017 broken the constraint of { A, T, G, C } four bases, used for the first time in X-Y base pairs and corresponding amino acids which do not exist in Nature synthesized in laboratories, created in laboratories a completely new life entity containing 6 bases of { A, T, G, C, X, Y }, and the result was published in Nature. Additional artificial base pairs have also been proposed by some researchers, such as P-Z base pairs, V-J base pairs, X-K base pairs, isoG-isoC base pairs, and the like. Therefore, with the continuous development of artificial base theory and technology, more base pairs are used for data storage of DNA, and the method has the advantages of higher storage density, capability of being distinguished from the existing DNA of a living body, and wide application prospect.

In order to solve the above problems, the inventors of the present invention have made studies to solve the problem that a plurality of bits expressed by bases of the same molecule have a correlation when an error occurs, and have proposed a DNA storage coding method based on oligonucleotide hierarchical expression and interleave coding. In the DNA storage scheme, synthesis, amplification and sequencing are carried out, and each base symbol is a whole; if an error occurs, a substitution, insertion or deletion (deletion) of the entire base will occur. Due to the preference in synthesis, sequencing and amplification, imbalances in the various types of oligonucleotide molecules in the DNA pool can result, or even complete deletions of some of the oligonucleotide molecules.

Disclosure of Invention

The invention provides a method for jointly carrying out error correction coding on data corresponding to a plurality of oligonucleotide molecules in DNA storage and mapping coding segments to different layers of the oligonucleotide molecules, which overcomes the problems of strong correlation of the same base or the same oligonucleotide when the oligonucleotide molecules are lost and some bases have point errors, can fully exert the potential of error correction codes and improve the reliability of DNA storage, and is described in detail as follows:

a DNA storage hierarchical representation and an interleaved encoding method that may comprise artificial bases, the encoding method comprising the steps of:

(1) scrambling the binary bit data of the user, and then carrying out error correction coding to obtain a series of grouped error correction codes with fixed length;

(2) dividing N bits contained in each packet of error correcting code into N sub-packets on average, wherein the number of bits contained in each sub-packet is N/N, wherein N is a positive integer and is a factor of N, and N/N is also an integer;

(3) determining the available oligonucleotide base set according to the biochemical technical method for synthesizing, sequencing and amplifying the tube nucleotide, wherein the size of the set is represented by a positive integer b;

(4) selecting N/N bases to form a group, and mapping each base to k bits { b }₁,b₂,…,b_kSequentially taking the ith bit of N/N basic groups to form a sequence to obtain the ith bit sequence with the length of N/N, wherein i is in the position of [1, k ]]Taking values;

(5) b block codes with the length of N are taken as a whole and comprise Bn sub-blocks, wherein B is a positive integer; the code words of the block code correspond to the molecules of M oligonucleotides; the connection between block code and oligonucleotide molecule adopts a regular bipartite graph G (V)₁,V₂And E) according to the connection relation E of the bipartite graph, combining k bit sequences with the length of N/N into a group of oligonucleotide molecular sequence segments with the length of N/N;

(6) adding sequences representing different molecular labels to each combined molecular sequence fragment for identifying different sequences, wherein the sequence length is L basic groups, adding primers at two ends, and the primer lengths are P basic groups respectively to form a complete oligonucleotide molecule with the length of N/N + L + 2P;

(7) synthesizing oligonucleotide molecules by a biochemical method, adopting single strands or double strands, and taking the single strands or the double strands of the oligonucleotide molecules as data storage media;

(8) when data is read, a certain number of oligonucleotide molecules are taken for amplification, and then sequencing data is preprocessed;

(9) and remapping the data portion corresponding to the oligonucleotide molecule into k bit sequences according to the preprocessed sequencing data.

Further, a type of node set V of the bipartite graph₁Representing B block codes, and the degree of each node is n; another type of node set V of bipartite graph₂Represents M oligonucleotide molecules, and the degree of the node is k according to the difference of the adopted base sets; any bipartite graph with the above node degrees is used, with Bn ═ Mk.

Further, the molecular label is represented by binary, a single block error correcting code is adopted, the length of the block code is Lk bits, and then the block code is mapped into L basic groups; l and P are positive integers, and L is determined according to the number of molecules in the same DNA pool and the adopted error correcting code; p is determined by the need for molecular amplification.

Remapping the data part corresponding to the oligonucleotide molecule into k bit sequences according to the preprocessed sequencing data;

if a certain oligonucleotide molecule is lost due to a small number or label fragment errors, marking the subblocks of the k block codes corresponding to the block as deleted according to the bipartite graph connection relation, and adopting a deletion error correction method when the block codes correct errors so as to improve the error correction performance;

if some segment is not lost, it is mapped to the determined position directly according to the connection relation of bipartite graph, so as to correct deletion and error decoding of the block code, and obtain the decoding result, and carry out the operation of removing the scrambling code, and obtain the original binary data.

The technical scheme provided by the invention has the beneficial effects that:

1. the invention mainly aims at the in vitro storage of DNA, namely a group of synthesized oligonucleotide molecules is utilized, and the high-flux writing and reading of data on oligonucleotide bases are realized by means of a high-flux synthesis technology, an amplification technology and a sequencing technology in the field of biotechnology, so that the invention has the advantages of high density, small volume and long storage duration;

2. the invention adopts the layered representation and the interweaving coding method of the oligonucleotide molecules, and codes a plurality of molecules together, thereby greatly improving the coding efficiency and the reliability; meanwhile, the hierarchical representation method can also eliminate the correlation among a plurality of bit errors caused by base errors, and further ensure the reliability of DNA storage.

3. The invention converts each base symbol into a plurality of bits, for example, when the number of the base in the used base system is 4, two bits can be decomposed; if the number of bases is 8, it can be decomposed into 3 bits; then, each base in a molecule is decomposed into a plurality of bits, and the entire molecular sequence can be decomposed into a plurality of bit sequences. Further, the plurality of bit sequences are respectively associated with a certain sub-packet of a long channel code. By this method, many oligonucleotide molecules are subjected to error correction processing as a whole, the capability of packet error correction coding can be fully exerted, and the reliability of DNA data storage is improved.

In summary, the present invention is characterized in that:

(1) the invention can comprise artificial base pairing, thereby improving the storage density of oligonucleotide molecules, for example, compared with a four-base system, an eight-base system can improve the storage density to 1.5 times of the original storage density, namely 2 bits are stored in each base and are improved to 3 bits;

(2) the long error correcting code is adopted to realize the joint coding of a plurality of molecules, so that the condition that a single molecule has serious errors or a molecule has no sequencing data at all can be effectively resisted;

(3) the interleaving coding method of molecular layering representation and participation of a plurality of block codes is adopted, so that the correlation of errors of each base symbol is effectively solved, and the error correction capability of the error correction code can be fully exerted;

(4) the mode of storing the DNA containing the artificial base has better biological safety.

Drawings

FIG. 1 is a block diagram of a DNA storage hierarchy representation and interleaving codes that may contain artificial bases;

FIG. 2 is a mapping method of bit sequence groupings and oligonucleotide molecules;

FIG. 3 is a bipartite graph of the connection relationship between B block codes and M molecules;

fig. 4 is a read-side erasure marking method supporting erasure decoding of block codes.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.

Example 1

The basic idea of the DNA storage hierarchical representation and interleaving coding method which can be designed by the embodiment of the invention and comprises the artificial bases is as follows: in order to ensure the reliability of the data stored by each oligonucleotide molecule for storing data, the data of the molecule is divided into a plurality of layers, and each layer participates in different block code coding, so that various errors in DNA storage are dispersed into different block codes for correction, thereby eliminating the correlation of the errors and fully exerting the error correction capability of the block codes; furthermore, by associating a plurality of molecules together in this way to form a single process, the reliability of DNA storage can be improved.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings:

as shown in FIG. 1, the method for storing layered representation and interleaving coding by using DNA containing artificial bases comprises the following steps:

wherein, the block code selects one of algebraic block code, Low Density Parity Check (LDPC) code, Turbo code, cascade code, product code and nonlinear block code; the code length of the block code as a whole is N bits.

(3) determining the available oligonucleotide base set according to the biochemical technological method for oligonucleotide synthesis, sequencing and amplification, wherein the size of the set is represented by a positive integer b;

where an oligonucleotide consisting of 4 bases { a, T, G, C } is used, b is 4, where a represents adenine, T represents thymine, G represents guanine, and C represents cytosine.

It is also possible to use artificial bases comprising currently existing or newly developed in the future, or to use a set of artificial bases entirely, so that the number of available bases is 4, 8 or 16, or even more; b is the k power of 2, and k is a positive integer greater than or equal to 2.

(5) b block codes with the length of N are taken as a whole and comprise Bn sub-blocks, wherein B is a positive integer; the code words of the block code correspond to the molecules of M oligonucleotides, M also being a positive integer;

the connection between block code and oligonucleotide molecule adopts a regular bipartite graph G (V)₁,V₂And E) represents a set of nodes V of a class of bipartite graph₁Representing B block codes, and the degree of each node is n; another type of node set V of bipartite graph₂Represents M oligonucleotide molecules, and the degree of the node is k according to the difference of the adopted base sets; any bipartite graph conforming to the node degree above is used, with Bn ═ Mk; according to the connection relation E of the bipartite graph, bit sequences with the length of N/N from the k block codes form a group of oligonucleotide molecular sequence segments, and the length of the molecular sequence segments is N/N.

(6) Adding sequences representing different molecular labels to each combined molecular sequence fragment for identifying different sequences, wherein the sequence length is L basic groups, and adding primers at two ends, and the primer lengths are P basic groups respectively, so that a complete oligonucleotide molecule is formed, and the length is N/N + L + 2P;

wherein, the molecular label adopts binary representation, can adopt the single grouping error correcting code, the length of the grouping code is Lk bit, then map it to L basic groups; l and P are positive integers, and the L is determined according to the number of molecules in the same DNA pool and by adopting an error correcting code; p is determined according to the requirement of molecular amplification;

(7) synthesizing oligonucleotide molecules by a biochemical method, wherein single strands or double strands can be adopted, and the single strands or the double strands of the oligonucleotide molecules are used as data storage media;

(8) when data is read, a certain amount of oligonucleotide molecules are taken for amplification, the amplification can adopt Polymerase Chain Reaction (PCR) of the oligonucleotide, and other amplification methods can also be adopted for amplification, and then sequencing is carried out; preprocessing sequencing data;

(9) remapping the data portion corresponding to the oligonucleotide molecule into k bit sequences according to the preprocessed sequencing data; if a certain oligonucleotide molecule is lost due to a small number or label fragment errors, marking the subblocks of the k block codes corresponding to the block as deleted according to the defined bipartite graph connection relation, and adopting a deletion error correction method when the block codes are corrected, thereby improving the error correction performance; if a certain segment is not lost, no matter whether an error exists or not, the segment is directly mapped to a determined position according to the connection relation of the bipartite graph; further, the block code is subjected to erasure correction and error decoding to obtain a decoding result; further, the decoding result is subjected to descrambling operation to obtain original binary data.

Wherein, in the step (1), the user binary bit data is scrambled and then error correction coded to obtain a series of packet error correction codes with fixed length; the block code selects one of algebraic block code, low-density parity check code, Turbo code, cascade code, product code and nonlinear block code; the code length of the block code as a whole is N bits, and the specific method is as follows:

(1.1) scrambling the binary data of the user, and if the data length of the user is K bits, performing bit-by-bit exclusive OR by adopting a pseudo-random sequence with the same length;

(1.2) coding by adopting a block code to obtain coded N bits;

in the present invention, the block code may be any block code, including linear codes or non-linear codes; multi-level block code coding is also possible, such as Turbo code, serial concatenated code, product code, etc.; the method also includes a cascade coding method which adopts a code with a larger length and then a short-length block code, such as a coding method of a Turbo-Hadamard code and an LDPC-Hadamard code type; other error correction codes are also included, such as codes of the Davey-Mackay architecture, which are overall non-linear block codes capable of correcting various types of errors; the above-described encoding methods can be applied to the present invention.

As shown in FIG. 2, the N bits contained in each error correction code packet are divided into N sub-packets on average in steps (2) to (5) of the present invention, and further assembled into oligonucleotide molecules by:

(1) dividing each block code into 3 blocks, wherein n is 3; a bit sequence of length N/3 from two block codes organized in one molecule;

(2) in FIG. 2, bit sequence a and bit sequence b from block code 1 and block code 2 are further combined, a 4-base system consisting of { A, T, G, C } bases is selected, and two bit blocks a and b constitute an oligonucleotide molecule fragment according to the mapping relationship between bases and bit pairs shown in FIG. 2;

as shown in fig. 3, step 5 described in the present invention is to take B block codes with length N as a whole, which include Bn sub-blocks, where B is a positive integer; the code words of the block code correspond to M oligonucleotide molecules, and M is also a positive integer; the connection between block code and oligonucleotide molecule adopts a regular bipartite graph G (V)₁,V₂And E) represents a set of nodes V of a class of bipartite graph₁Representing B block codes, and the degree of each node is n; another type of node set V of bipartite graph₂Represents M oligonucleotide molecules, and the degree of the node is k according to the difference of the adopted base sets; any bipartite graph conforming to the node degree above is used, with Bn ═ Mk; according to the connection relation E of the bipartite graph, the bit sequences with the length of N/N from k block codes are formedA group of oligonucleotide molecule sequence fragments, the length of the molecule sequence fragment is N/N, the specific method is:

(5.1) determining the first type node V of the bipartite graph according to the number B of the participated block codes and the number n of the sub-blocks decomposed by the block codes₁The number of the corresponding check nodes is B, and the degree of each node is n;

(5.2) according to the number k of layers which can be formed by using the base system, the invention uses the base system with the base number of 2 raised to the power of k, so that k is a positive integer;

(5.3) to ensure that a suitable bipartite graph G (V) can be constructed₁,V₂E), requiring k to be a factor of Bn;

(5.4) construction of bipartite graph G (V) by mathematical means₁,V₂E), class V of nodes of the bipartite graph₁The number of the check nodes is B, and the degree of the nodes is n; another set of nodes, called variable nodes, using V₂Representing that the number is M, and the node degree is k; in order to ensure the superior performance, the bipartite graph in the invention can be constructed by various methods for constructing the bipartite graph in mathematics, and the constructed bipartite graph is required to have larger girth or better expansibility;

(5.5) each side of the bipartite graph, one end of which is associated with a sub-block of the block code and the other end of which is associated with a layer of bit sequence of a molecule, completing the mapping of block code data to oligonucleotide molecule fragments; edges connected to the same check node are associated to different subblocks of the same block code, and edges connected to the same variable node are associated to different layers of the same molecule;

(5.6) obtaining multi-layered bit data of the assembled molecules according to the above steps, further composing oligonucleotide molecule fragments.

Adding sequences which represent different molecular labels to each combined molecular sequence fragment as described in the step (6) for identifying different sequences, wherein the sequence length is L basic groups, and adding primers at two ends, wherein the length of the primers is P basic groups respectively, so as to form a complete oligonucleotide molecule, and the length is N/N + L + 2P; the molecular label is represented by binary, and can adopt a single grouping error correcting code, the length of the grouping code is Lk bits, and then the grouping code is mapped into L basic groups; l and P are positive integers, and the L is determined according to the number of molecules and by adopting an error correcting code; p is determined according to the requirement of molecular amplification; the specific method comprises the following steps:

(6.1) determining the range of the label according to the number of different oligonucleotide molecular species in the DNA synthesis pool, and representing the label by binary bit groups with proper length; e.g., the number of molecules is Q, Q bits can be used for representation,

here, the

Represents rounding up; then, coding the binary label by adopting error correction coding with a short code length, or not adopting error correction coding, and obtaining the length of a bit sequence as kL;

(6.2) mapping by the same method as the data part to obtain a base sequence, wherein the length of the base sequence is L;

(6.3) embedding the base sequence representing the molecular marker in a molecular sequence fragment representing the data, which can be ligated to the head, tail or middle of the fragment representing the molecular sequence, or can be dispersed and embedded, and the placement position can be optimized according to the error characteristics of synthesis, amplification and sequencing;

(6.4) adding primers for amplification to form a complete N/N + L +2P oligonucleotide molecule.

Example 2

The embodiment of the invention designs a DNA storage layered representation and interweaving coding method which can contain artificial bases, and combines the method

The examples further illustrate the superior performance achieved by the process of the present invention.

In this embodiment, it is assumed that each length-N molecule code is divided into 4 groups, and each group corresponds to one of two layers of four different molecules; this example uses a conventional four base system of { A, T, G, C } molecules, each divided into two layers. A bipartite graph is constructed, wherein the degree of one type of nodes is 4, and the degree of the other type of nodes is 2, and the bipartite graph can be obtained by using a high-dimensional long 4-rule graph.

As shown in FIG. 4, this example focuses on the distribution of errors caused by synthesis, amplification, and sequencing within different block codes. In DNA storage, some molecules are lost due to the molecular preference of each processing step. And a certain molecule loss can cause more burst errors, often exceeds the error correction capability of the error correction code, and causes a corresponding certain error correction code to fail.

If a molecule is lost, it is based on the bipartite graph G (V)₁,V₂And E), the connection relation can be easily marked as deletion, so that an erasure decoding method can be adopted to ensure the whole error correction capability of the error correction code. Meanwhile, a small number of base errors in the molecule can also be dispersed to different block codes for error correction, so that the error correction capability of the block codes is improved in probability, and the principle is shown in fig. 4.

In summary, the embodiment of the present invention has the following features through the above design:

(1) the invention can contain artificial base pairs, thereby breaking the limitation of a four-base system, and along with the development of synthetic biology technology, the method can improve the storage density of oligonucleotide molecules; for example, compared with a four-base system, the storage density can be improved by 1.5 times by adopting an eight-base system, namely 2 bits are stored in each base and converted into 3 bits;

(2) the invention can adopt relatively longer error correcting codes to realize the joint coding of a plurality of molecules, and can effectively resist the condition that a single molecule is seriously wrong or is lost without sequencing data; as the number of partitions increases, the impact of single molecule errors on overall performance further decreases;

(3) by adopting the method of layered transmission of each molecule, the relevance of errors of each base symbol is effectively solved, so that the error correction capability of the error correction code can be fully exerted;

(4) the DNA storage mode containing the artificial base is adopted to distinguish the DNA of the existing living system, and the biological safety is better.

Thus, the present invention can be applied to a DNA storage system comprising a plurality of sets of artificial base pairs. For ease of design, the present invention is directed to the use of a base number of powers of 2, e.g., 4 bases, 8 bases or more.

In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.

Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for storing a hierarchical representation of DNA that may contain artificial bases and interleaving the coding, the method comprising the steps of:

(4) selecting N/N bases to form a group, and mapping each base to k bits { b }₁,b₂,…,b_kSequentially taking the ith bit of N/N basic groups to form a sequence to obtain the ith bit sequence with the length of N/N, wherein i is in the position of [1, k ]]Get betweenA value;

(9) remapping the data portion corresponding to the oligonucleotide molecule into k bit sequences according to the preprocessed sequencing data;

wherein the bipartite graph G (V)₁,V₂And E) is:

class-one node set V of bipartite graph₁Representing B block codes, and the degree of each node is n; another type of node set V of bipartite graph₂Represents M oligonucleotide molecules, and the degree of the node is k according to the difference of the adopted base sets; any bipartite graph using degrees consistent with nodes, with Bn ═ Mk;

one end of each edge of the bipartite graph is associated with a subblock of a block code, and the other end of each edge is associated with a layer of bit sequence of a molecule, so that mapping of block code data to oligonucleotide molecule fragments is completed; edges connected to the same check node are associated to different subblocks of the same block code, and edges connected to the same variable node are associated to different layers of the same molecule.

2. The DNA storage hierarchical representation and interleaving coding method which can contain artificial bases according to claim 1,

the molecular label is represented by binary, a single grouping error correcting code is adopted, the length of the grouping code is Lk bits, and then the grouping code is mapped into L basic groups; l and P are positive integers, and L is determined according to the number of molecules in the same DNA pool and the adopted error correcting code; p is determined by the need for molecular amplification.

3. The method for hierarchical representation and interleaved encoding of DNA storage that can contain artificial bases according to claim 1, wherein said remapping of the data portions corresponding to oligonucleotide molecules into k bit sequences based on preprocessed sequencing data is specifically: