[go: up one dir, main page]
More Web Proxy on the site http://driver.im/

CN110569974B - DNA storage layered representation and interweaving coding method capable of containing artificial base - Google Patents

DNA storage layered representation and interweaving coding method capable of containing artificial base Download PDF

Info

Publication number
CN110569974B
CN110569974B CN201810573636.3A CN201810573636A CN110569974B CN 110569974 B CN110569974 B CN 110569974B CN 201810573636 A CN201810573636 A CN 201810573636A CN 110569974 B CN110569974 B CN 110569974B
Authority
CN
China
Prior art keywords
code
oligonucleotide
storage
data
error correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810573636.3A
Other languages
Chinese (zh)
Other versions
CN110569974A (en
Inventor
陈为刚
韩昌彩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201810573636.3A priority Critical patent/CN110569974B/en
Publication of CN110569974A publication Critical patent/CN110569974A/en
Application granted granted Critical
Publication of CN110569974B publication Critical patent/CN110569974B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/123DNA computing
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/27Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes using interleaving techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a DNA storage layered representation and interweaving coding method which can contain artificial bases, belonging to the field of DNA storage reliable coding, wherein binary data is scrambled and divided into a plurality of subblocks after error correction, oligonucleotide molecules are composed from different coding subblocks in a layered mode, and when molecules are lost or bases have errors, the errors are dispersed to a plurality of error correction codes for error correction; the correspondence between layering and block codes is realized by adopting a bipartite graph with a large girth, so that a plurality of oligonucleotide molecules can form a whole, and the error correction capability and the DNA storage reliability are improved. Further, the artificial base pairs are included, the storage capacity of a single base is increased, each molecule can be decomposed into more layers, and the error correction performance and the storage efficiency are improved. The invention has the advantages that the artificial base can be contained, and the single-base storage density is improved; the storage reliability is ensured through the interleaving of layering and bipartite graph; a highly reliable and high-density DNA storage encoding method is provided.

Description

DNA storage layered representation and interweaving coding method capable of containing artificial base
Technical Field
The invention relates to the field of reliable data storage by using oligonucleotide molecules, in particular to a DNA storage hierarchical representation and interweaving coding method which can contain artificial bases.
Background
With the development of various information acquisition means, the existing information storage methods have been unable to meet the development speed of massive digital information generated by human beings. Currently, the main information storage media include: magnetic disks, optical disks, nonvolatile semiconductor memory, and the like. The storage modes mainly include independent storage, network storage, distributed storage, cloud storage and the like based on a computer. However, storage methods and technologies based on these conventional storage media have many challenges. First, it is predicted that the current primary storage approaches are difficult to meet the rapidly growing data storage needs. Second, the mainstream data storage methods currently include: cloud storage and data centers also face huge costs in terms of electricity, equipment, site, manpower, and the like. Third, these conventional storage media can be used for a lifetime generally less than several decades, and cannot satisfy the long-term storage and data backup requirements of important data, such as: important historical files, literature, etc. In addition, in the conventional storage methods based on magnetism, light, electricity, etc., there is also a great risk to the physical security and disaster recovery of the storage device, for example: if water inflow accidents happen to data stored in a common hard disk or a hard disk array, the data can be lost. On the other hand, the importance of data is increasing, and especially, data-based applications are developing, and data becomes an important asset. According to the relevant reports, the loss of data caused by 911 events has continued to have many companies shut down due to the loss of data assets.
Deoxyribonucleic acid (DNA) storage is a novel storage mode of storing digital information in a base sequence of DNA and taking artificially synthesized DNA molecules as storage media, has the advantages of high density, small volume and long storage time, and becomes a potential mass data storage method in the future. DNA storage can be divided into in vivo and in vitro storage, using four different bases of oligonucleotides, namely: adenine, thymine, cytosine and guanine, abbreviated as A, T, C and G respectively, represent data, and mass data storage is realized by DNA or oligonucleotide molecules containing different bases, which is a new data storage mode. With the continuous increase of information data generated by human activities and the continuous significance of the data, DNA storage becomes a potential mass data long-term storage method in the future.
The DNA storage adopts different bases { A, T, G, C } of a large number of oligonucleotide molecules to store data, and the current mainstream DNA synthesis method and sequencing method are based on simultaneous processing of a large number of oligonucleotide molecules and have the characteristic of great flux. However, in DNA storage, various types of errors are caused by processes such as oligonucleotide sequence synthesis, Polymerase Chain Reaction (PCR), and sequencing. Therefore, how to design a proper error correction coding technology to ensure the reliability of data storage in oligonucleotide molecules becomes an important issue.
The application of the error correction coding technology in DNA storage is expected to solve the problems of storage cost and reliability, and is a very potential technology. On one hand, the error rate of the current oligonucleotide molecule synthesis is high, and one synthesis error can occur in hundreds of bases generally; if the reliability of the synthesized base is further improved by a biochemical method, the synthesis yield is reduced, and the cost is greatly increased. Therefore, it is also an important method to reduce the synthesis cost of DNA storage systems by using efficient error correction coding techniques.
However, the design of error correction coding to ensure the reliability of DNA storage has its special features. First, another feature of oligonucleotide synthesis in DNA storage is that the currently mature technology facilitates the synthesis of oligonucleotide molecules of less than several hundred lengths, especially using high throughput chip synthesis techniques. The current assembly technology has higher cost for assembling long sequences, and short oligonucleotide molecules are mostly adopted for DNA storage. Oligonucleotide molecules are limited in length and use of separate channel coding creates greater redundancy for reliability. Secondly, there is a bias of different molecules during DNA synthesis, amplification and sequencing, which results in the loss of some molecules and affects storage reliability. At present, DNA synthesis and sequencing technologies may cause more base errors, and error correction coding technology must be adopted to realize the reliability of data storage. The present researchers have proposed various methods for storing and encoding DNA based on error correction coding. Proposed storage coding schemes include the use of repetition codes, reed-solomon (RS) codes, fountain codes, and the like. Researchers at the university of columbia in 2017, for example, proposed a fountain coding method to correct the loss of oligonucleotide molecules; the scheme proposed by microsoft and washington university adopts very long RS codes to correct various errors and deletions caused by complete deletion of oligonucleotide molecules; goldman et al, the institute of bioinformatics, Europe, propose coding methods that use repetition coding to ensure DNA storage reliability. Other error correction methods using RS codes have been proposed by some researchers. However, none of these error correction coding methods utilizes the correlation between the error characteristics of a plurality of bits corresponding to each base symbol, and thus the matching between the error correction coding method and the error characteristics in DNA storage cannot be sufficiently achieved.
Further, ensuring high efficiency of DNA storage is also an important aspect of DNA storage considerations. The traditional DNA storage technology utilizes A, T, C and G four bases, and the number of binary bits which can be represented by each base is 2 bits; in order to improve the storage density of DNA, additional artificial bases can be added, and the data storage density of each base can be improved by utilizing { A, T, G, C } and artificial bases for data storage, and the biological safety problem of DNA storage can be solved. Researchers like F Romesberg, the Scripps research institute of California 2017 broken the constraint of { A, T, G, C } four bases, used for the first time in X-Y base pairs and corresponding amino acids which do not exist in Nature synthesized in laboratories, created in laboratories a completely new life entity containing 6 bases of { A, T, G, C, X, Y }, and the result was published in Nature. Additional artificial base pairs have also been proposed by some researchers, such as P-Z base pairs, V-J base pairs, X-K base pairs, isoG-isoC base pairs, and the like. Therefore, with the continuous development of artificial base theory and technology, more base pairs are used for data storage of DNA, and the method has the advantages of higher storage density, capability of being distinguished from the existing DNA of a living body, and wide application prospect.
In order to solve the above problems, the inventors of the present invention have made studies to solve the problem that a plurality of bits expressed by bases of the same molecule have a correlation when an error occurs, and have proposed a DNA storage coding method based on oligonucleotide hierarchical expression and interleave coding. In the DNA storage scheme, synthesis, amplification and sequencing are carried out, and each base symbol is a whole; if an error occurs, a substitution, insertion or deletion (deletion) of the entire base will occur. Due to the preference in synthesis, sequencing and amplification, imbalances in the various types of oligonucleotide molecules in the DNA pool can result, or even complete deletions of some of the oligonucleotide molecules.
Disclosure of Invention
The invention provides a method for jointly carrying out error correction coding on data corresponding to a plurality of oligonucleotide molecules in DNA storage and mapping coding segments to different layers of the oligonucleotide molecules, which overcomes the problems of strong correlation of the same base or the same oligonucleotide when the oligonucleotide molecules are lost and some bases have point errors, can fully exert the potential of error correction codes and improve the reliability of DNA storage, and is described in detail as follows:
a DNA storage hierarchical representation and an interleaved encoding method that may comprise artificial bases, the encoding method comprising the steps of:
(1) scrambling the binary bit data of the user, and then carrying out error correction coding to obtain a series of grouped error correction codes with fixed length;
(2) dividing N bits contained in each packet of error correcting code into N sub-packets on average, wherein the number of bits contained in each sub-packet is N/N, wherein N is a positive integer and is a factor of N, and N/N is also an integer;
(3) determining the available oligonucleotide base set according to the biochemical technical method for synthesizing, sequencing and amplifying the tube nucleotide, wherein the size of the set is represented by a positive integer b;
(4) selecting N/N bases to form a group, and mapping each base to k bits { b }1,b2,…,bkSequentially taking the ith bit of N/N basic groups to form a sequence to obtain the ith bit sequence with the length of N/N, wherein i is in the position of [1, k ]]Taking values;
(5) b block codes with the length of N are taken as a whole and comprise Bn sub-blocks, wherein B is a positive integer; the code words of the block code correspond to the molecules of M oligonucleotides; the connection between block code and oligonucleotide molecule adopts a regular bipartite graph G (V)1,V2And E) according to the connection relation E of the bipartite graph, combining k bit sequences with the length of N/N into a group of oligonucleotide molecular sequence segments with the length of N/N;
(6) adding sequences representing different molecular labels to each combined molecular sequence fragment for identifying different sequences, wherein the sequence length is L basic groups, adding primers at two ends, and the primer lengths are P basic groups respectively to form a complete oligonucleotide molecule with the length of N/N + L + 2P;
(7) synthesizing oligonucleotide molecules by a biochemical method, adopting single strands or double strands, and taking the single strands or the double strands of the oligonucleotide molecules as data storage media;
(8) when data is read, a certain number of oligonucleotide molecules are taken for amplification, and then sequencing data is preprocessed;
(9) and remapping the data portion corresponding to the oligonucleotide molecule into k bit sequences according to the preprocessed sequencing data.
Further, a type of node set V of the bipartite graph1Representing B block codes, and the degree of each node is n; another type of node set V of bipartite graph2Represents M oligonucleotide molecules, and the degree of the node is k according to the difference of the adopted base sets; any bipartite graph with the above node degrees is used, with Bn ═ Mk.
Further, the molecular label is represented by binary, a single block error correcting code is adopted, the length of the block code is Lk bits, and then the block code is mapped into L basic groups; l and P are positive integers, and L is determined according to the number of molecules in the same DNA pool and the adopted error correcting code; p is determined by the need for molecular amplification.
Remapping the data part corresponding to the oligonucleotide molecule into k bit sequences according to the preprocessed sequencing data;
if a certain oligonucleotide molecule is lost due to a small number or label fragment errors, marking the subblocks of the k block codes corresponding to the block as deleted according to the bipartite graph connection relation, and adopting a deletion error correction method when the block codes correct errors so as to improve the error correction performance;
if some segment is not lost, it is mapped to the determined position directly according to the connection relation of bipartite graph, so as to correct deletion and error decoding of the block code, and obtain the decoding result, and carry out the operation of removing the scrambling code, and obtain the original binary data.
The technical scheme provided by the invention has the beneficial effects that:
1. the invention mainly aims at the in vitro storage of DNA, namely a group of synthesized oligonucleotide molecules is utilized, and the high-flux writing and reading of data on oligonucleotide bases are realized by means of a high-flux synthesis technology, an amplification technology and a sequencing technology in the field of biotechnology, so that the invention has the advantages of high density, small volume and long storage duration;
2. the invention adopts the layered representation and the interweaving coding method of the oligonucleotide molecules, and codes a plurality of molecules together, thereby greatly improving the coding efficiency and the reliability; meanwhile, the hierarchical representation method can also eliminate the correlation among a plurality of bit errors caused by base errors, and further ensure the reliability of DNA storage.
3. The invention converts each base symbol into a plurality of bits, for example, when the number of the base in the used base system is 4, two bits can be decomposed; if the number of bases is 8, it can be decomposed into 3 bits; then, each base in a molecule is decomposed into a plurality of bits, and the entire molecular sequence can be decomposed into a plurality of bit sequences. Further, the plurality of bit sequences are respectively associated with a certain sub-packet of a long channel code. By this method, many oligonucleotide molecules are subjected to error correction processing as a whole, the capability of packet error correction coding can be fully exerted, and the reliability of DNA data storage is improved.
In summary, the present invention is characterized in that:
(1) the invention can comprise artificial base pairing, thereby improving the storage density of oligonucleotide molecules, for example, compared with a four-base system, an eight-base system can improve the storage density to 1.5 times of the original storage density, namely 2 bits are stored in each base and are improved to 3 bits;
(2) the long error correcting code is adopted to realize the joint coding of a plurality of molecules, so that the condition that a single molecule has serious errors or a molecule has no sequencing data at all can be effectively resisted;
(3) the interleaving coding method of molecular layering representation and participation of a plurality of block codes is adopted, so that the correlation of errors of each base symbol is effectively solved, and the error correction capability of the error correction code can be fully exerted;
(4) the mode of storing the DNA containing the artificial base has better biological safety.
Drawings
FIG. 1 is a block diagram of a DNA storage hierarchy representation and interleaving codes that may contain artificial bases;
FIG. 2 is a mapping method of bit sequence groupings and oligonucleotide molecules;
FIG. 3 is a bipartite graph of the connection relationship between B block codes and M molecules;
fig. 4 is a read-side erasure marking method supporting erasure decoding of block codes.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
The basic idea of the DNA storage hierarchical representation and interleaving coding method which can be designed by the embodiment of the invention and comprises the artificial bases is as follows: in order to ensure the reliability of the data stored by each oligonucleotide molecule for storing data, the data of the molecule is divided into a plurality of layers, and each layer participates in different block code coding, so that various errors in DNA storage are dispersed into different block codes for correction, thereby eliminating the correlation of the errors and fully exerting the error correction capability of the block codes; furthermore, by associating a plurality of molecules together in this way to form a single process, the reliability of DNA storage can be improved.
Embodiments of the present invention are described in detail below with reference to the accompanying drawings:
as shown in FIG. 1, the method for storing layered representation and interleaving coding by using DNA containing artificial bases comprises the following steps:
(1) scrambling the binary bit data of the user, and then carrying out error correction coding to obtain a series of grouped error correction codes with fixed length;
wherein, the block code selects one of algebraic block code, Low Density Parity Check (LDPC) code, Turbo code, cascade code, product code and nonlinear block code; the code length of the block code as a whole is N bits.
(2) Dividing N bits contained in each packet of error correcting code into N sub-packets on average, wherein the number of bits contained in each sub-packet is N/N, wherein N is a positive integer and is a factor of N, and N/N is also an integer;
(3) determining the available oligonucleotide base set according to the biochemical technological method for oligonucleotide synthesis, sequencing and amplification, wherein the size of the set is represented by a positive integer b;
where an oligonucleotide consisting of 4 bases { a, T, G, C } is used, b is 4, where a represents adenine, T represents thymine, G represents guanine, and C represents cytosine.
It is also possible to use artificial bases comprising currently existing or newly developed in the future, or to use a set of artificial bases entirely, so that the number of available bases is 4, 8 or 16, or even more; b is the k power of 2, and k is a positive integer greater than or equal to 2.
(4) Selecting N/N bases to form a group, and mapping each base to k bits { b }1,b2,…,bkSequentially taking the ith bit of N/N basic groups to form a sequence to obtain the ith bit sequence with the length of N/N, wherein i is in the position of [1, k ]]Taking values;
(5) b block codes with the length of N are taken as a whole and comprise Bn sub-blocks, wherein B is a positive integer; the code words of the block code correspond to the molecules of M oligonucleotides, M also being a positive integer;
the connection between block code and oligonucleotide molecule adopts a regular bipartite graph G (V)1,V2And E) represents a set of nodes V of a class of bipartite graph1Representing B block codes, and the degree of each node is n; another type of node set V of bipartite graph2Represents M oligonucleotide molecules, and the degree of the node is k according to the difference of the adopted base sets; any bipartite graph conforming to the node degree above is used, with Bn ═ Mk; according to the connection relation E of the bipartite graph, bit sequences with the length of N/N from the k block codes form a group of oligonucleotide molecular sequence segments, and the length of the molecular sequence segments is N/N.
(6) Adding sequences representing different molecular labels to each combined molecular sequence fragment for identifying different sequences, wherein the sequence length is L basic groups, and adding primers at two ends, and the primer lengths are P basic groups respectively, so that a complete oligonucleotide molecule is formed, and the length is N/N + L + 2P;
wherein, the molecular label adopts binary representation, can adopt the single grouping error correcting code, the length of the grouping code is Lk bit, then map it to L basic groups; l and P are positive integers, and the L is determined according to the number of molecules in the same DNA pool and by adopting an error correcting code; p is determined according to the requirement of molecular amplification;
(7) synthesizing oligonucleotide molecules by a biochemical method, wherein single strands or double strands can be adopted, and the single strands or the double strands of the oligonucleotide molecules are used as data storage media;
(8) when data is read, a certain amount of oligonucleotide molecules are taken for amplification, the amplification can adopt Polymerase Chain Reaction (PCR) of the oligonucleotide, and other amplification methods can also be adopted for amplification, and then sequencing is carried out; preprocessing sequencing data;
(9) remapping the data portion corresponding to the oligonucleotide molecule into k bit sequences according to the preprocessed sequencing data; if a certain oligonucleotide molecule is lost due to a small number or label fragment errors, marking the subblocks of the k block codes corresponding to the block as deleted according to the defined bipartite graph connection relation, and adopting a deletion error correction method when the block codes are corrected, thereby improving the error correction performance; if a certain segment is not lost, no matter whether an error exists or not, the segment is directly mapped to a determined position according to the connection relation of the bipartite graph; further, the block code is subjected to erasure correction and error decoding to obtain a decoding result; further, the decoding result is subjected to descrambling operation to obtain original binary data.
Wherein, in the step (1), the user binary bit data is scrambled and then error correction coded to obtain a series of packet error correction codes with fixed length; the block code selects one of algebraic block code, low-density parity check code, Turbo code, cascade code, product code and nonlinear block code; the code length of the block code as a whole is N bits, and the specific method is as follows:
(1.1) scrambling the binary data of the user, and if the data length of the user is K bits, performing bit-by-bit exclusive OR by adopting a pseudo-random sequence with the same length;
(1.2) coding by adopting a block code to obtain coded N bits;
in the present invention, the block code may be any block code, including linear codes or non-linear codes; multi-level block code coding is also possible, such as Turbo code, serial concatenated code, product code, etc.; the method also includes a cascade coding method which adopts a code with a larger length and then a short-length block code, such as a coding method of a Turbo-Hadamard code and an LDPC-Hadamard code type; other error correction codes are also included, such as codes of the Davey-Mackay architecture, which are overall non-linear block codes capable of correcting various types of errors; the above-described encoding methods can be applied to the present invention.
As shown in FIG. 2, the N bits contained in each error correction code packet are divided into N sub-packets on average in steps (2) to (5) of the present invention, and further assembled into oligonucleotide molecules by:
(1) dividing each block code into 3 blocks, wherein n is 3; a bit sequence of length N/3 from two block codes organized in one molecule;
(2) in FIG. 2, bit sequence a and bit sequence b from block code 1 and block code 2 are further combined, a 4-base system consisting of { A, T, G, C } bases is selected, and two bit blocks a and b constitute an oligonucleotide molecule fragment according to the mapping relationship between bases and bit pairs shown in FIG. 2;
as shown in fig. 3, step 5 described in the present invention is to take B block codes with length N as a whole, which include Bn sub-blocks, where B is a positive integer; the code words of the block code correspond to M oligonucleotide molecules, and M is also a positive integer; the connection between block code and oligonucleotide molecule adopts a regular bipartite graph G (V)1,V2And E) represents a set of nodes V of a class of bipartite graph1Representing B block codes, and the degree of each node is n; another type of node set V of bipartite graph2Represents M oligonucleotide molecules, and the degree of the node is k according to the difference of the adopted base sets; any bipartite graph conforming to the node degree above is used, with Bn ═ Mk; according to the connection relation E of the bipartite graph, the bit sequences with the length of N/N from k block codes are formedA group of oligonucleotide molecule sequence fragments, the length of the molecule sequence fragment is N/N, the specific method is:
(5.1) determining the first type node V of the bipartite graph according to the number B of the participated block codes and the number n of the sub-blocks decomposed by the block codes1The number of the corresponding check nodes is B, and the degree of each node is n;
(5.2) according to the number k of layers which can be formed by using the base system, the invention uses the base system with the base number of 2 raised to the power of k, so that k is a positive integer;
(5.3) to ensure that a suitable bipartite graph G (V) can be constructed1,V2E), requiring k to be a factor of Bn;
(5.4) construction of bipartite graph G (V) by mathematical means1,V2E), class V of nodes of the bipartite graph1The number of the check nodes is B, and the degree of the nodes is n; another set of nodes, called variable nodes, using V2Representing that the number is M, and the node degree is k; in order to ensure the superior performance, the bipartite graph in the invention can be constructed by various methods for constructing the bipartite graph in mathematics, and the constructed bipartite graph is required to have larger girth or better expansibility;
(5.5) each side of the bipartite graph, one end of which is associated with a sub-block of the block code and the other end of which is associated with a layer of bit sequence of a molecule, completing the mapping of block code data to oligonucleotide molecule fragments; edges connected to the same check node are associated to different subblocks of the same block code, and edges connected to the same variable node are associated to different layers of the same molecule;
(5.6) obtaining multi-layered bit data of the assembled molecules according to the above steps, further composing oligonucleotide molecule fragments.
Adding sequences which represent different molecular labels to each combined molecular sequence fragment as described in the step (6) for identifying different sequences, wherein the sequence length is L basic groups, and adding primers at two ends, wherein the length of the primers is P basic groups respectively, so as to form a complete oligonucleotide molecule, and the length is N/N + L + 2P; the molecular label is represented by binary, and can adopt a single grouping error correcting code, the length of the grouping code is Lk bits, and then the grouping code is mapped into L basic groups; l and P are positive integers, and the L is determined according to the number of molecules and by adopting an error correcting code; p is determined according to the requirement of molecular amplification; the specific method comprises the following steps:
(6.1) determining the range of the label according to the number of different oligonucleotide molecular species in the DNA synthesis pool, and representing the label by binary bit groups with proper length; e.g., the number of molecules is Q, Q bits can be used for representation,
Figure BDA0001686564290000081
here, the
Figure BDA0001686564290000082
Represents rounding up; then, coding the binary label by adopting error correction coding with a short code length, or not adopting error correction coding, and obtaining the length of a bit sequence as kL;
(6.2) mapping by the same method as the data part to obtain a base sequence, wherein the length of the base sequence is L;
(6.3) embedding the base sequence representing the molecular marker in a molecular sequence fragment representing the data, which can be ligated to the head, tail or middle of the fragment representing the molecular sequence, or can be dispersed and embedded, and the placement position can be optimized according to the error characteristics of synthesis, amplification and sequencing;
(6.4) adding primers for amplification to form a complete N/N + L +2P oligonucleotide molecule.
Example 2
The embodiment of the invention designs a DNA storage layered representation and interweaving coding method which can contain artificial bases, and combines the method
The examples further illustrate the superior performance achieved by the process of the present invention.
In this embodiment, it is assumed that each length-N molecule code is divided into 4 groups, and each group corresponds to one of two layers of four different molecules; this example uses a conventional four base system of { A, T, G, C } molecules, each divided into two layers. A bipartite graph is constructed, wherein the degree of one type of nodes is 4, and the degree of the other type of nodes is 2, and the bipartite graph can be obtained by using a high-dimensional long 4-rule graph.
As shown in FIG. 4, this example focuses on the distribution of errors caused by synthesis, amplification, and sequencing within different block codes. In DNA storage, some molecules are lost due to the molecular preference of each processing step. And a certain molecule loss can cause more burst errors, often exceeds the error correction capability of the error correction code, and causes a corresponding certain error correction code to fail.
If a molecule is lost, it is based on the bipartite graph G (V)1,V2And E), the connection relation can be easily marked as deletion, so that an erasure decoding method can be adopted to ensure the whole error correction capability of the error correction code. Meanwhile, a small number of base errors in the molecule can also be dispersed to different block codes for error correction, so that the error correction capability of the block codes is improved in probability, and the principle is shown in fig. 4.
In summary, the embodiment of the present invention has the following features through the above design:
(1) the invention can contain artificial base pairs, thereby breaking the limitation of a four-base system, and along with the development of synthetic biology technology, the method can improve the storage density of oligonucleotide molecules; for example, compared with a four-base system, the storage density can be improved by 1.5 times by adopting an eight-base system, namely 2 bits are stored in each base and converted into 3 bits;
(2) the invention can adopt relatively longer error correcting codes to realize the joint coding of a plurality of molecules, and can effectively resist the condition that a single molecule is seriously wrong or is lost without sequencing data; as the number of partitions increases, the impact of single molecule errors on overall performance further decreases;
(3) by adopting the method of layered transmission of each molecule, the relevance of errors of each base symbol is effectively solved, so that the error correction capability of the error correction code can be fully exerted;
(4) the DNA storage mode containing the artificial base is adopted to distinguish the DNA of the existing living system, and the biological safety is better.
Thus, the present invention can be applied to a DNA storage system comprising a plurality of sets of artificial base pairs. For ease of design, the present invention is directed to the use of a base number of powers of 2, e.g., 4 bases, 8 bases or more.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A method for storing a hierarchical representation of DNA that may contain artificial bases and interleaving the coding, the method comprising the steps of:
(1) scrambling the binary bit data of the user, and then carrying out error correction coding to obtain a series of grouped error correction codes with fixed length;
(2) dividing N bits contained in each packet of error correcting code into N sub-packets on average, wherein the number of bits contained in each sub-packet is N/N, wherein N is a positive integer and is a factor of N, and N/N is also an integer;
(3) determining the available oligonucleotide base set according to the biochemical technical method for synthesizing, sequencing and amplifying the tube nucleotide, wherein the size of the set is represented by a positive integer b;
(4) selecting N/N bases to form a group, and mapping each base to k bits { b }1,b2,…,bkSequentially taking the ith bit of N/N basic groups to form a sequence to obtain the ith bit sequence with the length of N/N, wherein i is in the position of [1, k ]]Get betweenA value;
(5) b block codes with the length of N are taken as a whole and comprise Bn sub-blocks, wherein B is a positive integer; the code words of the block code correspond to the molecules of M oligonucleotides; the connection between block code and oligonucleotide molecule adopts a regular bipartite graph G (V)1,V2And E) according to the connection relation E of the bipartite graph, combining k bit sequences with the length of N/N into a group of oligonucleotide molecular sequence segments with the length of N/N;
(6) adding sequences representing different molecular labels to each combined molecular sequence fragment for identifying different sequences, wherein the sequence length is L basic groups, adding primers at two ends, and the primer lengths are P basic groups respectively to form a complete oligonucleotide molecule with the length of N/N + L + 2P;
(7) synthesizing oligonucleotide molecules by a biochemical method, adopting single strands or double strands, and taking the single strands or the double strands of the oligonucleotide molecules as data storage media;
(8) when data is read, a certain number of oligonucleotide molecules are taken for amplification, and then sequencing data is preprocessed;
(9) remapping the data portion corresponding to the oligonucleotide molecule into k bit sequences according to the preprocessed sequencing data;
wherein the bipartite graph G (V)1,V2And E) is:
class-one node set V of bipartite graph1Representing B block codes, and the degree of each node is n; another type of node set V of bipartite graph2Represents M oligonucleotide molecules, and the degree of the node is k according to the difference of the adopted base sets; any bipartite graph using degrees consistent with nodes, with Bn ═ Mk;
one end of each edge of the bipartite graph is associated with a subblock of a block code, and the other end of each edge is associated with a layer of bit sequence of a molecule, so that mapping of block code data to oligonucleotide molecule fragments is completed; edges connected to the same check node are associated to different subblocks of the same block code, and edges connected to the same variable node are associated to different layers of the same molecule.
2. The DNA storage hierarchical representation and interleaving coding method which can contain artificial bases according to claim 1,
the molecular label is represented by binary, a single grouping error correcting code is adopted, the length of the grouping code is Lk bits, and then the grouping code is mapped into L basic groups; l and P are positive integers, and L is determined according to the number of molecules in the same DNA pool and the adopted error correcting code; p is determined by the need for molecular amplification.
3. The method for hierarchical representation and interleaved encoding of DNA storage that can contain artificial bases according to claim 1, wherein said remapping of the data portions corresponding to oligonucleotide molecules into k bit sequences based on preprocessed sequencing data is specifically:
if a certain oligonucleotide molecule is lost due to a small number or label fragment errors, marking the subblocks of the k block codes corresponding to the block as deleted according to the bipartite graph connection relation, and adopting a deletion error correction method when the block codes correct errors so as to improve the error correction performance;
if some segment is not lost, it is mapped to the determined position directly according to the connection relation of bipartite graph, so as to correct deletion and error decoding of the block code, and obtain the decoding result, and carry out the operation of removing the scrambling code, and obtain the original binary data.
CN201810573636.3A 2018-06-06 2018-06-06 DNA storage layered representation and interweaving coding method capable of containing artificial base Active CN110569974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810573636.3A CN110569974B (en) 2018-06-06 2018-06-06 DNA storage layered representation and interweaving coding method capable of containing artificial base

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810573636.3A CN110569974B (en) 2018-06-06 2018-06-06 DNA storage layered representation and interweaving coding method capable of containing artificial base

Publications (2)

Publication Number Publication Date
CN110569974A CN110569974A (en) 2019-12-13
CN110569974B true CN110569974B (en) 2021-08-24

Family

ID=68772753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810573636.3A Active CN110569974B (en) 2018-06-06 2018-06-06 DNA storage layered representation and interweaving coding method capable of containing artificial base

Country Status (1)

Country Link
CN (1) CN110569974B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022073225A1 (en) * 2020-10-10 2022-04-14 中国科学院深圳先进技术研究院 Dna storage-based incremental information management method and device
CN113345521A (en) * 2021-05-31 2021-09-03 天津大学 Coding and recovering method using large fragment DNA storage
CN113611364B (en) * 2021-08-27 2022-04-08 中国人民解放军军事科学院军事医学研究院 DNA sequence processing method and device for DNA storage and electronic equipment
US12131057B2 (en) 2022-01-28 2024-10-29 Western Digital Technologies, Inc. Encoding and integrity markers for molecular storage applications
CN117669703A (en) * 2022-08-17 2024-03-08 密码子(杭州)科技有限公司 Method, apparatus and system for storing information in molecules
CN116633366A (en) * 2023-05-30 2023-08-22 天津大学 DNA data storage high-efficiency coding circuit

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101496292A (en) * 2006-07-27 2009-07-29 原子能委员会 Message-passing decoding method with sequencing according to reliability of vicinity
CN104520864A (en) * 2012-06-01 2015-04-15 欧洲分子生物学实验室 High-capacity storage of digital information in DNA
WO2015062659A1 (en) * 2013-10-31 2015-05-07 Huawei Technologies Co., Ltd. Methods and nodes in a wireless communication system enabling equal error protection with adaptive hierarchical modulation
CN106464270A (en) * 2014-05-21 2017-02-22 三星电子株式会社 Transmitting apparatus and interleaving method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008090885A1 (en) * 2007-01-23 2008-07-31 Panasonic Corporation Radio communication apparatus and temporary bit insertion method
US20170019212A1 (en) * 2014-03-13 2017-01-19 Lg Electronics Inc. Method and device for decoding low-density parity check code for forward error correction in wireless communication system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101496292A (en) * 2006-07-27 2009-07-29 原子能委员会 Message-passing decoding method with sequencing according to reliability of vicinity
CN104520864A (en) * 2012-06-01 2015-04-15 欧洲分子生物学实验室 High-capacity storage of digital information in DNA
WO2015062659A1 (en) * 2013-10-31 2015-05-07 Huawei Technologies Co., Ltd. Methods and nodes in a wireless communication system enabling equal error protection with adaptive hierarchical modulation
CN106464270A (en) * 2014-05-21 2017-02-22 三星电子株式会社 Transmitting apparatus and interleaving method thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Packet-Level Layer-Based Interleaving Unequal Forward Error Correction (LIU-FEC) for Robust Wireless Scalable Video Transmission;Hojin Ha 等,;《Wireless Personal Communications》;20140202;第2014年卷;第2141–2154页 *
Toward a DNA-Based Archival Storage System;James Bornholt 等,;《IEEE Micro》;20170614;第37卷(第3期);第98-104页 *
带纠错机制的DNA存储;王诗薇,;《中国优秀硕士学位论文全文数据库信息科技辑》;20170315;第2017年卷(第3期);第I137-324页 *
纠错码在某些领域的应用;白姗姗,;《中国优秀硕士学位论文全文数据库信息科技辑》;20160815;第2016年卷(第8期);第I136-124页 *

Also Published As

Publication number Publication date
CN110569974A (en) 2019-12-13

Similar Documents

Publication Publication Date Title
CN110569974B (en) DNA storage layered representation and interweaving coding method capable of containing artificial base
Wang et al. High capacity DNA data storage with variable-length Oligonucleotides using repeat accumulate code and hybrid mapping
Bornholt et al. A DNA-based archival storage system
CN109830263B (en) DNA storage method based on oligonucleotide sequence coding storage
Shomorony et al. Capacity results for the noisy shuffling channel
WO2018148260A1 (en) Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna)
Welzel et al. DNA-Aeon provides flexible arithmetic coding for constraint adherence and error correction in DNA storage
CN113345521A (en) Coding and recovering method using large fragment DNA storage
TW201905691A (en) Information coding and information decoding method
CN104520864A (en) High-capacity storage of digital information in DNA
Haughton et al. BioCode: Two biologically compatible Algorithms for embedding data in non-coding and coding regions of DNA
US20070042372A1 (en) Method for designing dna codes used as information carrier
CN111858507B (en) DNA-based data storage method, decoding method, system and device
Schoeny et al. Novel combinatorial coding results for DNA sequencing and data storage
Xue et al. Construction of GC-balanced DNA with deletion/insertion/mutation error correction for DNA storage system
Shomorony et al. Torn-paper coding
Heinis et al. Survey of information encoding techniques for dna
Zan et al. A hierarchical error correction strategy for text DNA storage
CN110932736A (en) DNA information storage method based on Raptor code and quaternary RS code
Park et al. Iterative coding scheme satisfying gc balance and run-length constraints for dna storage with robustness to error propagation
Lin et al. Managing reliability skew in DNA storage
CN113300720A (en) Method for identifying insertion deletion section of long DNA sequence storage
Xue et al. Notice of violation of IEEE publication principles: Construction of GC-balanced DNA with deletion/insertion/mutation error correction for DNA storage system
Wei et al. Dna storage: A promising large scale archival storage?
Bi et al. Extended XOR Algorithm with Biotechnology Constraints for Data Security in DNA Storage

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant