CN110322927B

CN110322927B - CRISPR (clustered regularly interspaced short palindromic repeats) induced RNA (ribonucleic acid) library design method

Info

Publication number: CN110322927B
Application number: CN201910712069.XA
Authority: CN
Inventors: 王建新; 李涛; 王劭恺; 严承; 李敏
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2019-08-02
Filing date: 2019-08-02
Publication date: 2021-04-09
Anticipated expiration: 2039-08-02
Also published as: CN110322927A

Abstract

The invention discloses a CRISPR-induced RNA library design method comprising the following steps: step one, generating a kmer set from a reference genome; step two, cutting each kmer into two parts, kmer1 and kmer2, grouping the kmer2 whose corresponding kmer1 are identical into one category, and constructing the kmer2 of each category into one retrieval tree whose key sequence is the kmer1 shared by that category; and step three, obtaining the induced RNAs and their off-target sequences in parallel. In step three, when a kmer is compared with the kmers formed by joining the key sequence of a retrieval tree with the kmer2 stored in that tree, the kmer1 of the kmer is first compared with the key sequence of the retrieval tree to judge whether a set condition is met; only if the condition is met is the kmer2 of the kmer then compared with the kmer2 in the retrieval tree. The invention improves calculation efficiency.

Description

CRISPR (clustered regularly interspaced short palindromic repeats) induced RNA (ribonucleic acid) library design method

Technical Field

The invention belongs to the field of functional genomics, and particularly relates to a CRISPR (clustered regularly interspaced short palindromic repeats) induced RNA library design method.

Background

Among current genome editing technologies, the CRISPR system (clustered regularly interspaced short palindromic repeats) and the Cas9 nuclease (CRISPR-associated RNA-guided nuclease 9) have developed the fastest. The CRISPR system can easily target almost any genomic position, which is a major step forward for genetic engineering. The Cas9 protein of the CRISPR system localizes the complex of induced RNA and protein to a DNA target sequence by exploiting complementary base pairing between the induced RNA (guide RNA, gRNA) and the DNA target sequence. Binding to a protospacer adjacent motif (PAM) downstream of the target site helps direct Cas9 to cleave the DNA double strand; the PAM is necessary for Cas9 cleavage. The induced RNA is a key component of the CRISPR-Cas system and consists of an invariant part and a variable part. The variable part is complementary to the DNA target sequence and can be designed artificially so that the induced RNA binds DNA at different sites. CRISPR-induced RNA libraries are critical to genome editing systems. With the advancement of genome sequencing technology, the design of induced RNA libraries becomes increasingly important for understanding genome function.

Many induced RNA design tools, such as CRISPR Design, Cas-OFFinder, CRISPRscan, and E-CRISP, have been developed for genome editing. However, the sets of induced RNAs these tools return differ from one another while overlapping, they neglect the noncoding regions of the whole genome, and they rely on third-party alignment tools for off-target sequence searches.

Recently, the GuideScan software improved the construction of CRISPR induced RNA databases. GuideScan is an open-source system that can design more comprehensive or fully customized induced RNA libraries for any genome or CRISPR endonuclease. In addition, GuideScan can construct both single and paired induced RNA databases, and can obtain a considerable number of induced RNAs with multiple perfect target sites. As a result, the induced RNAs obtained by GuideScan are on average more specific than those obtained by other tools. However, the computational cost of GuideScan is relatively high, and it is especially high when computing the off-target sequences of the induced RNAs (sequences whose number of mismatches with the induced RNA is not less than parameter M and not more than parameter Q, Q > M). It is therefore impractical to apply GuideScan to large genomes. As more and more eukaryotic genomes are sequenced or re-sequenced, more efficient tools are needed to accelerate the design of CRISPR-induced RNA libraries.

Disclosure of Invention

The invention aims to provide a CRISPR-induced RNA library design method, which can effectively reduce the calculation time overhead for designing a CRISPR-induced RNA library.

A method of CRISPR-induced RNA library design comprising the steps of:

step one, scanning the reference genome for kmers that have a standard PAM or a non-standard PAM as a prefix or suffix, to form a kmer set, namely the targetable genome space;

step two, cutting each kmer in the kmer set into two parts, kmer1 and kmer2, wherein kmer1 is the sequence consisting of the first n bases; grouping the kmer2 whose corresponding kmer1 are identical into one category, and taking that shared kmer1 as the key sequence of the category; constructing the kmer2 of each category into one retrieval tree (dictionary tree), thereby obtaining a plurality of retrieval trees (the data task is divided into a series of small tasks), the key sequence of each retrieval tree being the key sequence of the kmer2 of the corresponding category;

step three, classifying all the induced RNAs according to kmer1 and, for every category of induced RNAs, traversing all retrieval trees to search for kmers whose Hamming distance from the induced RNA is smaller than Q; the method for traversing all retrieval trees for one category of induced RNAs comprises: first, calculating the Hamming distance m between the kmer1 of that category and the key sequence of each retrieval tree, and selecting the retrieval trees for which m is not more than Q; then, for each induced RNA in the category, searching each selected retrieval tree for kmer2 whose Hamming distance from the kmer2 of the induced RNA is not more than Q-m; joining each kmer2 found with the key sequence of the retrieval tree in which it is located to form kmers, wherein the sequences in the reference genome that pair complementarily with these kmers are the off-target sequences of the induced RNA (a DNA sequence in the reference genome is double-stranded, and the bases on one strand pair complementarily with the bases on the other; if the Hamming distance between a kmer sequence on one strand and the induced RNA is not less than M and not more than Q, the number of mismatches between the induced RNA and the corresponding sequence on the other strand is likewise not less than M and not more than Q, and that sequence is an off-target sequence of the induced RNA);

the induced RNA and its off-target sequence information thus obtained are CRISPR-induced RNA libraries.

These steps apply a data preprocessing optimization. Because all induced RNAs in a category share the same kmer1, and all kmer2 in a retrieval tree share the same key sequence, the induced RNAs are classified first, and the kmer1 of a category is compared with the key sequence of a retrieval tree only once. This greatly reduces the computation and time cost of comparing induced RNAs with kmers, and avoids a full-sequence comparison of every induced RNA with every kmer. Moreover, when the Hamming distance m between the kmer1 of a category of induced RNAs and the key sequence of a retrieval tree is larger than Q, the kmer2 of that category need not be compared with the kmer2 in that retrieval tree at all, which reduces the computation and time cost further.
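The pruning described above can be illustrated with a minimal Python sketch. It is not the patented implementation: each retrieval tree is stood in for by a plain list of kmer2 strings, and the names hamming, find_off_targets and trees are hypothetical. It only shows how one comparison of kmer1 against a key sequence can rule out a whole tree, and how the remaining mismatch allowance applies to kmer2.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_off_targets(guide, trees, n, M, Q):
    """Return [(kmer, distance)] with M <= distance <= Q.

    trees maps a key sequence (kmer1) to the kmer2 strings of that category;
    a plain list stands in for the retrieval tree (assumption for brevity).
    """
    g1, g2 = guide[:n], guide[n:]
    hits = []
    for key, kmer2_list in trees.items():
        m = hamming(g1, key)      # one comparison per tree
        if m > Q:
            continue              # the whole tree is skipped
        for k2 in kmer2_list:
            d = m + hamming(g2, k2)
            if M <= d <= Q:
                hits.append((key + k2, d))
    return sorted(hits)


trees = {"AC": ["GTAC", "GTTT"], "TT": ["GTAC"], "GG": ["TTTT"]}
print(find_off_targets("ACGTAC", trees, n=2, M=1, Q=2))
```

With Q lowered to 1, the trees keyed "TT" and "GG" are rejected after comparing only two bases each, which is the saving the paragraph above describes.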

Further, in the first step, kmers prefixed or postfixed with standard PAM or non-standard PAM are scanned in parallel in a plurality of sequences of the reference genome.

Further, scanning the plurality of sequences of the reference genome for kmers prefixed or suffixed with a standard PAM or a non-standard PAM is taken as a total task and divided into a plurality of subtasks, each subtask being the scan of one sequence of the reference genome; the function of a process pool is simulated with Process and Queue of the Python multiprocessing module, and the subtasks are executed in parallel.

Further, in step two, the kmer2 of the plurality of categories are built into their respective search trees in parallel.

Further, building the kmer2 of the plurality of categories into their respective search trees is taken as a total task and divided into a plurality of subtasks, each subtask being the construction of one search tree from the kmer2 of one category; the function of a process pool is simulated with Process and Queue of the Python multiprocessing module, and the subtasks are executed in parallel.

Further, in step two, all the search trees are traversed in parallel, and for each kmer2 it is determined whether the kmer formed by joining it with the key sequence of its search tree occurs more than once in the reference genome.

Further, this traversal is taken as a total task and divided into a plurality of subtasks, each subtask being the traversal of one search tree to determine, for each kmer2 in that tree, whether the kmer formed by joining it with the key sequence of the tree occurs more than once in the reference genome; the function of a process pool is simulated with Process and Queue of the Python multiprocessing module, and the subtasks are executed in parallel.

Further, in step two, each kmer in the kmer set that has a non-standard PAM as its prefix or suffix, or that occurs more than once in the reference genome, is a non-induced RNA; every other kmer is a candidate induced RNA;

classifying all candidate induced RNAs according to kmer1, traversing all search trees for all the candidate induced RNAs of all classes in parallel, and judging whether kmers with Hamming distances smaller than M exist, if so, the candidate induced RNAs are non-induced RNAs, otherwise, the candidate induced RNAs are induced RNAs.

Further, traversing all search trees in parallel for all categories of candidate induced RNAs, to judge whether kmers with a Hamming distance smaller than M exist, is taken as a total task and divided into a plurality of subtasks, each subtask being the traversal of all search trees for one category of candidate induced RNAs to judge whether such kmers exist; the function of a process pool is simulated with Process and Queue of the Python multiprocessing module, and the subtasks are executed in parallel.

Further, the method for traversing all search trees for one category of candidate induced RNAs and judging whether a kmer with a Hamming distance smaller than M exists is specifically as follows: first, the Hamming distance m between the kmer1 of the category and the key sequence of each retrieval tree is calculated, and the retrieval trees for which m is not more than M are selected; then, for each candidate induced RNA in the category, each selected retrieval tree is searched for a kmer2 whose Hamming distance from the kmer2 of the candidate induced RNA is not more than M-m, and if one is found, a kmer with a Hamming distance smaller than M from the candidate induced RNA exists. These steps apply a data preprocessing optimization. Because all candidate induced RNAs in a category share the same kmer1, and all kmer2 in a retrieval tree share the same key sequence, the candidate induced RNAs are classified first, and the kmer1 of a category is compared with the key sequence of a retrieval tree only once. This greatly reduces the computation and time cost of comparing candidate induced RNAs with kmers, and avoids a one-by-one full-sequence comparison of every candidate induced RNA with every kmer. Moreover, when the Hamming distance m between the kmer1 of a category of candidate induced RNAs and the key sequence of a retrieval tree is larger than M, the kmer2 of that category need not be compared with the kmer2 in that retrieval tree, which reduces the computation and time cost further.
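As an illustration only, the two-stage search can be sketched in Python with a dictionary tree that is walked under a mismatch allowance. The nested-dict trie, the "$" end marker, and the names search_trie and find_similar are assumptions made for this sketch and are not taken from GuideScan or from the method itself. The sketch searches each tree with allowance M-m as stated above, then keeps only hits whose total distance is above 0 (to exclude the candidate itself) and below M, which is one reading of the similarity criterion.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def search_trie(node, query, budget, prefix=""):
    """Return [(kmer2, mismatches)] for stored kmer2 within `budget` of `query`.

    A trie node is a dict mapping a base to a child node; the key "$"
    marks the end of a stored kmer2 (hypothetical layout).
    """
    if not query:
        return [(prefix, 0)] if "$" in node else []
    hits = []
    for base, child in node.items():
        if base == "$":
            continue
        cost = 0 if base == query[0] else 1
        if cost > budget:
            continue  # this branch already exceeds the allowance
        for kmer2, d in search_trie(child, query[1:], budget - cost, prefix + base):
            hits.append((kmer2, d + cost))
    return hits


def find_similar(candidate, tries, n, M):
    """Return [(kmer, distance)] with 0 < distance < M over all trees."""
    c1, c2 = candidate[:n], candidate[n:]
    similar = []
    for key, trie in tries.items():
        m = hamming(c1, key)
        if m >= M:
            continue  # kmer1 alone rules this tree out
        for kmer2, d2 in search_trie(trie, c2, M - m):
            total = m + d2
            if 0 < total < M:
                similar.append((key + kmer2, total))
    return sorted(similar)


tries = {
    "AC": {"G": {"T": {"$": True}, "A": {"$": True}}},
    "TT": {"G": {"T": {"$": True}}},
}
print(find_similar("ACGT", tries, n=2, M=2))
```

A candidate for which find_similar returns a non-empty list would be marked as a non-induced RNA under the criterion above.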

Further, in the third step, all search trees are traversed for all classes of induced RNAs in parallel, and kmers with Hamming distances smaller than Q are searched.

Furthermore, in step three, traversing all retrieval trees in parallel for all categories of induced RNAs, to search for kmers whose Hamming distance from the induced RNA is smaller than Q, is taken as a total task and divided into a plurality of subtasks, each subtask being the traversal of all retrieval trees for one category of induced RNAs to search for such kmers; the function of a process pool is simulated with Process and Queue of the Python multiprocessing module, and the subtasks are executed in parallel.

Further, the method for simulating the function of a process pool with Process and Queue of the Python multiprocessing module and executing a plurality of subtasks in parallel is specifically as follows:

a plurality of processes are created with Process of the Python multiprocessing module; the parameters common to all subtasks are taken as the fixed parameters of each process, and the characteristic parameters of each subtask are put into a queue; each process takes one group of characteristic parameters from the queue at a time and executes one subtask according to those characteristic parameters and the fixed parameters, so that the plurality of processes execute a plurality of subtasks in parallel; after a process finishes a subtask, it takes another group of characteristic parameters from the queue and executes a new subtask; this continues until all characteristic parameters have been taken from the queue and all subtasks have been executed; here a characteristic parameter is a parameter that distinguishes one subtask from the others.
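The scheme above might be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the invention: the helper names run_subtasks, _worker and count_mismatches are hypothetical, a None sentinel is used to tell each process that the queue is exhausted, and the "fork" start method (POSIX only) is chosen so that the fixed parameters are inherited by the child processes rather than serialized.

```python
import multiprocessing as mp


def _worker(task_queue, result_queue, func, fixed_args):
    """Take characteristic parameters from the queue until a None sentinel."""
    while True:
        item = task_queue.get()
        if item is None:
            break
        task_id, params = item
        result_queue.put((task_id, func(params, *fixed_args)))


def run_subtasks(func, param_list, fixed_args=(), n_procs=2):
    """Run func(params, *fixed_args) for every params, n_procs at a time."""
    # Assumption: "fork" lets children inherit fixed_args (for example
    # search trees) without pickling them; it is unavailable on Windows.
    ctx = mp.get_context("fork")
    task_queue, result_queue = ctx.Queue(), ctx.Queue()
    procs = [
        ctx.Process(target=_worker,
                    args=(task_queue, result_queue, func, fixed_args))
        for _ in range(n_procs)
    ]
    for p in procs:
        p.start()
    for task_id, params in enumerate(param_list):
        task_queue.put((task_id, params))   # characteristic parameters
    for _ in procs:
        task_queue.put(None)                # one sentinel per process
    # Collect results before joining so the children can flush their output.
    # Note: if func raises in a child, this simple sketch would block here.
    results = dict(result_queue.get() for _ in param_list)
    for p in procs:
        p.join()
    return [results[i] for i in range(len(param_list))]


def count_mismatches(kmer, reference):
    """Toy subtask: Hamming distance between kmer and a fixed reference."""
    return sum(a != b for a, b in zip(kmer, reference))


if __name__ == "__main__":
    print(run_subtasks(count_mismatches, ["ACGT", "ACGA", "TTTT"],
                       fixed_args=("ACGT",)))
```

Results are keyed by task number because processes may finish in any order.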

Research underlying the invention shows that the computation GuideScan performs when constructing a CRISPR-induced RNA library is data parallel. A method is therefore designed in which the library construction of GuideScan is carried out by multi-process parallel processing, comprising: determining the targetable genome space in parallel, namely scanning the plurality of sequences of the reference genome in parallel to generate the kmer set; for each kmer in the kmer set, cutting the kmer into two parts, kmer1 and kmer2, and grouping the kmer2 whose corresponding kmer1 are identical into one category; constructing the retrieval trees by category, namely constructing the kmer2 of each category into one retrieval tree (dictionary tree) whose key sequence is the kmer1 corresponding to those kmer2; filtering out non-candidate induced RNAs in parallel, namely traversing all retrieval trees in parallel, marking a kmer2 as the kmer2 of a non-candidate induced RNA if its corresponding kmer has a non-standard PAM as a prefix or suffix or occurs more than once in the reference genome, and then extracting the remaining unmarked kmer2, whose combinations with the key sequences of their retrieval trees are the candidate induced RNAs; filtering out, in parallel, candidate induced RNAs that have similar kmers in the retrieval trees, namely, for each candidate induced RNA, trying to find in the retrieval trees another kmer whose Hamming distance from the candidate induced RNA is smaller than M, and, if one exists, marking the kmer2 of the candidate induced RNA and of all kmers similar to it as kmer2 of non-candidate induced RNAs; obtaining the induced RNAs in parallel, namely traversing all retrieval trees in parallel and extracting the kmer2 not marked as kmer2 of non-candidate induced RNAs, whose combinations with the key sequences of their retrieval trees are the final induced RNAs; and calculating the off-target sequences of the induced RNAs in parallel, namely trying to find in the retrieval trees, for each induced RNA, the kmers whose Hamming distance from it is not less than M and not more than Q, taking the sequences in the reference genome that pair complementarily with these kmers as the off-target sequences of the induced RNA, and finally generating a BAM file from the information on the induced RNAs and their off-target sequences.

The invention divides the task of designing the induced RNA library into a series of subtasks according to the classified data, then simulates the function of a process pool and executes the subtasks in parallel with multiple processes. In addition, the data are preprocessed before each subtask starts, which reduces the subsequent computation, greatly shortens the search time, and improves the search efficiency. The invention thus improves calculation efficiency through a multi-process mechanism combined with a data preprocessing optimization.

The method is mainly data parallel, so its parallel design can also be distributed. Distributed parallelism can use more machine resources, but it also incurs more communication overhead. Because step one and step two of the method (the first 6 steps in the embodiment) take little time, running them in a distributed manner is unnecessary and would only add communication overhead. Computing the off-target sequence information of the induced RNAs in parallel is the most time-consuming step. Since the kmer classification differs greatly between data sets, the data size of each task must be considered and the data distributed sensibly across the machines to avoid data skew. Furthermore, because the search tree data structure cannot be serialized, the search tree information must be stored in a temporary file, copied and distributed to each machine, and read back on each machine before further parallel processing. Finally, all results are merged. Parallel processing on a distributed cluster platform makes more computing resources available than a single machine can offer.

Advantageous effects:

The invention (MultiGuideScan) provides a CRISPR-induced RNA library design method. First, a kmer set is generated from the genome sequence; each kmer is then divided into two parts, kmer1 and the remainder kmer2; the kmer2 whose corresponding kmer1 are identical are grouped into one category, and that kmer1 is taken as the key sequence of the category. The kmer2 of each category are constructed into one retrieval tree (dictionary tree) whose key sequence is the key sequence of that category, so that a plurality of retrieval trees is obtained. Non-induced RNAs are filtered out of each retrieval tree to obtain the induced RNAs, and the data task of searching all kmers for the off-target sequences of the induced RNAs is divided into a series of small tasks. Because the small tasks can be executed in parallel, the computation time for designing a CRISPR-induced RNA library is effectively reduced. In addition, each induced RNA is divided into kmer1 and kmer2, the Hamming distance between kmer1 and the key sequence of each retrieval tree is calculated, and the retrieval trees that do not meet the distance requirement are excluded, which reduces the subsequent kmer2 matching computation when the off-target sequence information of the induced RNA is obtained. Moreover, the method simulates the function of a process pool to process the subtasks of the algorithm in parallel, exploiting multiple processors to increase the calculation speed. The efficient parallel algorithm of the invention makes it feasible to design induced RNA libraries for large genomes.

The method implements a multi-process parallel version of the GuideScan software to accelerate the design of CRISPR-induced RNA libraries. GuideScan is implemented in Python combined with C++. In CPython, the mainstream Python interpreter, the Global Interpreter Lock (GIL) is a mutual exclusion lock that protects access to Python objects and prevents multiple threads from executing Python bytecode simultaneously. Threads in Python therefore cannot be used for parallel computation, and for CPU-intensive tasks multithreading brings no speedup. The Python multiprocessing module, by contrast, allows programs to run in parallel and use all CPU cores. Most tasks in the method are computation-intensive, so a multi-process approach is adopted to improve the execution efficiency of the algorithm. The multiprocessing module gives each process its own Python interpreter, and therefore its own GIL and its own memory space; by running subprocesses on multiple CPU cores it bypasses the GIL limit in CPython, and it is comparatively easy to use. When the number of tasks is small, processes can be created dynamically with Process of the multiprocessing module. When the number of objects to process is particularly large, however, managing processes by hand is cumbersome, and a process pool (Pool) is the effective choice. The process pool provides a specified number of processes. When a new task is submitted to the pool and the pool is not full, a new process is created directly to execute the task; if the number of processes in the pool has reached the specified maximum, the task waits in a queue, and as soon as a process finishes its task it takes a new task from the queue. The limited number of processes is thereby used efficiently, and processes are not continually created and destroyed. However, the arguments passed to a process pool must be serialized before transmission, and the search tree structure cannot be serialized, so the process pool cannot be used here. The invention instead simulates the function of a process pool with Queue and Process of the multiprocessing module. The construction and the search of each retrieval tree are each regarded as one task, so that many tasks can be submitted at once and processes are dispatched dynamically: whenever a process is idle and tasks are waiting, the idle process takes a waiting task from the queue, which improves the execution efficiency of the tasks.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of the present invention (hereinafter referred to as MultiGuideScan), wherein (a) is the MultiGuideScan workflow; (b) is an example reference genome; (c) shows the kmers extracted from the reference genome and their coordinate information; (d) is the result of classifying the kmers by their first n bases, which constitute kmer1 and serve as the key sequence (key) of the category, kmer1 being converted to an index number via its quaternary representation and the remainder of the kmer being kmer2; (e) shows the retrieval trees constructed from the kmer2, each of which stores the kmer2 together with information such as the number of occurrences and coordinates of the corresponding kmer in the genome, and takes kmer1 as its key sequence (key) and the corresponding index number as its index number;

FIG. 2 is a schematic diagram of information on off-target sequences of parallel-computing induced RNAs.

FIG. 3 is the amount of induced RNA obtained from yeast datasets

FIG. 4 shows the amount of induced RNA obtained from C.elegans dataset

FIG. 5 is the amount of induced RNA obtained from Drosophila melanogaster data set

FIG. 6 is a comparison of total run time performance of MultiGuideScan and GuideScan for Q of 3 and n of 4

FIG. 7 is a comparison of total run time performance of MultiGuideScan and GuideScan at Q of 4 and n of 5

Detailed Description

The present invention will be further described with reference to the following examples.

The embodiment provides a method for designing a CRISPR-induced RNA library, which specifically comprises the following steps:

step 1: the targetable genomic space is determined in parallel.

To generate the induced RNA library the user requires, the method lets the user supply a reference genome FASTA file, the length of the induced RNA, the standard PAM (protospacer adjacent motif), the PAM position relative to the induced RNA target sequence, the non-standard PAM, and the Hamming distances M and Q. As shown in fig. 1(b, c), based on these parameters the algorithm scans the reference genome for kmers prefixed or suffixed with the standard PAM or the non-standard PAM, and records related information such as the corresponding coordinates and direction (the position and the coding direction of the kmer within a given reference sequence). The scan can be parallelized: the reference genome file contains a plurality of reference sequences, and the scans of different reference sequences do not affect one another, so the scan of each reference sequence is regarded as one task and the tasks are executed in parallel by multiple processes. Each task stores the kmers, their PAMs and their coordinate information; all scan results are then merged into the kmer set, namely the targetable genome space. The number of occurrences of each kmer in the reference genome is also counted.
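For illustration, the scan of one reference sequence (one subtask) might look like the sketch below. Several simplifications are assumptions of this sketch rather than statements about the method: the window of the given length is taken to include the PAM as its suffix, only the wildcard "N" is supported in the PAM pattern, both strands are scanned by way of the reverse complement, and the names scan_sequence, pam_matches and count_kmers are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def pam_matches(site, pam):
    """True if site matches the PAM pattern; 'N' matches any base."""
    return all(p == "N" or p == s for p, s in zip(pam, site))


def scan_sequence(name, seq, length=23, pam="NGG"):
    """Yield (kmer, name, start, strand) for windows whose suffix is the PAM.

    start is the 0-based forward-strand coordinate of the window's left end.
    """
    s = len(pam)
    for strand, text in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(text) - length + 1):
            window = text[i:i + length]
            if pam_matches(window[-s:], pam):
                start = i if strand == "+" else len(seq) - i - length
                yield window, name, start, strand


def count_kmers(hits):
    """Number of occurrences of each kmer across all scan results."""
    return Counter(kmer for kmer, *_ in hits)


hits = list(scan_sequence("chr_demo", "ACGTAGGTT", length=6))
print(hits)
print(count_kmers(hits))
```

Merging the per-sequence results and passing them to count_kmers corresponds to forming the kmer set and counting occurrences as described above.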

Among them, standard PAM and non-standard PAM are also referred to as classical PAM and non-classical PAM in some literature; each PAM consists of s bases; each kmer consists of N bases; the base types comprise t types; in this example, s is 3, N is 23, t is 4, and the types of bases are classified into A, C, G, T four types.

Step 2: segmenting and classifying the kmers in the kmer set.

As shown in fig. 1(d), each kmer is first split into two parts, kmer1 and kmer2: kmer1 has length n (n = Q + 1 in this example) and is the sequence of the first n bases of the kmer, and kmer2 is the remaining sequence of the kmer. Then the kmer2 whose corresponding kmer1 are identical are grouped into one category, and that kmer1 is used as the key sequence of the category. Since a sequence consists of the four bases A, C, G and T, all kmers are divided into 4ⁿ categories. For convenience, the four bases A, C, G and T in the key sequence are replaced by the four digits 0, 1, 2 and 3 to obtain a quaternary number, which is then converted to a decimal number used as the index number. The key sequence and the index number can be converted into each other; for example, the key sequence "AAAA" corresponds to index number 0, and index number 1 corresponds to the key sequence "AAAC".
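The conversion between key sequence and index number can be written out as a short Python sketch. The function names key_to_index and index_to_key are hypothetical; the digit assignment A=0, C=1, G=2, T=3 follows the description above.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def key_to_index(key):
    """Read the key sequence as a base-4 number and return its decimal value."""
    index = 0
    for base in key:
        index = index * 4 + BASES.index(base)
    return index


def index_to_key(index, n):
    """Inverse of key_to_index for a key sequence of length n."""
    digits = []
    for _ in range(n):
        index, remainder = divmod(index, 4)
        digits.append(BASES[remainder])
    return "".join(reversed(digits))


print(key_to_index("AAAA"), index_to_key(1, 4))
```

For n = 4 the index numbers run from 0 ("AAAA") to 255 ("TTTT"), one per category.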

Step 3: constructing the search trees by category.

As shown in fig. 1(e), one search tree (the search tree structure used in this embodiment is a dictionary tree) is constructed from the kmer2 of each category, giving a plurality of search trees, and the key sequence of each category of kmer2 is used as the key sequence of the corresponding search tree. This step divides the single search tree built from all kmers in the original GuideScan method into a plurality of smaller search trees; all kmer2 in one search tree come from kmers that share the same prefix kmer1, and the information attached to each sequence, such as its coordinates, is the same as for the original kmer. The construction of each search tree is independent, so the construction of the 4ⁿ search trees can be processed in parallel. The arguments of the process pool (Pool) of the Python multiprocessing module must be serialized before transmission, and the search tree structure cannot be serialized, so the process pool cannot be used. This embodiment instead simulates the function of a process pool with Queue and Process of the multiprocessing module. The construction of each search tree is regarded as one task, so that many tasks can be submitted at once and processes are dispatched dynamically: whenever a process is idle and tasks are waiting, the idle process takes a waiting task from the queue, which improves the execution efficiency of the tasks. In this step, all sequences prefixed or suffixed with a non-standard PAM are labeled as non-candidate induced RNAs, while sequences prefixed or suffixed with a standard PAM are labeled as candidate induced RNAs.
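A minimal sketch of building one dictionary tree per category is given below. It is an illustration only: the nested-dict layout, the "$" leaf that stores the occurrence count and coordinates, and the name build_tries are assumptions of this sketch, not the data structure actually used by GuideScan.

```python
from collections import defaultdict


def build_tries(kmer_records, n):
    """Group kmers by their first n bases and build one trie per group.

    kmer_records is an iterable of (kmer, coordinate) pairs. The result maps
    each key sequence (kmer1) to a nested-dict trie over the kmer2 strings;
    the "$" entry of a terminal node holds the count and the coordinates.
    """
    tries = defaultdict(dict)
    for kmer, coord in kmer_records:
        key, kmer2 = kmer[:n], kmer[n:]
        node = tries[key]
        for base in kmer2:
            node = node.setdefault(base, {})
        leaf = node.setdefault("$", {"count": 0, "coords": []})
        leaf["count"] += 1
        leaf["coords"].append(coord)
    return dict(tries)


records = [("ACGT", 0), ("ACGA", 5), ("ACGT", 9), ("TTGT", 3)]
tries = build_tries(records, n=2)
print(sorted(tries))
print(tries["AC"]["G"]["T"]["$"])
```

Because each key's trie depends only on the kmers of its own category, building the tries for different keys can be handed to separate processes as independent tasks.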

Step 4: filtering out non-candidate induced RNAs in parallel.

Based on the kmer set and the number of occurrences of each kmer in the reference genome, if a kmer occurs more than once, its kmer2 is located in the search tree and marked as the kmer2 of a non-candidate induced RNA. The search tree is then traversed to extract the remaining kmer2 (those not marked as kmer2 of non-candidate induced RNAs), and each is joined with the key sequence (kmer1) of the search tree in which it resides to form a candidate induced RNA. The filtering of each search tree is independent, so a process pool can be simulated, the filtering of each search tree being regarded as one task and the tasks being executed in parallel. Moreover, the candidate induced RNAs extracted by one task all come from the same search tree and have the same key sequence.

Step 5: filtering out, in parallel, the candidate induced RNAs that have similar kmers in the search trees.

For each candidate induced RNA, an attempt is made to find another kmer in the search trees that is similar to it, i.e., a kmer whose Hamming distance from the candidate induced RNA is less than M. If one exists, the candidate induced RNA and all kmers similar to it are labeled as non-candidate induced RNAs, and the kmer2 of these kmers are labeled as kmer2 of non-candidate induced RNAs. This ensures that every remaining candidate induced RNA has a Hamming distance of at least M from every other kmer, and that no two candidate induced RNAs are similar to each other.

When candidate induced RNAs with similar kmers exist in the search tree based on multi-process parallel filtering, the following steps are carried out:

step 5.1: classify all candidate induced RNAs by their kmer1, placing candidate induced RNAs with the same kmer1 in one class; the task (filtering out candidate induced RNAs that have similar kmers in the search trees) is divided into subtasks according to the class of the candidate induced RNAs. Each small file containing one class of candidate induced RNAs, together with its index number (the index number converted from the kmer1 of the candidate induced RNAs in the small file), is put into a queue for access by multiple processes; each such small file and its index number are the characteristic parameters of one subtask, where a characteristic parameter is a parameter of a subtask that differs from those of the other subtasks;

step 5.2: create a plurality of processes with the Process method of the multiprocessing module, the number of processes being set by the user; the fixed parameters of each process are all the search trees and the Hamming distance M; an idle process takes a group of characteristic parameters from the queue and executes the corresponding task according to the fixed parameters and the characteristic parameters: it first derives the kmer1 of the candidate induced RNAs in the corresponding file from the index number, then calculates the Hamming distance m between this kmer1 and the key sequence of each search tree, and stores the distances m in a list;

step 5.3: loop over each candidate induced RNA and each search tree: first fetch the corresponding Hamming distance m from the list; if m is greater than or equal to M, ignore the search tree directly (that is, if the Hamming distance m between the kmer1 of the candidate induced RNA and the key sequence of the search tree is greater than or equal to M, no comparison of kmer2 is performed, which reduces the amount of computation); otherwise, search the corresponding search tree for kmer2 whose Hamming distance from the kmer2 of the candidate induced RNA is less than M-m, and if such a kmer2 exists, write it and the index number of the search tree in which it is located into a temporary file;

step 5.4: when all tasks are executed, according to the kmer2 and the index numbers in all the generated temporary files, marking the corresponding kmer2 in the search tree of the corresponding index number as the kmer2 of the non-candidate induced RNA.
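The queue-and-process scheme of steps 5.1 to 5.3 can be sketched with `multiprocessing.Process` and `multiprocessing.Queue`. This is a minimal sketch under several assumptions that are not in the original: the names `worker` and `run_tasks` are hypothetical, each task is just the kmer1 of one class (standing in for a small file plus index number), each worker only performs the kmer1 prefilter of step 5.2, a `None` sentinel tells a worker to stop, and the `fork` start method is requested explicitly, which restricts the sketch to POSIX systems.

```python
import multiprocessing as mp


def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def worker(task_queue, result_queue, keys, M):
    """Take kmer1 classes from the queue until a None sentinel arrives."""
    while True:
        kmer1 = task_queue.get()
        if kmer1 is None:
            break
        # Prefilter: only trees whose key sequence is at distance m < M
        # from this class's kmer1 need their kmer2 compared at all.
        near = sorted(k for k in keys if hamming(kmer1, k) < M)
        result_queue.put((kmer1, near))


def run_tasks(classes, keys, M, n_procs=2):
    """Simulate a process pool: idle workers pull waiting tasks from a queue."""
    ctx = mp.get_context("fork")  # assumption: POSIX; not available on Windows
    task_queue, result_queue = ctx.Queue(), ctx.Queue()
    for kmer1 in classes:
        task_queue.put(kmer1)
    for _ in range(n_procs):
        task_queue.put(None)  # one stop sentinel per worker
    procs = [
        ctx.Process(target=worker, args=(task_queue, result_queue, keys, M))
        for _ in range(n_procs)
    ]
    for p in procs:
        p.start()
    # Collect one result per task before joining the workers.
    results = dict(result_queue.get() for _ in classes)
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    print(run_tasks(["AC", "GT"], ["AC", "AT", "GT", "GG"], M=2))
```

In this sketch the fixed parameters (`keys`, `M`) are passed once when each process is created, while the characteristic parameters travel through the queue, which matches the division described in step 5.2.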

Wherein, given a kmer2, the search of a search tree for kmer2 whose Hamming distance from it is less than M-m proceeds as follows:

a: first, the Hamming distance h is set to 0; starting from the 1st layer of the search tree, the base of the 1st node is compared with the 1st base of the kmer2; if they are the same, h is unchanged, otherwise h is increased by 1; the 1st child node of that node is then compared with the next base of the kmer2, and so on;

b: if h is still smaller than M-m when a leaf node is reached, the kmer corresponding to the kmer2 of that branch is a similar sequence, and its related information is recorded; the comparison then continues with the sibling nodes of that node, and once all siblings have been compared, it returns to the parent node and continues with the sibling nodes of the parent node;

c: when the jth node of the ith layer is compared with the ith base of the kmer2, if h equals M-m, the kmer2 on that branch is not a similar sequence to be found, and the (j+1)th node of the ith layer is compared with the ith base of the kmer2; if h is less than M-m, the 1st child node of the node is compared with the (i+1)th base of the kmer2;

d: and so on, until all the search trees have been traversed and all similar sequences of the kmer2, together with their related information, have been obtained.
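Procedure a to d above is a depth-first traversal that abandons a branch as soon as the running mismatch count h reaches the budget M-m. A minimal recursive sketch follows; the names `build_trie` and `search_within` are hypothetical, the trie is a nested dictionary, and the per-leaf information recorded by the embodiment is reduced to the matched kmer2 and its distance.

```python
def build_trie(words):
    """Nested-dict trie over equal-length strings."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def search_within(node, query, budget, depth=0, h=0, prefix=""):
    """Yield (kmer2, h) for every stored kmer2 with Hamming distance h < budget."""
    if depth == len(query):
        yield prefix, h  # reached a leaf with h still below the budget
        return
    for base, child in node.items():
        new_h = h + (base != query[depth])
        if new_h >= budget:
            continue  # prune this branch and move on to the sibling node
        yield from search_within(child, query, budget, depth + 1, new_h, prefix + base)


trie = build_trie(["GTA", "GGA", "CCC"])
# With M = 2 and m = 0 the budget is M - m = 2, so only h in {0, 1} is accepted.
similar = sorted(search_within(trie, "GTA", budget=2))
```

Here the branch for `CCC` is abandoned at its second base, so its third base is never compared.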

Step 6: Obtain the induced RNAs in parallel. The same operation as in step 4 is performed: all the search trees are traversed in parallel by multiple processes, and all remaining candidate induced RNAs are extracted to obtain the final induced RNA set.

Step 7: Calculate the off-target sequences of the induced RNAs in parallel.

For all induced RNAs, the search trees are searched for off-target sequences that mismatch them, where the Hamming distance is required to be greater than or equal to M and less than or equal to Q. This step is illustrated schematically in FIG. 2; the task is divided into 4ⁿ subtasks that are processed in parallel.

The multi-process parallel calculation of the off-target sequence information of the induced RNAs proceeds as follows:

step 7.1: classify all induced RNAs by their kmer1, placing induced RNAs with the same kmer1 in one class; for each class, given the small file containing that class of induced RNAs and its index number (the index number converted from the kmer1 of the induced RNAs in the small file), create a corresponding small SAM file for writing off-target sequence information; put each small file containing one class of induced RNAs, its index number and its small SAM file into a queue for access by multiple processes; each queue entry, consisting of the small file of one class of induced RNAs, the corresponding index number and the small SAM file, is the characteristic parameters of one subtask, where a characteristic parameter is a parameter of a subtask that differs from those of the other subtasks;

step 7.2: create a plurality of processes with the Process method of the multiprocessing module, the number of processes being set by the user; the fixed parameters of each process are all the search trees and the Hamming distance Q; an idle process takes a group of characteristic parameters from the queue and executes the corresponding task according to the fixed parameters and the characteristic parameters: it first derives the kmer1 of the induced RNAs in the corresponding file from the index number, then calculates the Hamming distance m between this kmer1 and the key sequence of each search tree, and stores the distances m in a list;

step 7.3: each idle process takes a small file of induced RNAs from the queue and reads from it the induced RNA sequences and the corresponding index number. The index number is converted into a key sequence, the Hamming distance m between this key sequence and the key sequence of each search tree is calculated, and the results are put into a list for subsequent access.

Step 7.4: for each induced RNA, its related information is recorded and it is cut into kmer1 and kmer2; for each search tree, the Hamming distance m between the kmer1 of the induced RNA and the key sequence of the search tree is fetched from the stored list according to the index number of the file containing the induced RNA. If m is greater than Q, the search tree is ignored (that is, no comparison of kmer2 is performed, which reduces the amount of computation); otherwise the search tree is searched for kmer2 whose Hamming distance from the kmer2 of the induced RNA is not greater than Q-m. Each such kmer2 is joined with the key sequence of the search tree in which it is located to form a kmer, and the sequences in the reference genome that pair complementarily with these kmers are the off-target sequences of the induced RNA. When all the search trees have been searched, all the off-target sequences obtained and their related information are converted into hexadecimal form.

Step 7.5: the information related to the induced RNA and the corresponding off-target sequence information are written into the small SAM file. All the small SAM files are then converted into BAM files (the binary form of the SAM format), the small BAM files are merged, and an index is built for the merged result so that it can be accessed quickly with the SAMtools tool.
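Steps 7.2 and 7.3 rely on converting between a class index number and its kmer1, and on precomputing the distance m from that kmer1 to every key sequence. The text does not state how the index number is encoded, so the sketch below assumes a base-4 encoding with the base order A, C, G, T; the names `kmer1_to_index`, `index_to_kmer1` and `key_distances` are hypothetical.

```python
from itertools import product

BASES = "ACGT"  # assumed digit order: A=0, C=1, G=2, T=3


def kmer1_to_index(kmer1):
    """Read kmer1 as a base-4 number (assumed encoding)."""
    index = 0
    for base in kmer1:
        index = index * 4 + BASES.index(base)
    return index


def index_to_kmer1(index, n):
    """Inverse of kmer1_to_index for a kmer1 of length n."""
    bases = []
    for _ in range(n):
        index, digit = divmod(index, 4)
        bases.append(BASES[digit])
    return "".join(reversed(bases))


def key_distances(index, n):
    """Hamming distance m from this class's kmer1 to each of the 4**n keys.

    Position i of the returned list holds the distance to the key with index i.
    """
    kmer1 = index_to_kmer1(index, n)
    return [
        sum(a != b for a, b in zip(kmer1, key))
        for key in product(BASES, repeat=n)
    ]


distances = key_distances(kmer1_to_index("AC"), n=2)
```

Because all induced RNAs in one file share the same kmer1, this list is computed once per task and then reused for every induced RNA in the file.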

Wherein, given the kmer2 of an induced RNA, the search of a search tree for kmer2 whose Hamming distance from it is not greater than Q-m proceeds as follows:

a: first, the Hamming distance h is set to 0. Starting from the 1st layer of the search tree, the base of the 1st node is compared with the 1st base of the kmer2; if they are the same, h is unchanged, otherwise h is increased by 1. The 1st child node of that node is then compared with the next base of the kmer2, and so on.

b: if h is still less than or equal to Q-m when a leaf node is reached, the kmer2 of that branch is joined with the key sequence of the search tree in which it is located to form a kmer; the sequence in the reference genome that pairs complementarily with this kmer is an off-target sequence of the induced RNA, and its related information is recorded. The comparison then continues with the sibling nodes of that node, and once all siblings have been compared, it returns to the parent node and continues with the sibling nodes of the parent node.

c: when the jth node of the ith layer is compared with the ith base of the kmer2, if h is greater than Q-m, the branch is ignored and the (j+1)th node of the ith layer is compared with the ith base of the kmer2; otherwise, the 1st child node of the node is compared with the (i+1)th base of the kmer2.

d: the above is repeated until all the search trees have been traversed and all off-target sequences of the induced RNA, together with their related information, have been obtained.
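Combining the kmer1 prefilter of step 7.4 with the bounded traversal above gives a two-level search. The sketch below is illustrative only: the names `within` and `off_target_kmers` are hypothetical, the search trees are nested dictionaries keyed by kmer1, and the function returns the matching kmers with their total Hamming distance, leaving out the mapping to complementary genome sequences and the SAM output.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def build_trie(words):
    """Nested-dict trie over equal-length strings."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def within(node, query, budget, depth=0, h=0, prefix=""):
    """Yield (kmer2, h) for every stored kmer2 with Hamming distance h <= budget."""
    if depth == len(query):
        yield prefix, h
        return
    for base, child in node.items():
        new_h = h + (base != query[depth])
        if new_h <= budget:  # otherwise ignore the branch, try the sibling
            yield from within(child, query, budget, depth + 1, new_h, prefix + base)


def off_target_kmers(grna, trees, n, M, Q):
    """Return sorted (kmer, distance) pairs with M <= distance <= Q."""
    kmer1, kmer2 = grna[:n], grna[n:]
    hits = []
    for key, trie in trees.items():
        m = hamming(kmer1, key)
        if m > Q:
            continue  # the whole tree is skipped without touching any kmer2
        for other, h in within(trie, kmer2, Q - m):
            if M <= m + h <= Q:
                hits.append((key + other, m + h))
    return sorted(hits)


trees = {"AC": build_trie(["GTA", "GGC"]), "TT": build_trie(["GTA"])}
hits = off_target_kmers("ACGTA", trees, n=2, M=2, Q=3)
```

In this toy example the induced RNA itself (`ACGTA`, distance 0) is found but rejected by the lower bound M, while one kmer from each tree is reported at distance 2.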

Evaluation of experimental results

To evaluate the performance of the method of the invention, it was compared with the original method, GuideScan, on benchmark datasets for three species from the UCSC Genome Browser (http://hgdownload.soe.ucsc.edu): yeast, Caenorhabditis elegans (C. elegans) and Drosophila melanogaster (D. melanogaster). The input files are the reference genome FASTA files of the three species. The sequences in a genome FASTA file generally fall into the following categories:

(1) Assembled chromosomes: sequences whose names begin with chr1, ..., chrX, chrY, and chrM.

(2) Unlocalized sequences: sequences with the suffix _random, for which the chromosome is known but the orientation and order are not.

(3) Unplaced sequences: sequences with the prefix chrUn_, for which the chromosome is not known.

The main features of the data set used in this example are shown in table 1:

TABLE 1 Main characteristics of the reference data set

This embodiment parallelizes the original GuideScan algorithm by a multi-process method, and its calculation logic is in essence the same as that of the original GuideScan algorithm. The experimental results of this embodiment are identical to those of the original GuideScan algorithm. FIGS. 3, 4 and 5 show the induced RNAs obtained from the datasets of the three species (yeast, C. elegans and D. melanogaster, respectively); the abscissa is the name tag of each sequence in the reference genome, and the ordinate is the number of induced RNAs found in the first 50 kb of each sequence (or in all of its bases, if the sequence is shorter than 50 kb).

Computational time performance comparison analysis

The time performance of the method of this embodiment and of the original GuideScan algorithm was tested on a Linux computer with 40 Xeon E5-2630 v4 2.20 GHz CPUs and 128 GB of memory. Two configurations are used: 1) M = 2, Q = 3, n = 4, induced RNA length 20, standard PAM 'NGG', non-standard PAM 'NAG'; 2) M = 2, Q = 4, n = 5, induced RNA length 20, standard PAM 'NGG', non-standard PAM 'NAG'. A detailed description of the datasets is given in Table 1.

Table 2 describes the steps of the method of this embodiment (MultiGuideScan), where step 2 is an additional step that GuideScan does not have. All steps except step 2 are parallelized.

TABLE 2 description of the steps of MultiGuideScan

Tables 3 and 4 show the per-step run-time performance of the method of this embodiment (MultiGuideScan) and of GuideScan under configurations 1) and 2), respectively. In step 1, because the number of sequences in the reference genome is limited, the calculation speed levels off as the number of processes increases; step 3 cannot scale further because its IO and communication costs are high; step 7 has the greatest time overhead because it is the most computationally complex.

Table 3 MultiGuideScan step run time performance comparison (Q = 3, n = 4)

Table 4 MultiGuideScan step run time performance comparison (Q = 4, n = 5)

FIGS. 6 and 7 show the total run-time performance of the method of this embodiment (MultiGuideScan) and of GuideScan under configurations 1) and 2), respectively. As can be seen, the calculation time decreases as the number of processes increases. In this method, the induced RNAs and the search trees are classified by kmer1, and the Hamming distance between the kmer1 of each class of induced RNAs and the key sequence of each search tree is calculated first. Because all induced RNAs in a class share the same kmer1, and all kmer2 in a search tree share the same key sequence, this design greatly reduces the amount of computation needed to compare induced RNAs with kmers and avoids a full-sequence comparison of every induced RNA with every kmer. Furthermore, when the Hamming distance m between the kmer1 of a class of induced RNAs and the key sequence of a search tree is greater than Q, the kmer2 of that class need not be compared with the kmer2 in that search tree, which reduces the amount of computation again. Thus, when Q is 3, n is 4, and the number of processes used is 1, the time performance of MultiGuideScan is still much better than that of GuideScan, reducing the time overhead by a factor of about 1. As shown in FIG. 6, a 3-fold speedup was obtained with 2 processes, 5-fold with 4 processes, 6-8-fold with 8 processes, 8-10-fold with 16 processes, and 9-12-fold with 32 processes.

When Q is 4 and n is 5, the induced RNAs and the search trees are divided into 4⁵, i.e., 1024, classes. Although in theory a speedup similar to that of configuration 1) should be achieved, in practice the number of tasks grows too large, and the IO and communication overheads grow with it, so that performance does not improve when the number of processes is 1. As the number of processes increases, however, the speedup becomes more and more significant. As shown in FIG. 7, a 1.5-fold speedup was obtained with 2 processes, 2.6-fold with 4 processes, 4.5-fold with 8 processes, 7-fold with 16 processes, and 9-10-fold with 32 processes.

In addition, the algorithm contains a number of special processing parts that cannot be parallelized, as well as IO overhead, communication overhead and other system overhead. According to Amdahl's law, parallelizing a single program cannot speed it up indefinitely. As the number of processes increases, scalability slowly decreases.
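For reference, Amdahl's law can be written as follows, where the symbols are introduced here for illustration and do not appear in the original text: f is the fraction of the run time that can be parallelized, p is the number of processes, and S(p) is the resulting speedup over a single process.

```latex
S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}},
\qquad
\lim_{p \to \infty} S(p) = \frac{1}{1 - f}
```

The non-parallel parts and fixed overheads mentioned above make up the term 1 - f, which bounds the achievable speedup no matter how many processes are used.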

It should be emphasized that the examples described herein are illustrative and not restrictive, and thus the invention is not to be limited to the examples described herein, but rather to other embodiments that may be devised by those skilled in the art based on the teachings herein, and that various modifications, alterations, and substitutions are possible without departing from the spirit and scope of the present invention.

Claims

1. A method of CRISPR-induced RNA library design comprising the steps of:

step one, scanning kmers taking standard PAM or non-standard PAM as a prefix or suffix in a reference genome to form a kmer set;

step two, for each kmer in the kmer set, cutting each kmer into two parts of kmer1 and kmer2, wherein the kmer1 is a sequence consisting of the first n bases, dividing the kmer2 whose corresponding kmer1 are the same into one category, and taking the kmer1 corresponding to the kmer2 of the category as the key sequence of the kmer2 of the category; constructing the kmer2 of the same category into the same retrieval tree, wherein the key sequence of each retrieval tree is the key sequence of the kmer2 of the corresponding category;

for each kmer in the kmer set, if the kmer takes non-standard PAM as a prefix or a suffix or the occurrence frequency of the kmer in a reference genome is more than 1, the kmer is non-induced RNA, otherwise, the kmer is candidate induced RNA; classifying all candidate induced RNAs according to kmer1, traversing all search trees for all classes of candidate induced RNAs in parallel, judging whether kmers with Hamming distances smaller than M exist, if so, determining the candidate induced RNAs as non-induced RNAs, otherwise, determining the candidate induced RNAs as induced RNAs;

step three, classifying all the induced RNAs according to kmer1, traversing all retrieval trees for all the classes of induced RNAs, and searching for kmers with Hamming distances smaller than Q; the method for traversing all retrieval trees for the induced RNA of one category and searching the kmer with the Hamming distance smaller than Q comprises the following steps: firstly, calculating the Hamming distance between the kmer1 of the class of induced RNA and the key sequence of each retrieval tree, and finding out the retrieval tree of which the Hamming distance between the key sequence and the kmer1 of the class of candidate induced RNA is not more than Q; then, aiming at each candidate induced RNA in the category, searching for a kmer2 with the hamming distance of the kmer2 of the candidate induced RNA being not more than Q-m in the found retrieval tree; connecting the kmers 2 with the key sequences of the retrieval trees where the kmers 2 are located to form a plurality of kmers, wherein the sequences which are complementarily matched with the kmers in the reference genome are off-target sequences of the induced RNA;

2. The method of claim 1, wherein in step one, kmers prefixed or postfixed with standard PAM or non-standard PAM are scanned in parallel over a plurality of sequences of a reference genome.

3. The method of claim 2, wherein the kmer prefixed or postfixed by standard PAM or non-standard PAM is scanned in a plurality of sequences of the reference genome as a total task, and the total task is divided into a plurality of subtasks, and each subtask is a kmer prefixed or postfixed by standard PAM or non-standard PAM scanned in a sequence of the reference genome; and simulating the function of a process pool by adopting a process and queue method in the python multi-process module, and executing a plurality of subtasks in parallel.

4. The method for designing a CRISPR-inducing RNA library of claim 1, wherein in step two, a plurality of classes of kmer2 are constructed in parallel into a plurality of search trees, respectively.

5. The method of claim 4, wherein the plurality of classes of kmers 2 are respectively constructed into a plurality of search trees as a total task, and the total task is divided into a plurality of subtasks, and each subtask is a search tree into which one class of kmers 2 is constructed; and simulating the function of a process pool by adopting a process and queue method in the python multi-process module, and executing a plurality of subtasks in parallel.

6. The method for designing a CRISPR-inducing RNA library of claim 1, wherein traversing all search trees for all classes of candidate inducing RNAs in parallel and determining whether kmers whose Hamming distance is less than M exist is taken as a total task, the total task is divided into a plurality of subtasks, and each subtask is, for one class of candidate inducing RNA, traversing all search trees to determine whether kmers whose Hamming distance is less than M exist; and simulating the function of a process pool by adopting a process and queue method in the python multi-process module, and executing a plurality of subtasks in parallel.

7. The method of claim 1, wherein in step three, all the search trees are traversed for all classes of inducing RNA in parallel, and kmers with Hamming distance less than Q are searched.

8. The method for designing a CRISPR-inducing RNA library of claim 7, wherein in step three, traversing all search trees for all classes of inducing RNAs in parallel and searching for kmers whose Hamming distance is smaller than Q is taken as a total task, the total task is divided into a plurality of subtasks, and each subtask is, for one class of inducing RNA, traversing all search trees and searching for kmers whose Hamming distance is smaller than Q; and simulating the function of a process pool by adopting a process and queue method in the python multi-process module, and executing a plurality of subtasks in parallel.

9. The method for designing a CRISPR-inducing RNA library according to any of claims 3, 5, 6 or 8, wherein the process pool function is simulated by using the process method and the queue method in the multiprocess module of python, and the method for executing a plurality of subtasks in parallel is specifically as follows: