JP2003079366A

JP2003079366A - Information processing system for assisting primer walking

Info

Publication number: JP2003079366A
Application number: JP2001274420A
Authority: JP
Inventors: Tomohiro Yasuda; 知弘安田; Takahide Yokoi; 崇秀横井; Takashi Minowa; 貴司箕輪; Hiroshi Kondo; 博近藤; Takako Furuya; 崇子古谷; Emiko Yoshikawa; えみ子吉川; Taketo Komura; 雄飛小村; Tetsuo Nishikawa; 哲夫西川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2001-09-11
Filing date: 2001-09-11
Publication date: 2003-03-18

Abstract

PROBLEM TO BE SOLVED: To provide a means of performing the information processing required for base sequence determination by the primer walking method on multiple DNA samples simultaneously and at high throughput. SOLUTION: Waveform data output by a DNA sequencer for multiple DNA samples is input and base-called, and the accuracy of the trace data as a whole is evaluated. For trace data judged to be highly accurate, sequence assembly and primer design are performed automatically in one batch. For trace data not judged to be highly accurate, the reason that high-accuracy data was not obtained is estimated. The system allows data processing for multiple DNA samples to be carried out collectively and efficiently.

Description

Detailed Description of the Invention

[0001]

[Technical Field] The present invention relates to primer design processing in the primer walking method, and specifically to base calling, trace data accuracy evaluation, sequence assembly, primer design, and trace data analysis.

[0002]

[Description of the Related Art] In June 2000, an international joint project and a US venture company declared the completion of the sequencing of the human genome. With advances in sequencing technology, such as the spread of DNA sequencers using four-color fluorescent dyes and capillaries, genome sequences have been determined not only for humans but for a wide range of species, from microorganisms to mammals such as the mouse. Sequencing of gene transcripts (mRNA) is also widely performed.

[0003] The shotgun method and the primer walking method are known techniques for determining DNA base sequences. Of these, sequencing by primer walking is useful for determining sequences of several thousand bases, for example when sequencing cDNA or filling gaps in a genome sequence, and is widely used. In primer walking, sequencing proceeds step by step from both ends of the sequence. A new primer is designed and synthesized from the sequence near the 3' end of the already determined sequence, DNA amplification and extension reactions are performed, a sequence of about 500 bases is obtained with a DNA sequencer, and the determined sequence is extended. Repeating this cycle determines a base sequence of several thousand bases.

[0004] Because primer walking determines the base sequence reliably in order from the ends of the sequence, it suffers little from ambiguity caused by repetitive sequences and similar features within the sequence, and the determined sequence is more accurate than with the shotgun method used for genome sequencing. In addition, primer walking, which allows position-specific sequencing, is indispensable for filling gaps in a genome sequence.

[0005]

[Problems to be Solved by the Invention] In primer walking, the DNA amplification and extension reactions and electrophoresis on a DNA sequencer must alternate with the information processing needed for the next experimental operation. This information processing includes base calling, sequence assembly, vector sequence removal, and primer design. These tasks are carried out sequentially on the trace data obtained from each experimental operation and cannot all be done in a single batch. At the same time, computer systems that perform these tasks collectively are not in widespread use, so the work requires many people and long hours.

[0006] Moreover, automatically designing primers is not always enough to support sequencing. If the accuracy of the trace data obtained in an experiment is low overall, the experiment must be repeated, taking into account possible errors in experimental operation or equipment faults. For individual samples, it is known that when a GC cluster is present in the base sequence, the signal value of the trace data can drop sharply, making subsequent sequencing impossible. When a cluster of A (or a cluster of T) is present, background noise rises and it becomes difficult to obtain a highly accurate sequence.

[0007] The problem to be solved by the present invention is to provide a means of carrying out the information processing required for sequencing by primer walking on multiple DNA samples simultaneously and at high throughput, and, when low-accuracy trace data is obtained, to determine automatically whether the cause lies in the experiment or in features specific to the base sequence of that trace data.

[0008]

[Means for Solving the Problems] To handle the information processing that accompanies sequencing by primer walking, a system is introduced that performs the following steps.
(1) A step of inputting the trace data output by a DNA sequencer for multiple DNA samples.
(2) A base calling step that obtains a high-accuracy base sequence from the trace data. Base calling is the process of analyzing the fluorescent signal peaks recorded in the trace data to obtain a base sequence. In this specification, a high-accuracy base sequence is the longest subsequence containing no N, extracted from the sequence produced by the base caller of the present invention.
(3) A step that takes the high-accuracy base sequence lengths of the trace data as input. It evaluates the accuracy of all trace data obtained in the same experiment by checking whether the proportion of sequences whose length falls below a user-specified parameter exceeds another user-specified parameter. It then evaluates the accuracy of each individual trace by checking whether the sequence length and its deviation score each fall below separate user-specified parameters.
(4) A sequence assembly step for each DNA sample.
(5) A primer design step for each DNA sample.
(6) A step that, for each individual trace judged in step (3) not to be highly accurate (that is, of low accuracy), estimates why its accuracy was not high.

[0009]

[Embodiments of the Invention] An embodiment of the present invention is described with reference to FIG. 1. The system takes as input all of the trace data, that is, the waveform data obtained with a DNA sequencer, for each of multiple samples. Item 202 in FIG. 2 shows an example of trace data obtained with a DNA sequencer using four-color fluorescent dyes. Trace data records the signal values corresponding to each of the bases A, T, G, and C at a number of data points determined by the number of samplings taken during electrophoresis in the DNA sequencer. Hereinafter, the trace data value for base b at data point x is denoted f(b, x).

[0010] After reading the input trace data 101, the system extracts a high-accuracy base sequence from each trace in base calling step 102. It then executes trace data accuracy evaluation step 103. If step 103 finds that many traces are of low accuracy, the system concludes that there is a problem with the experimental operation and stops processing. Otherwise, it evaluates the accuracy of each individual trace. For DNA samples whose trace data is judged highly accurate, sequence assembly step 104 builds the base sequence that can be determined up to that point, and primer design step 105 designs a primer for extending the determined sequence. For DNA samples whose trace data is judged not to be highly accurate, cause estimation step 106 predicts why a high-accuracy sequence was not obtained.

[0011] The final output for each sample is either the sequence determined so far together with a new primer, or the reason the trace data was not highly accurate.

[0012] The following describes embodiments of base calling step 102, trace data accuracy evaluation step 103, sequence assembly step 104, primer design step 105, and step 106, which predicts why high-accuracy trace data was not obtained.

[0013] First, base calling step 102 is described with reference to FIG. 2. The first derivative f'(b, x) 203 and the second derivative f''(b, x) 204 of the trace data f(b, x) 202 with respect to x are computed. Then, as at 206, positions where the trace value exceeds the trace values for the other bases and the second derivative f''(b, x) 204 takes a local minimum are considered as candidate peak positions representing a base. Such positions can be found from the points where the third derivative 205 of the trace data crosses zero. The widely used base caller phred (Ewing, B. et al., Genome Research, 8:175-185, 1998) takes positions where f(b, x) is a local maximum as peak candidates. The present invention instead uses local minima of the second derivative so that it can detect not only well-formed peaks where f(b, x) has a local maximum, such as 301 in FIG. 3, but also peaks such as 302 that are buried in a neighboring peak and have no local maximum of their own.

[0014] The analysis of a peak position candidate 401 is described with reference to FIG. 4. Let B be the base b that gives max{f(b, x)} (b ∈ {A, T, G, C}), and suppose the second derivative f''(B, x) 204 takes a local minimum at x = X. At this position 401, the system checks whether the following conditions are satisfied. P1, P2, P3, and P4 are user-supplied parameters whose values depend on the required sequence accuracy.
1. The value of f''(B, X) 402 is at most the threshold P1.
2. f(B, X) 403 is at least the threshold P2.
3. f(B, X) / max{f(b, X)} (where b ∈ {A, T, G, C} − {B}) is at least the threshold P3.
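
The three conditions above can be sketched as a small check. This is only an illustration: the function name `is_base_peak`, the dictionary input, and the handling of a zero rival signal are assumptions, not part of the specification.

```python
def is_base_peak(second_deriv_at_X, signals_at_X, P1, P2, P3):
    """Return the called base at X, or "N" if any condition fails.

    second_deriv_at_X: f''(B, X) for the strongest base B (assumed precomputed).
    signals_at_X: dict mapping each base to its signal value f(b, X).
    """
    B = max(signals_at_X, key=signals_at_X.get)          # strongest base at X
    f_B = signals_at_X[B]
    rival = max(v for b, v in signals_at_X.items() if b != B)
    cond1 = second_deriv_at_X <= P1                      # condition 1
    cond2 = f_B >= P2                                    # condition 2
    # Condition 3; treating a zero rival signal as a pass is an assumption.
    cond3 = rival == 0 or f_B / rival >= P3
    return B if (cond1 and cond2 and cond3) else "N"
```

For instance, with P1=100, P2=50, P3=0.5, a position where A reads 120 and the other bases read 10 or less is called as A, while the same position with A at 30 is reported as N.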

[0015] A base is judged to be present when all of the above checks confirm that the conditions for a peak corresponding to a base are met. At peak positions that do not satisfy the conditions, N is inserted into the sequence. The bases obtained in this way are concatenated to form the base sequence. In addition, letting Y be the position of the immediately preceding base, the system checks whether the distance X − Y (404 in FIG. 4) is reasonable. With d defined as the mean peak-to-peak distance of the preceding n bases (n is a user parameter), the distance is accepted if (X − Y)/d ≤ P4. If the spacing is too wide, N is inserted. The ratio of X − Y to d is used because the peak-to-peak distance in trace data is not constant. The longest sequence containing no N is taken as the high-accuracy base sequence. In low-accuracy trace data, N is inserted frequently, so the longest N-free sequence is short. The length of the N-free sequence output by base calling step 102 can therefore be said to reflect the accuracy of the trace data. As an example of a base calling result, item 201 shows the sequence obtained with

$$f'(b, x) = \bigl(f(b, x+2) + f(b, x+1)\bigr) - \bigl(f(b, x-1) + f(b, x-2)\bigr)$$

$$f''(b, x) = f(b, x+4) + f(b, x+3) + f(b, x+2) + f(b, x-2) + f(b, x-3) + f(b, x-4) - \bigl(f(b, x-1) + f(b, x) + f(b, x+1)\bigr)$$

and P1 = 100, P2 = 50, P3 = 0.5, P4 = 1.5. Sequence 201 is the sequence still containing N, immediately before the longest N-free sequence is extracted.
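
As a rough sketch of this paragraph, the difference filters, the spacing test, and the extraction of the longest N-free run might look as follows. The function names are illustrative assumptions; each trace channel is assumed to be a plain list of numbers for one base, and the caller is assumed to keep x at least 4 points away from either end.

```python
def first_diff(f, x):
    # f'(b, x) as given in the example, for one base channel f
    return (f[x + 2] + f[x + 1]) - (f[x - 1] + f[x - 2])

def second_diff(f, x):
    # f''(b, x) as given in the example, for one base channel f
    outer = f[x + 4] + f[x + 3] + f[x + 2] + f[x - 2] + f[x - 3] + f[x - 4]
    inner = f[x - 1] + f[x] + f[x + 1]
    return outer - inner

def spacing_ok(X, Y, previous_gaps, P4):
    # d is the mean peak-to-peak distance over the preceding n bases
    d = sum(previous_gaps) / len(previous_gaps)
    return (X - Y) / d <= P4

def longest_n_free(called):
    # High-accuracy base sequence: the longest run without N
    return max(called.split("N"), key=len)
```

With P4 = 1.5 and preceding gaps of 10, 12, and 11 points, a new peak 12 points after the previous one is accepted, whereas one 30 points away would cause an N to be inserted.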

[0016] Next, accuracy evaluation step 103 is described. In primer walking, where information processing and experiments alternate, step 103 evaluates accuracy over all of the trace data obtained in the most recent experiment. The accuracy of each individual trace is likewise evaluated using all of the trace data from that same experiment. The length of the high-accuracy base sequence obtained in base calling step 102 reflects the accuracy of each trace. Step 103 first evaluates the accuracy of the most recent experiment as a whole, using user-supplied parameters P5 and P6. For each trace, it checks whether the length of the high-accuracy base sequence is below P5. If the proportion of sequences shorter than P5 exceeds P6, the system concludes that something in the experiment as a whole caused the low proportion of high-accuracy traces, notifies the user, and terminates processing.
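
A minimal sketch of this experiment-level check, assuming the high-accuracy sequence lengths are already available as a list (the function name `experiment_failed` is hypothetical):

```python
def experiment_failed(lengths, P5, P6):
    """True if the fraction of reads shorter than P5 exceeds P6."""
    if not lengths:
        raise ValueError("no trace data supplied")
    short = sum(1 for n in lengths if n < P5)
    return short / len(lengths) > P6
```

For example, with lengths 520, 480, 90, 30, and 510 and P5 = 100, two of five reads are short, so the batch is rejected for P6 = 0.3 but accepted for P6 = 0.5.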

[0017] Otherwise, the accuracy of each individual trace is evaluated. First, the mean m and standard deviation σ of the lengths of all high-accuracy base sequences obtained in the most recent experiment are computed. Then, for a trace whose high-accuracy base sequence has length L, the trace is judged highly accurate if either of the following conditions is satisfied.
1. L > P7
2. (L − m)/σ > P8
Here P7 and P8 are user-supplied parameters.
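
The per-trace decision can be sketched as below. The use of the population standard deviation and the treatment of σ = 0 are assumptions, since the text does not specify them; the function name is also illustrative.

```python
import statistics

def high_accuracy_flags(lengths, P7, P8):
    """Return one boolean per trace: True if judged highly accurate."""
    m = statistics.mean(lengths)
    sigma = statistics.pstdev(lengths)   # assumption: population std. dev.
    flags = []
    for L in lengths:
        # Condition 2 is skipped when sigma is zero (assumption).
        z_ok = sigma > 0 and (L - m) / sigma > P8
        flags.append(L > P7 or z_ok)
    return flags
```

The second condition lets a read pass when it is long relative to the rest of the batch even if it misses the absolute threshold P7.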

[0018] When a trace is judged highly accurate in accuracy evaluation step 103, the sequence fragment obtained from it by the base calling unit is assembled, in the sequence assembly step, with the sequences obtained by base calling from the trace data previously obtained for the same DNA sample, and a consensus sequence is generated. This is the partial sequence (801, 802) determined up to that point. The assembly method is described with reference to FIG. 5. Sequence fragments 501 and 502 are joined when the tail sequence 503 of fragment 501 and the head sequence 504 of fragment 502 are sufficiently similar. Likewise, sequence fragments 505 and 506 are joined when fragment 506 matches part of fragment 505 over its entire length.

[0019] Sequences are compared using a dynamic programming algorithm that can find similar sequences even in the presence of mismatches and gaps ("Bioinformatics", Chapter 2, Durbin et al., translated by Akutsu et al., Igaku Shuppan). The scores in the leftmost column and top row of the DP matrix, where the computation starts, are set to 0, as required by the overlap-match algorithm described in Chapter 2 of "Bioinformatics".
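
The following is a minimal sketch of overlap-style dynamic programming with a zero-initialized first row and column. The scoring values (match +1, mismatch −1, gap −2) and the function name are assumptions chosen for illustration; the specification does not state them.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best overlap alignment score between sequences a and b."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at 0, so leading overhangs are not penalized.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    # The alignment may end anywhere on the last row or last column,
    # so trailing overhangs are not penalized either.
    return max(max(F[n]), max(row[m] for row in F))
```

This covers both joining cases of FIG. 5: a suffix of one fragment matching a prefix of the other, and one fragment contained entirely within the other.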

[0020] Sequence comparison by dynamic programming is accurate, but its running time is proportional to the product of the lengths of the two sequences being compared, so it is known as a time-consuming algorithm. In this system, when two sequences A and B are compared, fixed-length sequences A' 601 and B' 602 are taken from the head of each sequence, and A is compared with B' 602, and B with A' 601, by dynamic programming. If B' 602 is found to be similar to the sequence starting at the p-th base of A, then A and B are compared in full by dynamic programming with the search restricted to a fixed band around the diagonal of the DP matrix that passes through the p-th base of A. Similarly, if A' 601 is found to be similar to the sequence starting at the q-th base of B, then A and B are compared in full by dynamic programming with the search restricted to a fixed band around the diagonal that passes through the q-th base 603 of B. FIG. 6 shows, for the latter case, the region of the DP matrix that needs to be searched. With this method, the computation is kept to time proportional to the sequence length.
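
The band restriction can be sketched as follows, assuming the seeding comparison has already produced the offset p at which the head of b aligns within a (0-based here). The band half-width `w`, the scoring values, and the function name are illustrative assumptions.

```python
def banded_overlap_score(a, b, p, w, match=1, mismatch=-1, gap=-2):
    """Overlap score computed only within w cells of the diagonal i - j = p."""
    NEG = float("-inf")
    prev = {}          # scores of the previous row inside the band: j -> score
    best = NEG
    for i in range(len(a) + 1):
        cur = {}
        lo = max(0, i - p - w)
        hi = min(len(b), i - p + w)
        for j in range(lo, hi + 1):
            if i == 0 or j == 0:
                cur[j] = 0                      # free leading overhang
            else:
                s = match if a[i - 1] == b[j - 1] else mismatch
                cur[j] = max(prev.get(j - 1, NEG) + s,
                             prev.get(j, NEG) + gap,
                             cur.get(j - 1, NEG) + gap)
            if i == len(a) or j == len(b):      # free trailing overhang
                best = max(best, cur[j])
        prev = cur
    return best
```

Each row touches at most 2w + 1 cells, so the work grows linearly with the sequence length rather than with the product of the two lengths.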

[0021] The sequence obtained by assembly is the partial sequence whose determination has been completed up to that point, and this sequence data is output. When sequencing by primer walking is complete, the sequence output here is the base sequence of the DNA sample that was to be sequenced.

[0022] The primer used to obtain base sequence 702, which extends the assembled sequence 701, is designed at a position 703 near the end of sequence 701 that has good properties as a primer. However, the system refers to the position information 101 of existing primers. If any existing primer lies closer to the end of the sequence than the sequence length expected from a single electrophoresis run, the primer needed for the next run is judged to exist already, and no new primer is designed. Primer design takes into account, near the end of the sequence, the factors described in "Forefront of PCR Method" (edited by Takeo Sekiya and Kaoru Fujinaga, Kyoritsu Shuppan): the possibility of dimer formation, the possibility of secondary structures such as hairpins, melting temperature, false positive priming sites, and so on. Melting temperature is calculated with the nearest neighbor method, which has high predictive accuracy (Breslauer et al., Proc. Natl. Acad. Sci. USA, vol. 38, pp. 3746-3750, 1986), using the SantaLucia parameters (SantaLucia, J. Jr., Proc. Natl. Acad. Sci. USA, vol. 95, pp. 1460-1465, 1998).
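
The rule for skipping primer design when a usable primer already exists can be sketched as follows. Positions are assumed to be 0-based start offsets on the assembled sequence, and the function name is hypothetical.

```python
def needs_new_primer(seq_len, existing_primer_starts, expected_read_len):
    """False if some existing primer already reaches the end in one run."""
    for start in existing_primer_starts:
        # Distance from this primer to the end of the assembled sequence
        if seq_len - start < expected_read_len:
            return False
    return True
```

With a 1200-base assembly and an expected read length of 500, an existing primer at position 900 is only 300 bases from the end, so no new primer is designed.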

[0023] If a false positive priming site is present in the sample DNA, the primer anneals to an unintended region of the sample DNA, which is a major source of noise in the data. To eliminate the possibility of false positive priming sites as far as possible, the system searches, at primer design time, the forward sequenced region 801, the reverse sequenced region 802, the vector site sequence, and the complementary sequences of all of these for subsequences similar to the candidate primer sequence. If a similar subsequence is found, the candidate primer sequence is rejected. This search is accelerated using a suffix trie or suffix tree (Inenaga, S. et al., Proc. 12th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science 2089, p. 169, 2001; Hass, S. et al., Nucleic Acids Research, Vol. 26, No. 12, pp. 3006-3012).
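
The sketch below illustrates the idea of this screen with a deliberately simplified stand-in: instead of a similarity search over a suffix tree, it counts exact occurrences of the primer's 3'-end k-mer in the given regions and their reverse complements, and rejects the candidate if the k-mer occurs anywhere besides its intended site. The value k = 8 and all names are assumptions for illustration only.

```python
def revcomp(seq):
    # Reverse complement of a DNA string over A, C, G, T
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_occurrences(text, pattern):
    # Count occurrences of pattern in text, including overlapping ones
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count

def is_rejected(primer, regions, k=8):
    """True if the primer's 3' k-mer occurs more than once overall."""
    tail = primer[-k:]
    hits = 0
    for region in regions:
        hits += count_occurrences(region, tail)
        hits += count_occurrences(revcomp(region), tail)
    return hits > 1    # one hit is the intended priming site itself
```

In practice the regions would be the forward region 801, the reverse region 802, and the vector sequence. A tolerance for mismatches, as implied by "similar" in the text, is not modeled here.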

[0024] The designed primer sequence is output as primer 107, to be used in the next DNA amplification and extension reactions and electrophoresis. Primer sequences are output in a tab-delimited text format that general-purpose spreadsheet software can read, which makes primer information easy to maintain and manage.
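
A tab-delimited export of this kind could look like the following sketch. The column names are hypothetical, since the specification does not list the fields.

```python
import csv
import io

def primers_to_tsv(rows):
    """Serialize (sample, direction, position, sequence) rows as TSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # Header names are illustrative assumptions.
    writer.writerow(["sample_id", "direction", "position", "primer_sequence"])
    writer.writerows(rows)
    return buf.getvalue()
```

The resulting text can be written to a file and opened directly in a spreadsheet application.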

[0025] Next, an embodiment of step 106 is described, which estimates why trace data judged not to be highly accurate in accuracy evaluation step 103 fell short. First, the system determines whether the trace data contains a region where the signal value is so low that extracting a base sequence from the waveform becomes difficult. If the data points of the trace run from 1 to N (N being the number of data points recorded in the trace data), then the mean M of the signal values of the trace data over the bases b in A, T, G, and C can be calculated by Equation 1.

[0026]

[Equation 1]

In the present invention, if within a window of width P9 (a user-supplied parameter) whose left edge is at position x the trace data for all of A, T, G, and C is at most M, the signal value is judged to be low at position x and the user is notified. To avoid falsely detecting waveform disturbance at the head of the trace data, only the range where x is at least P10 (a user parameter) is considered. Formally, a signal drop is judged to exist at position x if x ≥ P10 and f(b, y) ≤ M for every y with x ≤ y < x + P9 and every b ∈ {A, T, G, C}.

When a signal drop is found, the system searches the sequences associated with the DNA sample from which the trace data was obtained for sequence features that could be the cause. The sequences searched include the base sequence extracted from the trace data in base calling step 102 as well as other base sequences obtained from the same DNA sample. One example of a feature searched for is a GC cluster. In the present invention, a position in the searched sequence where 10 of 12 consecutive bases are G or C is detected as a GC cluster, and the user is notified that a GC cluster that may be causing the signal drop is present.
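
Both checks in this paragraph can be sketched briefly. The mean M is taken as an input because Equation 1 is not reproduced here. Data points are 0-based in this sketch, "10 of 12" is read as "at least 10", and the function names are assumptions.

```python
def find_signal_drop(trace, M, P9, P10):
    """Return the first x >= P10 where all channels stay <= M for P9 points.

    trace: dict mapping each base to its list of signal values.
    Returns None if no such window exists.
    """
    n = len(next(iter(trace.values())))
    for x in range(P10, n - P9 + 1):
        if all(trace[b][y] <= M for b in trace for y in range(x, x + P9)):
            return x
    return None

def gc_cluster_starts(seq, window=12, min_gc=10):
    """Start indices of windows in which at least min_gc bases are G or C."""
    return [i for i in range(len(seq) - window + 1)
            if sum(c in "GC" for c in seq[i:i + window]) >= min_gc]
```

A reported signal drop would then be paired with any GC cluster found in the sequences for the same sample when composing the notification to the user.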

[0027] Next, the system determines whether the background noise is high. In the present invention, among the signal values in the trace data that are unrelated to the base sequence obtained by base calling unit 102, the largest at each data point is selected, and the sum of these values is defined as the background noise. To compute it, a set B[x] is defined at each data point x. The elements of B[x] are the bases corresponding to non-noise peaks whose peak range includes data point x. FIG. 9 shows an example of B[x]. Here a peak range is taken to be a region in which the signal value increases monotonically to a local maximum and then decreases monotonically. The left part of a peak range is therefore an interval where f'(b, x) is positive, and the right part an interval where f'(b, x) is negative. Formally, B[x] is defined by Equation 2, which gives the set B[x] of bases that have a non-noise signal at data point x in the trace data.

[0028]

[Equation 2]

In Equation 2, s[i] is the i-th base of the base sequence obtained in base calling step 102, and p[i] is the position of the data point at which the i-th base s[i] was detected. Using B[x], the background noise e of the trace data is defined by Equation 3.

【００２９】[0029]

【数３】実験的操作で同時にられた複数試料由来のトレースデー
タについて、eの値を計算し、平均値m(e)、分散σ(e)を
求めておく。個々の試料のeについて偏差値(e−m(e))/
σ(e)を求め、その値がP11(P11はユーザの与えるパラメ
ータ)以上である場合、バックグラウンドノイズが高い
と判断し、ユーザに通知する。[Equation 3] For the trace data derived from a plurality of samples obtained simultaneously in one experimental operation, the value of e is calculated, and the mean m(e) and variance σ(e) are obtained. For the e of each sample, the deviation value (e−m(e))/σ(e) is computed; when this value is greater than or equal to P11 (a parameter given by the user), the background noise is judged to be high and the user is notified.
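The thresholding step above can be sketched as follows. This is an illustrative sketch only: it treats σ(e) as the standard deviation (as claim 7 words it) and uses the population standard deviation, which is an assumption, and the function name `flag_high_noise` and the dictionary input are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List

def flag_high_noise(noise_by_sample: Dict[str, float], p11: float) -> List[str]:
    """Return the samples whose deviation value (e - m(e)) / sigma(e) is >= P11."""
    values = list(noise_by_sample.values())
    m = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return []  # all samples have identical noise; nothing stands out
    return [name for name, e in noise_by_sample.items()
            if (e - m) / sigma >= p11]
```

With noise values 10, 10, 10 and 30 for four samples, the last sample has a deviation value of about 1.73, so it is flagged for P11 = 1.5 but not for P11 = 2.0.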

【００３０】バックグラウンドノイズが高いと判断され
た場合、該トレースデータが得られたDNA試料に関する
配列から、原因と考えられる配列の特徴を探索する。探
索対象となる配列には、該トレースデータからベースコ
ーリングステップ１０２で抽出された塩基配列を始め、
該DNA試料から得られたほかの塩基配列が含まれる。探
索する特徴の一例としては、Poly-A、Poly-Tが挙げられ
る。本発明では、Aが７塩基以上続く領域をPoly-A、 T
が７個以上続く領域をPoly-Tと定義している。Poly-A、
Poly-Tが存在する場合、ユーザに、バックグラウンドノ
イズを引き起こしている原因となっている可能性がある
Poly-A、 Poly-Tが存在する旨を通知する。When the background noise is judged to be high, the sequences relating to the DNA sample from which the trace data was obtained are searched for sequence features that may be the cause. The sequences to be searched include the base sequence extracted from the trace data in base calling step 102 as well as other base sequences obtained from the same DNA sample. Poly-A and Poly-T are examples of the features searched for. In the present invention, a region in which A continues for 7 or more bases is defined as Poly-A, and a region in which T continues for 7 or more bases is defined as Poly-T. When Poly-A or Poly-T is present, the user is notified that a Poly-A or Poly-T region that may be causing the background noise exists.
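The Poly-A/Poly-T search described above amounts to finding runs of 7 or more identical A or T bases. A minimal sketch, assuming the sequence is a plain string and using a regular expression (the function name `find_homopolymers` and the tuple output format are hypothetical, not taken from the patent):

```python
import re
from typing import List, Tuple

def find_homopolymers(sequence: str, min_run: int = 7) -> List[Tuple[str, int, int]]:
    """Return (kind, start, end) for each run of >= min_run A's or T's (end exclusive)."""
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_run, min_run))
    hits = []
    for m in pattern.finditer(sequence.upper()):
        kind = "Poly-A" if m.group()[0] == "A" else "Poly-T"
        hits.append((kind, m.start(), m.end()))
    return hits
```

A sequence with a run of seven A's and a run of only six T's yields a single Poly-A hit; the six-base T run falls below the 7-base threshold and is not reported.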

【００３１】[0031]

【発明の効果】本発明により、複数のDNA試料に対する
データ処理を、一括して実行することが可能になる。パ
ラメータを予め与えておけば、処理の途中で、人手が介
在しなければならないステップはなく、従来数日を要し
ていた複数DNA試料のデータ処理の時間を、30分程度に
大幅に効率化することができる。さらに、false positi
ve priming site を、フォワード側決定配列、リバース
側決定配列およびベクター配列およびこれら全ての相補
配列から探索し、false positive priming siteの可能
性を最大限に減少させることで、単純な効率化のみなら
ず、決定配列の精度向上が達成される。実験で得られた
トレースデータ全体の精度評価が自動で行なわれるた
め、人手でトレースデータの品質を検査する必要がなく
なる。また、問題のある試料に関して警告が出されるた
め、ユーザは難読クローンの存在に早い段階で気づき、
対策をとることができる。According to the present invention, data processing for a plurality of DNA samples can be executed collectively. If the parameters are given in advance, no step requires human intervention during processing, and the data processing time for multiple DNA samples, which conventionally took several days, can be greatly reduced to about 30 minutes. Furthermore, by searching for false positive priming sites in the forward-side determined sequence, the reverse-side determined sequence, the vector sequence, and the complementary sequences of all of these, and thereby minimizing the possibility of a false positive priming site, not only a simple gain in efficiency but also improved accuracy of the determined sequence is achieved. Because the accuracy of the entire trace data obtained in an experiment is evaluated automatically, the quality of the trace data need not be inspected manually. In addition, because a warning is issued for problematic samples, the user can notice the presence of hard-to-read clones at an early stage and take countermeasures.

[Brief description of drawings]

【図１】本発明のシステム構成を表す図。FIG. 1 is a diagram showing a system configuration of the present invention.

【図２】本発明における、ベースコーリング及びその過
程で計算する微分の説明図。FIG. 2 is an explanatory diagram of base calling in the present invention and the derivatives calculated in that process.

【図３】トレースデータ中シグナル値が極大値をとるピ
ークと他のピークに埋没し極大値をとらないピークの説
明図。FIG. 3 is an explanatory diagram of a peak at which the signal value in the trace data takes a local maximum and a peak that is buried in another peak and does not take a local maximum.

【図４】トレースデータ中シグナル値のピークが塩基を
表しているかを判定する評価尺度の説明図。FIG. 4 is an explanatory diagram of an evaluation scale for determining whether a peak of a signal value in trace data represents a base.

【図５】配列アセンブルユニットで、結合される配列断
片同士の関係を表す説明図。FIG. 5 is an explanatory diagram showing the relationship between the sequence fragments that are joined in the sequence assembly unit.

【図６】配列アセンブルユニットで、DP行列上の計算が
必要な領域の説明図。FIG. 6 is an explanatory diagram of the region of the DP matrix that must be computed in the sequence assembly unit.

【図７】プライマー設計位置と、配列アセンブル時の類
似配列探索位置との説明図。FIG. 7 is an explanatory diagram of a primer design position and a similar sequence search position during sequence assembly.

【図８】フォワード側配列決定済み領域、リバース側配
列決定済み領域、配列未決定領域、ベクターサイトの説
明図。FIG. 8 is an explanatory diagram of a forward side sequence-determined region, a reverse side sequence-determined region, a sequence undetermined region, and a vector site.

【図９】B[x]の定義の説明図。FIG. 9 is an explanatory diagram of the definition of B [x].

[Explanation of symbols]

１０１は、複数試料のトレースデータであり、本発明の
システムの入力である。１０２において、ベースコーリングが行われる。１０３において、トレースデータの精度評価を行う。１０４において、塩基配列断片をアセンブルする。１０５において、配列未決定部分の塩基配列を決定する
ためのプライマーを設計する。１０６において、高精度でないトレースデータについ
て、その原因を分析する。１０７は、新規に設計されたプライマーであるか、また
は、トレースデータの精度が高精度でない原因の予測結
果であり、本プログラムの出力である。１０８は、高精度トレースデータに対する処理の流れで
ある。１０９は、高精度でないトレースデータに対する処理の
流れである。２０１は、ベースコーリングで得られる塩基配列の例。２０２は、DNAシーケンサで得られたトレースデータの
例。２０３は、トレースデータの１次微分の例。２０４は、トレースデータの２次微分の例。２０５は、トレースデータの３次微分の例。２０６は、２次微分が極小値をとり、ピークの候補があ
る位置。３０１は、トレースデータが極大値をとるピーク位置。３０２は、トレースデータのピーク位置だが、他のピー
クに埋没し、極大値とならないピーク位置。４０１は、トレースデータの、ピーク位置候補を表す。４０２は、トレースデータの、２次微分の値を表す。４０３は、トレースデータの強度を表す。４０４は、隣り合った塩基に由来するピーク間の距離を
表す。５０１は、ある配列断片。５０２は、５０１と異なる配列断片。５０３は、５０２の先頭の配列に類似した配列断片５０
１の末尾の部分配列。５０４は、５０１の末尾の配列に類似した配列断片５０
２の先頭の部分配列。５０５は、ある配列断片。５０６は、５０５の部分配列に類似した、ある配列断
片。６０１は、配列Aの先頭の部分配列である配列A'。６０２は、配列Bの先頭の部分配列である配列B'。６０３は、配列Bの、q番目の塩基の位置。７０１は、ある配列断片。７０２は、７０１を鋳型配列としたプライマーを用いて
得られた配列断片。７０３は、７０２を得るために使ったプライマーの鋳型
となった位置。７０４は、７０２の先頭部分に類似している６０１の末
尾の部分配列。８０１は、フォワード側配列決定済み部位。８０２は、リバース側配列決定済み部位。８０３は、ベクターサイト。
101: trace data of a plurality of samples; the input of the system of the present invention.
102: base calling is performed.
103: the accuracy of the trace data is evaluated.
104: the base sequence fragments are assembled.
105: a primer for determining the base sequence of the undetermined portion is designed.
106: the cause is analyzed for trace data that is not highly accurate.
107: a newly designed primer, or the predicted cause of the trace data not being highly accurate; the output of this program.
108: the processing flow for highly accurate trace data.
109: the processing flow for trace data that is not highly accurate.
201: an example of a base sequence obtained by base calling.
202: an example of trace data obtained by a DNA sequencer.
203: an example of the first derivative of the trace data.
204: an example of the second derivative of the trace data.
205: an example of the third derivative of the trace data.
206: a position where the second derivative takes a local minimum and a peak candidate exists.
301: a peak position where the trace data takes a local maximum.
302: a peak position of the trace data that is buried in another peak and is not a local maximum.
401: a peak position candidate in the trace data.
402: the value of the second derivative of the trace data.
403: the intensity of the trace data.
404: the distance between peaks derived from adjacent bases.
501: a sequence fragment.
502: a sequence fragment different from 501.
503: the partial sequence at the tail of sequence fragment 501 that is similar to the leading sequence of 502.
504: the partial sequence at the head of sequence fragment 502 that is similar to the tail sequence of 501.
505: a sequence fragment.
506: a sequence fragment similar to a partial sequence of 505.
601: sequence A', the leading partial sequence of sequence A.
602: sequence B', the leading partial sequence of sequence B.
603: the position of the q-th base of sequence B.
701: a sequence fragment.
702: a sequence fragment obtained using a primer with 701 as the template sequence.
703: the position that served as the template for the primer used to obtain 702.
704: the partial sequence at the tail of 601 that is similar to the leading part of 702.
801: the forward-side sequenced region.
802: the reverse-side sequenced region.
803: the vector site.

フロントページの続き (72)発明者箕輪貴司東京都千代田区神田駿河台四丁目６番地株式会社日立製作所ライフサイエンス推進事業部内 (72)発明者近藤博東京都千代田区神田駿河台四丁目６番地株式会社日立製作所ライフサイエンス推進事業部内 (72)発明者古谷崇子東京都千代田区神田駿河台四丁目６番地株式会社日立製作所ライフサイエンス推進事業部内 (72)発明者吉川えみ子東京都千代田区神田駿河台四丁目６番地株式会社日立製作所ライフサイエンス推進事業部内 (72)発明者小村雄飛東京都千代田区神田駿河台四丁目６番地株式会社日立製作所ライフサイエンス推進事業部内 (72)発明者西川哲夫東京都国分寺市東恋ケ窪一丁目280番地株式会社日立製作所中央研究所内Ｆターム(参考） 4B024 AA11 AA20 HA08 HA19 4B063 QA13 QQ42 QQ52 QS39 5B075 ND20 UU18
Continuation of front page
(72) Inventor: Takashi Minowa, Life Science Promotion Division, Hitachi, Ltd., 4-6 Kanda-Surugadai, Chiyoda-ku, Tokyo
(72) Inventor: Hiroshi Kondo, Life Science Promotion Division, Hitachi, Ltd., 4-6 Kanda-Surugadai, Chiyoda-ku, Tokyo
(72) Inventor: Takako Furuya, Life Science Promotion Division, Hitachi, Ltd., 4-6 Kanda-Surugadai, Chiyoda-ku, Tokyo
(72) Inventor: Emiko Yoshikawa, Life Science Promotion Division, Hitachi, Ltd., 4-6 Kanda-Surugadai, Chiyoda-ku, Tokyo
(72) Inventor: Yuhi Komura, Life Science Promotion Division, Hitachi, Ltd., 4-6 Kanda-Surugadai, Chiyoda-ku, Tokyo
(72) Inventor: Tetsuo Nishikawa, Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo
F-term (reference): 4B024 AA11 AA20 HA08 HA19; 4B063 QA13 QQ42 QQ52 QS39; 5B075 ND20 UU18

Claims

[Claims]

1. An information processing system for primer walking, characterized in that: for a plurality of DNA samples, trace data, which is waveform data output by a DNA sequencer, is input; base calling, which is the process of extracting a base sequence from the trace data, is performed; the accuracy of the entire trace data obtained in the same experiment, in the course of primer walking in which information processing and experiments alternate, is evaluated, and the accuracy of each individual trace data is evaluated; for a sample whose trace data is judged to be highly accurate in the accuracy evaluation of individual trace data, sequence assembly is performed and a primer is designed for the resulting sequence; and for a sample whose trace data is judged not to be highly accurate in the accuracy evaluation of individual trace data, the reason why highly accurate trace data was not obtained is predicted.

2. An information processing system for primer walking from which the function of evaluating the accuracy of the entire trace data and of individual trace data, and the function of predicting the cause for trace data judged not to be highly accurate in the accuracy evaluation of individual trace data, are omitted.

3. The information processing system for primer walking according to claim 1, wherein, when performing base calling, a position at which the second derivative of the trace data takes a local minimum, rather than a position at which the trace data itself takes a local maximum, is used as a candidate position of a peak corresponding to a base.

4. The information processing system for primer walking according to claim 1, wherein, when evaluating the accuracy of the entire trace data obtained in the same experiment in the course of primer walking in which information processing and experiments alternate, the overall accuracy is evaluated by calculating the proportion of the high-accuracy base sequences obtained by base calling that have a short sequence length.

5. The information processing system for primer walking according to claim 1, wherein, when evaluating the accuracy of individual trace data, the accuracy of each trace data is evaluated using a value calculated from the mean and standard deviation of the sequence lengths obtained from trace data derived from a plurality of samples.

6. The information processing system for primer walking according to claim 1, which implements an assembly method wherein, in the overlap search between two base sequences in the sequence assembly process, with the sequences to be compared denoted A and B, a sequence comparison by dynamic programming is first performed between a fixed-length leading portion of base sequence fragment A and the whole of B; if no similar sequence is found, a sequence comparison by dynamic programming is performed between the whole of A and a fixed-length leading portion of base sequence fragment B; and when a region of sequence similarity is found in either of the two comparisons, a dynamic-programming sequence comparison algorithm in which the number of gaps is limited is applied based on that position, so that the overlap search is performed in a time proportional to the sum of the sequence lengths of the two sequences.

7. The information processing system for primer walking, wherein, as a method for predicting the cause for trace data judged not to be highly accurate as a result of the accuracy evaluation of individual trace data, the value of the background noise is obtained from the trace data of each of a plurality of samples, the background noise of each trace data is evaluated using a value calculated from the mean and standard deviation of those values, and when the background noise is judged to be high, the sequences obtained in the sequencing process of the DNA sample from which the trace data was obtained are searched for a characteristic sequence that causes it.

8. The information processing system for primer walking according to claim 1, wherein, when predicting the cause for trace data judged not to be highly accurate as a result of the accuracy evaluation of individual trace data, a decrease in signal is evaluated using the average value of the signal intensities in the trace data, and when a signal decrease is observed, the sequences obtained in the sequencing process of the DNA sample from which the trace data was obtained are searched for a characteristic sequence that causes the decrease.