WO2024214217A1 - Genetic analysis device and genetic analysis method

Info

Publication number: WO2024214217A1
Application number: PCT/JP2023/014893
Authority: WO
Inventors: 徹横山; 功原浦; 尚哉室岡; 基博山崎; 周志隅田
Original assignee: 株式会社日立ハイテク
Priority date: 2023-04-12
Filing date: 2023-04-12
Publication date: 2024-10-17

Abstract

This genetic analysis device comprises an acquisition unit that acquires time series data indicating the result of electrophoresis of a sample; and an analysis unit that analyzes a base sequence of the sample from the time series data. The time series data includes multiple sets of fluorescence intensity data corresponding to multiple bases. The analysis unit divides the time series data into multiple sections, generates, for each of the multiple sets of fluorescence intensity data, a feature amount indicating an emergence frequency of at least one of a local maximum portion, a local minimum portion, and a flat portion of the fluorescence intensity data in each of the sections, determines, from among the multiple feature amounts generated for the multiple sets of fluorescence intensity data, a section feature amount on the basis of the magnitude relationship of the feature amounts, and detects, by using the section feature amount, a signal region which is an analysis target region of the base sequence in the time series data.

Description

Genetic analysis device and genetic analysis method

The present invention relates to a genetic analysis device and a genetic analysis method.

There is a conventional technique for determining base sequences, described in International Publication WO 2008/050426 (Patent Document 1). This publication states that "it is possible to accurately analyze base sequences even in electrophoretic data that includes degraded portions" and "The base sequence of a nucleic acid is determined by including the following steps (A) to (C) in that order: (A) a base peak extraction step of extracting base peaks from electrophoretic data including peaks of four types of base types obtained by electrophoretic separation of a sample nucleic acid; (B) a condition setting step of setting a search start base peak and a peak interval reference value for starting a search in time series data composed of the extracted base peaks; (C) starting from the search start base peak in the time series data, sequentially scanning between adjacent base peaks in the forward and backward directions of the time series, comparing the interval between base peaks with the peak interval reference value and adding an interpolated peak to a peak missing section, thereby determining the base sequence."

Patent Document 1: International Publication No. 2008/050426

Conventional technology had issues with the accuracy of distinguishing between signal sections (signal regions) and non-signal sections (non-signal regions). In particular, the accuracy of detecting signal regions containing low-intensity signals was insufficient. For example, in a method that determines a threshold for distinguishing between signals and non-signals from the overall distribution of signal strength, the threshold becomes large due to the influence of high-intensity signals, and sections containing low-intensity signals are sometimes erroneously determined to be non-signal sections. In a method that distinguishes between signals and non-signals by finding local changes in signal strength (rising and falling edges), the distinction is difficult because low-intensity signals have relatively small changes in strength. In a method that extracts periodic components of a signal to determine a signal section, erroneous determinations occur when periodicity close to that of a signal is observed in a non-signal region.

The present invention aims to detect signal sections (signal regions) with high accuracy from time-series data showing the results of electrophoresis.

In order to achieve the above-mentioned object, a representative genetic analysis device of the present invention comprises an acquisition unit that acquires time series data indicating the results of electrophoresis of a sample, and an analysis unit that analyzes the base sequence of the sample from the time series data, wherein the time series data includes a plurality of fluorescence intensity data corresponding to a plurality of bases, and the analysis unit divides the time series data into a plurality of intervals, generates for each of the plurality of fluorescence intensity data a feature amount indicating the frequency of occurrence of at least one of a maximum portion, a minimum portion, and a flat portion of the fluorescence intensity data in each interval, determines an interval feature amount from the plurality of feature amounts generated for the plurality of fluorescence intensity data based on a magnitude relationship between the feature amounts, and uses the interval feature amount to detect a signal region in the time series data that is a region to be analyzed for the base sequence.
Moreover, a representative genetic analysis method of the present invention is characterized by comprising the steps of: acquiring time series data indicating the result of electrophoresis of a sample, the time series data including a plurality of fluorescence intensity data corresponding to a plurality of bases; dividing the time series data into a plurality of intervals; generating, for each of the plurality of fluorescence intensity data, a non-signal feature of the fluorescence intensity data in each interval based on the frequency of appearance of at least one of a maximum portion, a minimum portion, and a flat portion of the fluorescence intensity data in each interval; determining an interval feature from the plurality of features generated for the plurality of fluorescence intensity data based on a magnitude relationship between the features; and detecting a signal region, which is a region of the time series data to be analyzed for the base sequence, using the interval feature.

According to the present invention, it is possible to detect signal sections (signal regions) with high accuracy from time-series data showing the results of electrophoresis. Problems, configurations, and advantages other than those described above will become clear from the description of the embodiments below.

Example of the configuration of the genetic analysis device according to the first embodiment
Example of the configuration of the electrophoresis device according to the first embodiment
Flowchart outlining a process executed by the genetic analysis device according to the first embodiment
Flow of electrophoresis processing of real samples
Base calling flow
Signal section detection flow
Diagram explaining the characteristics of non-signal sections
Diagram explaining the characteristics of signal sections
Diagram explaining the generation of non-signal features from shape patterns
Diagram explaining the determination of the non-signal feature of a section
Diagram explaining signal boundary determination (part 1)
Diagram explaining signal boundary determination (part 2)
Diagram explaining a case where a threshold is determined based on the distribution of non-signal features of sections
Diagram explaining a case where non-signal features are used in combination with other features (part 1)
Diagram explaining a case where non-signal features are used in combination with other features (part 2)
Diagram explaining a case where non-signal features are used in combination with other features (part 3)
Diagram explaining a case where non-signal features are used in combination with other features (part 4)
Diagram explaining a case where non-signal features are used in combination with other features (part 5)
Diagram explaining correction of the analysis interval by editing the fluorescence intensity data

The following describes the embodiment with reference to the drawings.

FIG. 1 is a diagram showing an example of the configuration of a genetic analysis device 101 according to a first embodiment.
The genetic analysis device 101 includes an electrophoresis device 105 and a data analysis device 112. The electrophoresis device 105 and the data analysis device 112 are communicatively connected using a communication cable.

The data analysis device 112 includes a central control unit 102, a storage unit 104, and a user interface unit 103.
The central control unit 102 executes control of the electrophoresis device 105 and data processing. The central control unit 102 is, for example, a central processing unit (CPU) or a graphics processing unit (GPU).
The storage unit 104 stores programs executed by the central control unit 102, setting information for the electrophoresis device 105, information used for various processes, etc. The storage unit 104 is, for example, a memory.
The user interface unit 103 is an interface for connecting to an input device and an output device, or an interface for connecting to an external device via a network. The data analysis device 112 presents information to a user via the user interface unit 103, and also accepts information input by the user.

The central control unit 102 operates as a sample information setting unit 106, an electrophoresis device control unit 108, a fluorescence intensity calculation unit 110, and a base calling unit 107 by executing the programs stored in the storage unit 104. In the following explanation, when processing is described with one of these functional units as the subject, it means that the central control unit 102 is executing the corresponding program.

The sample information setting unit 106 is a setting unit for setting information related to a sample.
The electrophoresis device control unit 108 is a control unit that controls the electrophoresis of the sample performed by the electrophoresis device 105.
The fluorescence intensity calculation unit 110 is an acquisition unit that acquires time series data indicating the results of electrophoresis from the electrophoresis device 105. The time series data includes a plurality of fluorescence intensity data corresponding to a plurality of bases.
The base calling unit 107 is an analysis unit that analyzes the base sequence of a sample from time-series data. The base calling unit 107 includes an analysis interval detection unit 109.

The analysis interval detection unit 109 divides the time series data into a plurality of intervals and, for each set of fluorescence intensity data, generates a non-signal feature based on the frequency of occurrence of maximum, minimum, and flat portions of the fluorescence intensity data in each interval. The non-signal feature takes a larger value the more likely the data is to be a non-signal. Alternatively, a value that becomes larger as the occurrence frequency becomes smaller, such as the inverse of the occurrence frequency or a value obtained by subtracting the occurrence frequency from a fixed value, may be generated as a signal feature. In this case, the signal feature takes a larger value the more likely the data is to be a signal. The embodiment below uses the non-signal feature, but it is equally applicable when a signal feature is used. The analysis interval detection unit 109 determines the minimum of the multiple non-signal features generated for the multiple sets of fluorescence intensity data as the non-signal feature of that interval. If a signal feature is used, the maximum signal feature is determined as the signal feature of that interval. The signal interval of the time series data is then detected using the non-signal feature (or signal feature) of each interval. A signal interval (signal region) is an interval of the time series data that includes a change in fluorescence intensity due to the presence of bases. A non-signal interval (non-signal region) is an interval of the time series data that does not include any change in fluorescence intensity due to the presence of bases.
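As a non-limiting illustration, the relationship between the two conventions can be sketched in Python. The sketch assumes the "fixed value minus occurrence frequency" variant of the signal feature; the constant `FIXED_VALUE`, the function names, and the sample numbers are hypothetical and do not appear in the specification.

```python
FIXED_VALUE = 100.0  # hypothetical constant; the text only says "a fixed value"


def to_signal_feature(non_signal_feature, fixed_value=FIXED_VALUE):
    """Signal feature obtained by subtracting the occurrence-based value from a fixed value."""
    return fixed_value - non_signal_feature


def interval_non_signal_feature(per_dye_features):
    """Non-signal convention: the interval takes the minimum over the dyes."""
    return min(per_dye_features)


def interval_signal_feature(per_dye_features, fixed_value=FIXED_VALUE):
    """Signal convention: the interval takes the maximum over the dyes."""
    return max(to_signal_feature(f, fixed_value) for f in per_dye_features)


# Illustrative non-signal features of four dyes in one interval (made-up values).
per_dye = [42.0, 7.0, 38.0, 40.0]
ns = interval_non_signal_feature(per_dye)
s = interval_signal_feature(per_dye)
```

Under this assumption both conventions select the same dye for the interval, since the maximum of `FIXED_VALUE - f` is attained where `f` is smallest.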

The electrophoresis device 105 electrophoreses the sample (DNA fragments) and obtains electrophoresis data. The electrophoresis data is time-series data of the brightness values of DNA fragments labeled with fluorescent dyes.

The configuration of the electrophoresis device 105 will now be described. FIG. 2 is a diagram showing an example of the configuration of the electrophoresis device 105 of the first embodiment.

The electrophoresis device 105 has a detection unit 216, a thermostatic chamber 218, a transporter 225, a high-voltage power supply 204, a first ammeter 205, an anode electrode 211, a second ammeter 212, a capillary array 217, and a pump mechanism 203.

The capillary array 217 is a replaceable component that includes multiple (e.g., eight) capillaries 202, a load header 229, a detection unit 216, and a capillary head 233. If a capillary 202 is damaged or its quality deteriorates, the capillary array 217 can be replaced with a new one.

The capillary 202 is a glass tube with an inner diameter of several tens to several hundred microns and an outer diameter of several hundred microns, and its surface is coated with polyimide to improve its strength. However, in the light irradiation section where the laser light is irradiated, the polyimide coating has been removed so that light emitted inside can easily escape to the outside. The inside of the capillary 202 is filled with a separation medium that creates a difference in migration speed during electrophoresis. Separation media come in both fluid and non-fluid types; in the first embodiment, a fluid polymer is used.

The high-voltage power supply 204 applies a high voltage to the capillary 202. The first ammeter 205 detects the current emitted from the high-voltage power supply 204. The second ammeter 212 detects the current flowing through the anode electrode 211.

The optical detection unit that detects the information light obtained from the sample is composed of a light source 214 that irradiates the detection unit 216 with excitation light, an optical detector 215 for detecting the light emitted within the detection unit 216, and a diffraction grating 232. The detection unit 216 is a component that acquires information that depends on the sample.

When detecting a sample in the capillary 202 that has been separated by electrophoresis, the detection unit 216 is irradiated with excitation light from the light source 214, generating fluorescence having a wavelength that depends on the sample as information light. Furthermore, the diffraction grating 232 separates the information light in the wavelength direction, and the optical detector 215 detects the separated information light to analyze the sample.

The capillary cathode ends 227 are each fixed through a metallic hollow electrode 226, with the tip of the capillary 202 protruding from the hollow electrode 226 by approximately 0.5 mm. The hollow electrodes 226 provided on each capillary 202 are all attached together to the load header 229. Furthermore, all hollow electrodes 226 are electrically connected to the high-voltage power supply 204 mounted on the main body of the device, and function as cathode electrodes when voltage application is required for electrophoresis, sample introduction, etc.

The capillary end opposite the capillary cathode end 227 (the other end) is bound together by the capillary head 233. The capillary head 233 can be connected to the block 207 in a pressure-tight manner. A high voltage is applied between the load header 229 and the capillary head 233 from the high-voltage power supply 204. Then, new polymer is filled into the capillary 202 from the other end by the syringe 206. The polymer in the capillary 202 is refilled for each measurement to improve the measurement performance.

The pump mechanism 203 is composed of a syringe 206 and a mechanism for pressurizing the syringe 206, and injects the polymer into the capillary 202.

Block 207 is a connection part for connecting the syringe 206, the capillary array 217, the anode buffer container 210, and the polymer container 209.

The thermostatic chamber 218 is covered with a heat insulating material to keep the capillaries 202 in the thermostatic chamber 218 at a constant temperature, and the temperature is controlled by a heating and cooling mechanism 220. In addition, a fan 219 circulates and stirs the air in the thermostatic chamber 218, keeping the temperature of the capillary array 217 spatially uniform and constant.

The transporter 225 transports various containers to the capillary cathode end 227. The transporter 225 is equipped with three electric motors and linear actuators, and can move in three axial directions: up and down, left and right, and depth. At least one container can be placed on the moving stage 230 of the transporter 225. Furthermore, the moving stage 230 is equipped with an electric grip 231, which can grasp and release each container. Therefore, the buffer container 221, the cleaning container 222, the waste liquid container 223, and the sample plate 224 can be transported to the capillary cathode end 227 as necessary. Unnecessary containers are stored in a designated storage location within the electrophoresis device 105.

The user can use the data analysis device 112 to control various functions of the electrophoresis device 105 and obtain the electrophoresis data detected by the optical detection unit.

The electrophoresis device 105 may have sensors for acquiring information about the observation environment that affects electrophoresis (observation environment information). The electrophoresis device 105 in FIG. 2 has an internal sensor 240, a polymer sensor 241, and a buffer solution sensor 242.

The internal sensor 240 is a sensor for acquiring information about the internal environment of the electrophoresis device 105, and is, for example, a temperature sensor, a humidity sensor, or an air pressure sensor within the electrophoresis device 105.

The polymer sensor 241 is a sensor for acquiring information about the quality of the polymer, such as a pH sensor and an electrical conductivity sensor. In FIG. 2, the polymer sensor 241 is installed inside the polymer container 209, but the installation location is not limited to this.

The buffer solution sensor 242 is a sensor for obtaining information regarding the quality of the buffer solution, and may be, for example, a temperature sensor. In FIG. 2, the buffer solution sensor 242 is installed in the anode buffer container 210, but the installation location is not limited to this. For example, the buffer solution sensor 242 may be installed in the buffer container 221.

FIG. 3 is a flowchart outlining the processing executed by the genetic analysis device 101 of the first embodiment.

The electrophoresis device 105 of the genetic analysis device 101 performs electrophoresis processing on the sample to be analyzed (step S301). Details of the electrophoresis processing will be explained using FIG. 4.

Then, the data analysis device 112 of the genetic analysis device 101 performs spectrum correction to correct the wavelength characteristics of the device (step S302), and executes a fluorescence intensity calculation process using the electrophoresis data (step S303). Specifically, the fluorescence intensity calculation unit 110 calculates time series data of the fluorescence intensity of the fluorescent dye from the electrophoresis data, and detects the center position, height, width, etc. of the peak from the time series data of the fluorescence intensity.

Next, the data analysis device 112 of the genetic analysis device 101 executes a mobility correction process on the time series data of the fluorescence intensity (step S304).
Next, the data analysis device 112 of the genetic analysis device 101 executes base calling using the time series data of the fluorescence intensity corrected based on the result of the mobility correction process (step S305). Specifically, the base calling unit 107 identifies the base sequence of the sample using the time series data of the corrected fluorescence intensity.

FIG. 4 shows the flow of the electrophoresis processing of an actual sample in step S301. The basic steps of electrophoresis can be broadly divided into sample preparation (S401), start of analysis (S402), filling of the migration medium (S403), preliminary electrophoresis (S404), sample introduction (S405), electrophoretic analysis (S406), and end of electrophoretic analysis (S407).

The operator of this device sets the samples and reagents in this device as sample preparation (S401) before starting the analysis. More specifically, first, the buffer container 221 and the anode buffer container 210 are filled with a buffer solution that forms part of the current path. The buffer solution is, for example, an electrolyte solution commercially available from various companies for electrophoresis. The sample to be analyzed is dispensed into the wells of the sample plate 224. The sample is, for example, a PCR product of DNA. A cleaning solution for cleaning the capillary cathode end 227 is dispensed into the cleaning container 222. The cleaning solution is, for example, pure water. A migration medium for electrophoresis of the sample is injected into the syringe 206. The migration medium is, for example, a polyacrylamide separation gel or polymer commercially available from various companies for electrophoresis. The capillary array 217 is replaced if degradation of the capillary 202 is expected or if the length of the capillary 202 is to be changed.

The samples set on the sample plate 224 at this time include the actual DNA sample to be analyzed, as well as a positive control, a negative control, and an allelic ladder, each of which is electrophoresed in a different capillary. The positive control is, for example, a PCR product containing known DNA, and is a sample used in a control experiment to confirm that DNA has been correctly amplified by PCR. The negative control is a PCR product that does not contain DNA, and is a sample used in a control experiment to confirm that the PCR amplified product has not been contaminated by the operator's DNA, dust, etc.

An allelic ladder is an artificial sample that contains many alleles that may commonly be contained in a DNA marker, and is usually provided by reagent manufacturers as part of a reagent kit for DNA identification. Allelic ladders are used to fine-tune the correspondence between the DNA fragment length of each DNA marker and the allele.

In addition, all of the above samples, including the actual sample, positive control, negative control, and allelic ladder, are mixed with known DNA fragments labeled with specific fluorescent dyes, called size standards. The type of fluorescent dye assigned to the size standard varies depending on the reagent kit used.

The operator specifies the type of allelic ladder, the type of size standard, the type of fluorescent reagent, and the type of sample set in the wells on the sample plate 224 corresponding to each capillary. In this embodiment, the type of sample specified is any one of real sample, positive control, negative control, and allelic ladder. This information is set in the sample information setting unit 106 on the data analysis device 112 via the user interface unit 103.

After completing the above sample preparation (S401), the operator operates the user interface unit 103 on the data analysis device 112 to instruct the start of analysis. This instruction to start analysis is passed to the electrophoresis device control unit 108. The electrophoresis device control unit 108 sends an analysis start signal to the electrophoresis device 105, thereby starting the analysis (S402).

Next, the electrophoresis device 105 starts filling the migration medium (S403). This step may be performed automatically after the start of the analysis, or may be performed sequentially by sending a control signal from the electrophoresis device control unit 108. Filling the migration medium is a procedure in which new migration medium is filled into the capillary 202 to form a migration path.

In filling the migration medium in this embodiment (S403), first, the waste liquid container 223 is transported directly below the load header 229 by the transporter 225, and the solenoid valve 213 is closed so that the used migration medium discharged from the capillary cathode end 227 can be received. Then, the syringe 206 is driven to fill the capillary 202 with new migration medium, and the used migration medium is discarded. Finally, the capillary cathode end 227 is immersed in a cleaning solution in the cleaning container 222, and the capillary cathode end 227 contaminated by the migration medium is cleaned.

Next, preliminary electrophoresis (S404) is performed. This step may be performed automatically or sequentially by sending a control signal from the electrophoresis device control unit 108. Preliminary electrophoresis is a procedure in which a predetermined voltage is applied to the migration medium to bring it into a state suitable for electrophoresis. In the preliminary electrophoresis (S404) in this embodiment, first, the capillary cathode end 227 is immersed in the buffer solution in the buffer container 221 by the transporter 225 to form a current path. Then, a voltage of several to several tens of kilovolts is applied to the migration medium by the high-voltage power supply 204 for several to several tens of minutes to bring the migration medium into a state suitable for electrophoresis. Finally, the capillary cathode end 227 is immersed in the cleaning solution in the cleaning container 222 to clean the capillary cathode end 227 contaminated by the buffer solution.

Next, sample introduction (S405) is performed. This step may be performed automatically or sequentially by sending a control signal from the electrophoresis device control unit 108. In sample introduction (S405), sample components are introduced into the migration path. In sample introduction (S405) in this embodiment, first, the capillary cathode end 227 is immersed in the sample held in the well of the sample plate 224 by the transporter 225, and then the solenoid valve 213 is opened. This forms a current path, and the sample components are ready to be introduced into the migration path. Then, a pulse voltage is applied to the current path by the high-voltage power supply 204, and the sample components are introduced into the migration path. Finally, the capillary cathode end 227 is immersed in a cleaning solution in the cleaning container 222, and the capillary cathode end 227 contaminated by the sample is cleaned.

Next, electrophoretic analysis (S406) is performed. This step may be performed automatically or sequentially by sending a control signal from the electrophoresis device control unit 108. In electrophoretic analysis (S406), each sample component contained in the sample is separated and analyzed by electrophoresis. In the electrophoretic analysis (S406) in this embodiment, first, the capillary cathode end 227 is immersed in the buffer solution in the buffer container 221 by the transporter 225 to form a current path. Next, a high voltage of about 15 kV is applied to the current path by the high-voltage power supply 204 to generate an electric field in the migration path. Due to the generated electric field, each sample component in the migration path moves toward the detection unit 216 at a speed that depends on the properties of that sample component. In other words, the sample components are separated by the difference in their moving speeds. Then, the sample components that reach the detection unit 216 are detected in order. For example, when a sample contains many DNAs with different base lengths, the migration speed differs depending on the base length, and the DNAs reach the detection unit 216 in order starting from the shortest base length. A fluorescent dye that depends on the terminal base sequence is attached to each DNA. When the detection unit 216 is irradiated with excitation light from the light source 214, information light, that is, fluorescence having a wavelength that depends on the sample, is generated from the sample and released to the outside. This information light is detected by the optical detector 215. During the electrophoretic analysis, the optical detector 215 detects this information light at regular time intervals and transmits image data to the data analysis device 112.
Alternatively, in order to reduce the amount of information to be transmitted, the luminance of only a part of the image data may be transmitted instead of the whole image data. For example, luminance values sampled only at wavelength positions at regular intervals may be transmitted for each capillary. This luminance value data represents the spectral waveform of each capillary. This spectral waveform is stored in the storage unit 104.
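As a non-limiting illustration, the reduction of the transmitted data can be sketched in Python. The sketch assumes that the image data has already been arranged as one row of luminance values per capillary along the wavelength direction; this layout, the function names, the sampling step, and the sample values are hypothetical.

```python
def sample_spectrum(luminance_row, step):
    """Keep every `step`-th luminance value along the wavelength axis."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return luminance_row[::step]


def spectral_waveforms(image, step):
    """Return one sampled spectral waveform per capillary.

    `image` is assumed to be a list of rows, one row per capillary.
    """
    return [sample_spectrum(row, step) for row in image]


# Two capillaries, eight wavelength positions each (made-up luminance values).
image = [
    [10, 12, 15, 30, 80, 35, 14, 11],
    [9, 11, 40, 90, 42, 13, 10, 9],
]
waveforms = spectral_waveforms(image, step=2)
```

With `step=2` half of the luminance values are transmitted per capillary. A coarser step reduces the data volume further at the cost of spectral resolution.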

Finally, when the planned image data has been acquired, the voltage application is stopped and the electrophoretic analysis is terminated (S407). The above is an example of the electrophoretic process (S301) in FIG. 4.

FIG. 5 shows the flow of base calling in S305.
First, the analysis interval detection unit 109 of the base calling unit 107 detects a signal interval from the time-series data of the corrected fluorescence intensity (step S501).
The base calling unit 107 analyzes the detected signal section and identifies the base sequence of the sample (step S502).

FIG. 6 shows the flow of the signal section detection in S501. The signal section detection in S501 includes steps S601 to S604.
Step S601: The analysis interval detection unit 109 divides the entire time series data into a plurality of small intervals. Then, the process proceeds to step S602.
Step S602: The analysis interval detection unit 109 selects one of the small intervals and generates a non-signal feature for each signal included in that interval. The signals are the four sets of fluorescence intensity data corresponding to the four bases. The analysis interval detection unit 109 thus generates a non-signal feature for each of the four sets of fluorescence intensity data in the selected small interval. Then, the process proceeds to step S603.

Step S603: The analysis interval detection unit 109 determines the non-signal feature of the selected small interval. Specifically, the analysis interval detection unit 109 sets the smallest of the four non-signal features calculated from the four sets of fluorescence intensity data as the non-signal feature of the small interval. After step S603, if small intervals remain for which a non-signal feature has not been determined, the process returns to step S602. Once non-signal features have been determined for all small intervals, the process proceeds to step S604. Note that, as described above, when signal features are used instead of non-signal features, the largest signal feature is set as the signal feature of the small interval, and processing is otherwise performed in the same manner.

Step S604: The analysis interval detection unit 109 uses the non-signal features determined for each small interval to determine the boundary between the non-signal section and the signal section, and ends the process.
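As a non-limiting illustration, the flow of steps S601 to S604 can be sketched in Python. The feature calculation is passed in as a function; `count_flat` is only a simplified stand-in. The boundary rule in `signal_mask` (a fixed threshold on the interval feature) is an assumption made for this sketch, since the steps above do not specify how the boundary is determined. All function names, the interval length, and the threshold are hypothetical.

```python
def split_into_intervals(n_points, interval_len):
    """S601: return (start, end) index pairs covering the whole series."""
    return [(s, min(s + interval_len, n_points))
            for s in range(0, n_points, interval_len)]


def interval_features(traces, interval_len, feature_fn):
    """S602-S603: per interval, take the minimum feature over all dye traces."""
    n_points = len(traces[0])
    result = []
    for start, end in split_into_intervals(n_points, interval_len):
        per_dye = [feature_fn(trace[start:end]) for trace in traces]  # S602
        result.append(min(per_dye))                                   # S603
    return result


def signal_mask(features, threshold):
    """S604 (assumed rule): an interval counts as signal if its feature is below threshold."""
    return [f < threshold for f in features]


def count_flat(segment, h1=0.5):
    """Stand-in feature: number of adjacent pairs whose difference is within +/-h1."""
    return sum(1 for a, b in zip(segment, segment[1:]) if abs(b - a) <= h1)


# Four dye traces of eight points each (made-up values); only Dye1 changes in the second half.
traces = [
    [0, 0, 0, 0, 1, 5, 9, 4],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
features = interval_features(traces, interval_len=4, feature_fn=count_flat)
mask = signal_mask(features, threshold=2)
```

In this example the second interval is treated as signal because one dye trace has no flat portions there, even though the other three traces remain flat.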

Here, the non-signal features are explained. FIG. 7 is a diagram explaining the characteristics of non-signal sections. FIG. 8 is a diagram explaining the characteristics of signal sections. In FIGS. 7 and 8, Dye1 to Dye4 indicate the four fluorescent dyes corresponding to the four bases, the horizontal axis is time, and the vertical axis is fluorescence intensity.

Comparing the fluorescence intensity data in FIGS. 7 and 8, the non-signal sections contain many small peaks, dips, and flat portions, whereas the signal sections contain few. In particular, because the fluorescence intensity changes gradually in the signal sections, flat portions are significantly rarer there.

Accordingly, the analysis interval detection unit 109 generates non-signal features based on the number of occurrences of the three shape patterns (maximum, minimum, flat) in the fluorescence intensity data. The generation of non-signal features from the shape patterns corresponds to step S602.

FIG. 9 is an explanatory diagram of the generation of non-signal features from the shape patterns.
The shape pattern "flat" is defined as the case where the intensity difference between adjacent points is within ±h1, that is, where the following formula is satisfied. Here, y[k] is the fluorescence intensity at the k-th point.
-h1 ≤ y[k+1] - y[k] ≤ h1
Note that the points referred to here are individual sample values of the electrophoretic signal, and are determined by the time interval or sampling rate at which the optical detector 215 acquires data. This time interval is determined in advance by the user or as a default value of the device.
The shape pattern "maximum" is defined as the case where the following formula is satisfied.
y[k] - y[k-1] > h2 && y[k] - y[k+1] > h2
The shape pattern "minimum" is defined as the case where the following formula is satisfied.
y[k-1] - y[k] > h3 && y[k+1] - y[k] > h3
The above h1, h2, and h3 may be values that are determined in advance according to the sampling rate and the electrophoretic voltage.
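
As a non-limiting illustration, the three formulas above can be sketched in Python as follows. The function name count_shape_patterns and the dictionary keys are hypothetical and do not appear in the embodiment. The sketch also assumes that "flat" is evaluated for every adjacent pair of points and that "maximum" and "minimum" are evaluated only at interior points.

```python
from typing import Dict, Sequence


def count_shape_patterns(
    y: Sequence[float], h1: float, h2: float, h3: float
) -> Dict[str, int]:
    """Count occurrences of the flat, maximum, and minimum shape patterns.

    y is the fluorescence intensity data of one dye in one section.
    h1, h2, and h3 are the thresholds used in the formulas of the text.
    """
    counts = {"flat": 0, "maximum": 0, "minimum": 0}
    n = len(y)

    # flat: -h1 <= y[k+1] - y[k] <= h1 (assumed to apply to every adjacent pair)
    for k in range(n - 1):
        if -h1 <= y[k + 1] - y[k] <= h1:
            counts["flat"] += 1

    # maximum / minimum: both neighbours differ from y[k] by more than h2 / h3
    for k in range(1, n - 1):
        if y[k] - y[k - 1] > h2 and y[k] - y[k + 1] > h2:
            counts["maximum"] += 1
        elif y[k - 1] - y[k] > h3 and y[k + 1] - y[k] > h3:
            counts["minimum"] += 1

    return counts
```

For example, for the data [0, 0, 0, 5, 0, 0, -5, 0] with h1 = 0.5 and h2 = h3 = 1, this sketch counts three flat patterns, one maximum, and one minimum.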

The analysis section detection unit 109 uses the number of times the three patterns appear in a section as the non-signal feature of the fluorescence intensity data in that section. Alternatively, the count may be normalized by the section length so that the frequency of appearance of the three patterns is used as the non-signal feature.

Furthermore, of the three patterns, the "flat" shape pattern is unlikely to appear in signal sections and is therefore highly important as a feature of non-signal sections. Therefore, the "flat" shape pattern may be weighted more heavily than the other shape patterns to generate non-signal features.
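
As a non-limiting illustration, the counting, normalization, and weighting described above can be sketched in Python as follows. The function name non_signal_feature, the weight values, and the use of a weighted sum are assumptions made only for this sketch. The embodiment states only that the "flat" pattern may be weighted more heavily than the other patterns.

```python
from typing import Mapping


def non_signal_feature(
    counts: Mapping[str, int],
    section_length: int,
    w_flat: float = 2.0,   # assumed value; only w_flat > w_max, w_min is from the text
    w_max: float = 1.0,    # assumed value
    w_min: float = 1.0,    # assumed value
    normalize: bool = True,
) -> float:
    """Combine shape-pattern counts of one dye in one section into one feature."""
    weighted = (
        w_flat * counts["flat"]
        + w_max * counts["maximum"]
        + w_min * counts["minimum"]
    )
    if not normalize:
        return float(weighted)
    if section_length <= 0:
        raise ValueError("section_length must be positive")
    # Normalizing by the section length turns the count into a frequency.
    return weighted / section_length
```

With counts of three flat patterns, one maximum, and one minimum in a section of eight points, the sketch yields (2 × 3 + 1 + 1) / 8 = 1.0.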

FIG. 10 is an explanatory diagram of the determination of the non-signal feature of a section in S603. In FIG. 10, the minimum of the non-signal features of the fluorescence intensity data within the section is set as the non-signal feature of that section. The graphs shown in FIG. 10 are the fluorescence intensity data of Dye1 to Dye4, and F(Dye1) to F(Dye4) are the non-signal features generated from the fluorescence intensity data of Dye1 to Dye4, respectively.

The analysis section detection unit 109 finds the non-signal feature Fq for section q using Min(F(Dye1), F(Dye2), F(Dye3), F(Dye4)). In FIG. 10, the fluorescence intensity data for Dye1 has the gentlest slope, so F(Dye1) is smaller than F(Dye2) to F(Dye4). Therefore, Fq = F(Dye1).
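
As a non-limiting illustration, the selection Fq = Min(F(Dye1), F(Dye2), F(Dye3), F(Dye4)) can be sketched in Python as follows. The function name section_feature and the flag use_signal_features are hypothetical. The flag covers the variant noted for step S603, in which signal features are used and the largest value is selected instead.

```python
from typing import Mapping, Tuple


def section_feature(
    per_dye: Mapping[str, float], use_signal_features: bool = False
) -> Tuple[str, float]:
    """Pick the representative feature of one section from the per-dye features.

    With non-signal features the smallest value is chosen; with signal
    features the largest value is chosen.
    """
    pick = max if use_signal_features else min
    dye = pick(per_dye, key=lambda name: per_dye[name])
    return dye, per_dye[dye]
```

For example, for F(Dye1) = 0.2, F(Dye2) = 0.9, F(Dye3) = 0.8, and F(Dye4) = 0.7, the sketch returns Dye1 and 0.2, which corresponds to the case Fq = F(Dye1) described for FIG. 10.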

FIGS. 11 and 12 are explanatory diagrams of the determination of the signal boundary in S604. As shown in FIG. 11, the analysis section detection unit 109 plots the non-signal features of each section and performs smoothing and interpolation. This suppresses the effects of small fluctuations in the non-signal features. The analysis section detection unit 109 determines the time at which the smoothed and interpolated non-signal features exceed a threshold value to be the boundary of the signal section.

As shown in FIG. 12, the analysis section detection unit 109 may instead determine the boundary only when the threshold value is exceeded continuously for at least a certain margin. This makes the determination robust against fluctuations near the boundary.
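
As a non-limiting illustration, the smoothing and the margin-based boundary determination can be sketched in Python as follows. The function names moving_average and sustained_runs are hypothetical. The sketch assumes a centered moving average as the smoothing method and omits interpolation; the embodiment does not specify either method. Each returned run is a candidate non-signal section, and the edge of the run that adjoins a signal section is the boundary.

```python
from typing import List, Sequence, Tuple


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Smooth per-section features with a centered moving average (assumed method)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def sustained_runs(
    values: Sequence[float], threshold: float, margin: int
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs that exceed
    the threshold for at least `margin` consecutive sections."""
    runs = []
    start = None
    for i, v in enumerate(values):
        if v > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= margin:
                runs.append((start, i))
            start = None
    if start is not None and len(values) - start >= margin:
        runs.append((start, len(values)))
    return runs
```

For example, for the features [0.9, 0.9, 0.9, 0.1, 0.8, 0.1, 0.1, 0.9, 0.9, 0.9] with a threshold of 0.5 and a margin of 2, the isolated value 0.8 is ignored and the sketch returns the runs (0, 3) and (7, 10), so that sections 3 to 6 remain as the signal section.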

FIG. 13 is an explanatory diagram of a case where the threshold is determined from the distribution of the non-signal features of the sections. The distribution of the non-signal features is assumed to be bimodal: the features are low in the signal portion and high in the non-signal portion. The analysis section detection unit 109 can determine the threshold based on this distribution. For example, X% of the peak Fp of the higher (non-signal) mode can be set as the threshold. Alternatively, a value at which the slope of that mode becomes relatively flat can be set as the threshold.

However, depending on the user's settings, the acquired electrophoretic data may not include any non-signal section, so a predetermined fixed value may always be used as the threshold.
Alternatively, it may first be determined whether the distribution of the feature amount is bimodal, and then, depending on the result, the threshold may be determined dynamically from the distribution or a fixed value may be used.
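
As a non-limiting illustration, the choice between a distribution-based threshold and a fixed threshold can be sketched in Python as follows. The function name threshold_from_distribution, the histogram with a fixed number of bins, and the crude bimodality check that counts local maxima of the histogram are all assumptions made only for this sketch. The sketch also assumes that Fp is the feature value at the peak of the higher mode.

```python
from typing import Sequence


def threshold_from_distribution(
    features: Sequence[float],
    x_percent: float,
    fixed_threshold: float,
    bins: int = 10,
) -> float:
    """Return X% of Fp if the histogram looks bimodal, else the fixed threshold."""
    if not features:
        return fixed_threshold
    lo, hi = min(features), max(features)
    if hi == lo:
        return fixed_threshold  # no spread, so no second mode

    width = (hi - lo) / bins
    hist = [0] * bins
    for f in features:
        hist[min(int((f - lo) / width), bins - 1)] += 1

    # Crude mode detection (assumption): a bin is a peak if it is higher than
    # its left neighbour and not lower than its right neighbour.
    peaks = [
        i
        for i in range(bins)
        if hist[i] > 0
        and (i == 0 or hist[i] > hist[i - 1])
        and (i == bins - 1 or hist[i] >= hist[i + 1])
    ]
    if len(peaks) < 2:
        return fixed_threshold  # not bimodal: fall back to the fixed value

    fp = lo + (peaks[-1] + 0.5) * width  # bin center of the higher mode
    return fp * x_percent / 100.0
```

A histogram of real data may contain spurious local maxima, so a practical implementation would likely smooth the histogram or use a more robust bimodality test before relying on the detected modes.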

FIGS. 14 to 18 are explanatory diagrams for cases in which non-signal features are used in combination with other features.
In FIG. 14, the threshold value is changed according to the signal level. That is, the threshold value is determined according to the following (1) and (2).
(1) If the signal strength is above a certain level, it is always determined to be a signal section.
(2) If the signal strength is below a certain level, the threshold is lowered according to the signal strength.
Here, in the range (2), the lower the signal strength, the easier it is to determine that it is a non-signal section.
Alternatively, a differential signal may be used. By lowering the threshold as the difference increases, the rising and falling edges of the signal section can be detected.
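
As a non-limiting illustration, rules (1) and (2) above can be sketched in Python as follows. The function name is_signal_section, the parameter names, and the linear lowering of the threshold are assumptions made only for this sketch. The embodiment states only that the threshold is lowered according to the signal strength.

```python
def is_signal_section(
    non_signal_feature: float,
    signal_strength: float,
    base_threshold: float,
    strength_level: float,
    min_threshold: float = 0.0,
) -> bool:
    """Decide whether a section is a signal section using a level-dependent threshold."""
    if strength_level <= 0:
        raise ValueError("strength_level must be positive")

    # (1) At or above the level, the section is always treated as signal.
    if signal_strength >= strength_level:
        return True

    # (2) Below the level, the threshold is lowered as the strength decreases
    #     (linear scaling is an assumption), so that a weak section is more
    #     easily judged to be a non-signal section.
    ratio = max(signal_strength, 0.0) / strength_level
    threshold = min_threshold + (base_threshold - min_threshold) * ratio
    return non_signal_feature <= threshold
```

For example, with a base threshold of 2.0 and a strength level of 50, a section with signal strength 25 is compared against a lowered threshold of 1.0, so a non-signal feature of 1.5 is judged to be non-signal while a non-signal feature of 0.5 is judged to be signal.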

In FIG. 15, a feature vector including the non-signal feature and other features is given as input to a signal section discriminator 121, and a signal section is obtained as output. Any other feature, such as the signal intensity or a differential signal, can be used.
As the signal section discriminator 121, any model such as a Deep Neural Network (DNN), a Support-Vector Machine (SVM), or a Random Forest can be used.
The output may be a discrimination result indicating whether the section is a signal section or a non-signal section, or may be the probability (likelihood) that the section is a signal section, or the like.

FIG. 16 shows ensemble learning that combines multiple discriminators. In the example shown in FIG. 16, a first feature vector is provided as input to the first signal section discriminator 122, and a second feature vector is provided as input to the second signal section discriminator 123. The outputs of the signal section discriminators 122 and 123 are then input to the discriminator 124, which produces the final output.

The first feature vector includes, for example, a non-signal feature and a signal intensity, and the second feature vector includes, for example, a differential signal.
The discriminator 124 determines the output by majority vote or the like. Other methods such as bagging, boosting, and stacking may also be used.
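
As a non-limiting illustration, the majority vote can be sketched in Python as follows. The function names majority_vote and ensemble_predict and the tie-breaking parameter tie_is_signal are hypothetical. Each discriminator is represented here by an arbitrary callable that returns True for a signal section; the embodiment does not specify how ties are resolved.

```python
from typing import Callable, Sequence


def majority_vote(votes: Sequence[bool], tie_is_signal: bool = False) -> bool:
    """Return True (signal section) if more discriminators vote True than False."""
    yes = sum(1 for v in votes if v)
    no = len(votes) - yes
    if yes == no:
        return tie_is_signal  # tie-breaking rule is an assumption
    return yes > no


def ensemble_predict(
    discriminators: Sequence[Callable[[Sequence[float]], bool]],
    feature_vectors: Sequence[Sequence[float]],
    tie_is_signal: bool = False,
) -> bool:
    """Feed the i-th feature vector to the i-th discriminator and combine the votes."""
    votes = [d(x) for d, x in zip(discriminators, feature_vectors)]
    return majority_vote(votes, tie_is_signal)
```

With only two discriminators, as in the example of FIG. 16, a simple majority vote can tie, which is why the sketch exposes the tie-breaking rule as an explicit parameter.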

FIG. 17 is an explanatory diagram of the configuration for learning on the device. The signal section information storage unit 125 is a memory unit provided in the genetic analysis device. Each time the user performs a measurement, the signal section information storage unit 125 stores the signal section information (a label indicating whether or not the section is a signal section) in association with the feature vector. The analysis section detection unit 109 can read the feature vectors and labels from the signal section information storage unit 125 and provide them to a signal section discriminator to perform supervised learning.

FIG. 18 is an explanatory diagram of learning that reflects the results of user adjustments. When the user adjusts the analysis section and performs reanalysis, the operation and the results are stored in a specified information storage unit and reflected in the next learning. This improves the accuracy of signal section detection. Although FIG. 18 shows an example in which the user operates the boundary between the signal section and the non-signal section, the results of adjustments to other parameters can also be used. For example, if the user adjusts non-signal feature parameters (conditions for flat sections, conditions for maximum and minimum points), threshold settings, and parameters related to signal boundary determination, the results of that operation can be stored and learned.

The base calling unit 107 in the genetic analysis device 101 may also reanalyze fluorescence intensity data other than the fluorescence intensity data that the fluorescence intensity calculation unit 110 generates from the electrophoresis results. In this case, the fluorescence intensity data may, for example, be stored in the storage unit 104 or be transmitted through a communication cable. The user can then adjust the analysis interval by editing the fluorescence intensity data. FIG. 19 is an explanatory diagram of a case where the analysis interval is corrected by editing the fluorescence intensity data. The fluorescence intensity data in the upper part of the figure is corrected, as shown in the lower part of the figure, so that the data near the beginning includes many flat portions. This increases the non-signal feature amount and moves the signal start position. Note that if the fluorescence intensity near the beginning were simply set to zero, the fluorescence intensity would change significantly, and the base call result near the beginning could differ from the result before the correction. For this reason, in the lower part of the figure, flat portions are added while the signal intensity is kept within a certain range (the gray range in the lower part of the figure) of the signal intensity before the correction, so that only the signal start position changes. Maximum and minimum points may be added in addition to flat portions. Such editing of the fluorescence intensity data may be performed using an external tool. Alternatively, the genetic analysis device 101 may have a function for editing such fluorescence intensity data.
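
As a non-limiting illustration, one way of adding flat portions while staying within a certain range of the original intensity can be sketched in Python as follows. The function name flatten_within_band, the parameter band, and the choice of the mean of the leading points as the target level are assumptions made only for this sketch. The embodiment does not specify how the editing is performed.

```python
from typing import List, Sequence


def flatten_within_band(
    y: Sequence[float], n_points: int, band: float
) -> List[float]:
    """Flatten the first n_points of y toward a common level, while keeping
    every edited value within +/- band of its original value."""
    edited = list(y)
    n = min(max(n_points, 0), len(edited))
    if n == 0:
        return edited

    target = sum(edited[:n]) / n  # assumed target level: mean of the leading points
    for k in range(n):
        lower = y[k] - band
        upper = y[k] + band
        # Move toward the target but never leave the allowed band.
        edited[k] = min(max(target, lower), upper)
    return edited
```

For example, for the data [10, 12, 8, 30] with n_points = 3, a band of 5 flattens the first three points to 10, whereas a band of 1 only moves them to 10, 11, and 9, so a narrower band limits how many flat portions can be introduced.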

As described above, the disclosed genetic analysis device 101 includes a fluorescence intensity calculation unit 110 as an acquisition unit that acquires time-series data showing the results of electrophoresis of a sample, and a base calling unit 107 as an analysis unit that analyzes the base sequence of the sample from the time-series data. The time-series data includes a plurality of fluorescence intensity data corresponding to a plurality of bases, and the analysis unit divides the time-series data into a plurality of intervals, generates for each of the plurality of fluorescence intensity data a feature amount indicating the frequency of occurrence of at least one of a maximum portion, a minimum portion, and a flat portion of the fluorescence intensity data in each interval, determines an interval feature amount from the plurality of feature amounts generated for the plurality of fluorescence intensity data based on the magnitude relationship of the feature amounts, and detects a signal region, which is an analysis target region of the base sequence in the time-series data, using the interval feature amount.
According to this configuration, a signal section (signal region) can be detected with high accuracy from time-series data indicating the results of electrophoresis.
For example, a signal section including a low-intensity signal can be detected with high accuracy. In addition, a signal section can be detected with high accuracy even when the signal intensity varies widely. Specifically, the detection accuracy of a signal section is improved for data including an unintended, suddenly high-intensity signal (dye blob) caused by sample pretreatment, data including a high-intensity signal that appears at the end of a PCR reaction, and data in which the signal intensity is attenuated due to a special sample or pretreatment.

Moreover, the analysis unit determines, as the section feature amount, the feature amount that is the smallest among the feature amounts of the plurality of pieces of fluorescence intensity data.
In this way, the signal section can be detected by identifying the non-signal section using a shape pattern that appears characteristically not in the signal section but in the non-signal section.
Furthermore, due to the characteristics of electrophoresis, only one base will exist at the same position among multiple fluorescence intensity data, so by selecting the feature that is most likely to be a signal from the multiple fluorescence intensity data as the representative feature of the section, the characteristics of electrophoresis can be utilized to detect the signal section with high accuracy.

The analysis unit may compare the feature amount of the section with a threshold value to determine whether the section is a non-signal section. The threshold value may be a predetermined fixed value or a value calculated from the distribution of the feature amount of the section.
In this configuration, the signal section can be detected with high accuracy by taking into account the distribution of non-signal features.

In addition, when a certain number of consecutive sections have a feature amount greater than the threshold value, the analysis unit determines that the boundary between the consecutive sections and the signal area adjacent to the consecutive sections is the boundary between the signal area and non-signal area.
This configuration can suppress the effects of fluctuations near the boundaries of the signal sections.

The analysis unit detects the signal region using a discrimination model in which a feature amount of the section is one of inputs.
In this configuration, signal sections can be flexibly detected from non-signal features and other features.

Furthermore, the analysis unit generates feature quantities of the fluorescence intensity data by using a weight that is greater for flat portions of the fluorescence intensity data than for maximum and minimum portions of the fluorescence intensity data.
In this way, by placing emphasis on the characteristic shape pattern of the non-signal section, the signal section can be detected with high accuracy.

The analysis unit can detect a second signal section, different from the signal section detected from the fluorescence intensity data, by using second fluorescence intensity data obtained by editing the fluorescence intensity data, where the second fluorescence intensity data deviates from the intensity of the fluorescence intensity data by an amount within a certain range.
By detecting the second signal interval using such second fluorescence intensity data, it is possible to change the analysis interval while minimizing the effect on the analysis results.

The present invention is not limited to the above-mentioned embodiment, and includes various modifications. For example, the above-mentioned embodiment is described in detail in order to explain the present invention in an easily understandable manner, and the present invention is not necessarily limited to an embodiment having all of the described configurations. Moreover, the configurations may not only be deleted but may also be replaced or added.
For example, in the above embodiment, a configuration was exemplified in which a device that learns from data and a device that updates the signal section discriminator are integrated, but the learning and updating of the signal section discriminator may be performed by separate devices.

Reference Signs List 101: Genetic analysis device, 102: Central control unit, 103: User interface unit, 104: Memory unit, 105: Electrophoresis device, 106: Sample information setting unit, 107: Base calling unit, 108: Electrophoresis device control unit, 109: Analysis section detection unit, 110: Fluorescence intensity calculation unit, 112: Data analysis device, 121 to 123: Signal section discriminator, 124: Discriminator, 125: Signal section information storage unit

Claims

A genetic analysis device comprising:
an acquisition unit that acquires time-series data indicating the results of electrophoresis of a sample; and
an analysis unit that analyzes the base sequence of the sample from the time-series data,
wherein the time-series data includes a plurality of fluorescence intensity data corresponding to a plurality of bases, and
the analysis unit:
divides the time-series data into a plurality of sections;
generates, for each of the plurality of fluorescence intensity data, a feature amount indicating an occurrence frequency of at least one of a maximum portion, a minimum portion, and a flat portion of the fluorescence intensity data in each section;
determines a section feature amount from among the plurality of feature amounts generated for the plurality of fluorescence intensity data, based on a magnitude relationship between the feature amounts; and
detects a signal region, which is a region to be analyzed of the base sequence in the time-series data, by using the section feature amount.
The genetic analysis device according to claim 1,
wherein the analysis unit determines, as the section feature amount, the feature amount that is the smallest among the plurality of feature amounts.
The genetic analysis device according to claim 1,
wherein the analysis unit compares the section feature amount with a threshold value to determine whether the section is a signal region, and
the threshold value is a predetermined fixed value or a value obtained from a distribution of the section feature amount.
The genetic analysis device according to claim 3,
wherein, when a certain number of consecutive sections have section feature amounts greater than the threshold value, the analysis unit determines that the boundary between the consecutive sections and a signal region adjacent to the consecutive sections is the boundary between the signal region and a non-signal region.
The genetic analysis device according to claim 1,
wherein the analysis unit detects the signal region using a discrimination model in which the section feature amount is one of the inputs.
The genetic analysis device according to claim 1,
wherein the analysis unit generates the feature amount by using a weight that is greater for flat portions of the fluorescence intensity data than for maximum and minimum portions of the fluorescence intensity data.
The genetic analysis device according to claim 1,
wherein the analysis unit detects a second signal section, different from the signal section detected from the fluorescence intensity data, by using second fluorescence intensity data obtained by editing the fluorescence intensity data, and
the second fluorescence intensity data deviates from the intensity of the fluorescence intensity data by an amount within a certain range.
A genetic analysis method comprising the steps of:
acquiring time-series data indicating the result of electrophoresis of a sample, the time-series data including a plurality of fluorescence intensity data corresponding to a plurality of bases;
dividing the time-series data into a plurality of sections;
generating, for each of the plurality of fluorescence intensity data, a feature amount indicating a degree of non-signal of the fluorescence intensity data in each section, based on an occurrence frequency of at least one of a maximum portion, a minimum portion, and a flat portion of the fluorescence intensity data in each section;
determining a section feature amount from among the plurality of feature amounts generated for the plurality of fluorescence intensity data, based on a magnitude relationship between the feature amounts; and
detecting a signal region, which is a region to be analyzed of the base sequence in the time-series data, by using the section feature amount.